Saturday, February 27, 2010

Best practice for open data, a reading list

It can be hard to convince data-sharers that data being "freely available" on a web-page isn't the end of the argument about whether that data can be reused. Here's a collection of postings I've found helpful in understanding the issues.
For comparison purposes, it is worth looking at the history of software repositories. For example, Debian has 20,000+ packages within it (depending on how you count), covering every kind of software under the sun (and beyond: stargazers should check out the "stellarium" package). A typical package will depend on a dozen or so other packages, which in turn depend on others. It is a massive work of aggregation. There are huge technical challenges, but underlying the solution is the Debian social contract and their Free Software Guidelines. Here are the guidelines in full:

  1. Free Redistribution

    The license of a Debian component may not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources. The license may not require a royalty or other fee for such sale.

  2. Source Code

    The program must include source code, and must allow distribution in source code as well as compiled form.

  3. Derived Works

    The license must allow modifications and derived works, and must allow them to be distributed under the same terms as the license of the original software.

  4. Integrity of The Author's Source Code

    The license may restrict source-code from being distributed in modified form _only_ if the license allows the distribution of patch files with the source code for the purpose of modifying the program at build time. The license must explicitly permit distribution of software built from modified source code. The license may require derived works to carry a different name or version number from the original software. (This is a compromise. The Debian group encourages all authors not to restrict any files, source or binary, from being modified.)

  5. No Discrimination Against Persons or Groups

    The license must not discriminate against any person or group of persons.

  6. No Discrimination Against Fields of Endeavor

    The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.

  7. Distribution of License

    The rights attached to the program must apply to all to whom the program is redistributed without the need for execution of an additional license by those parties.

  8. License Must Not Be Specific to Debian

    The rights attached to the program must not depend on the program's being part of a Debian system. If the program is extracted from Debian and used or distributed without Debian but otherwise within the terms of the program's license, all parties to whom the program is redistributed should have the same rights as those that are granted in conjunction with the Debian system.

  9. License Must Not Contaminate Other Software

    The license must not place restrictions on other software that is distributed along with the licensed software. For example, the license must not insist that all other programs distributed on the same medium must be free software.

  10. Example Licenses

    The GPL, BSD, and Artistic licenses are examples of licenses that we consider free.

There's a lot going on here, and not all of it applies to data. But the desired outcome could perhaps be summarized as:
  • We want stuff that can be collected, mixed, and redistributed to others, who can in turn do the same.
  • We only want stuff that expressly permits such activity, and where the owners don't have any requirements that make such activity excessively complicated.
While data and software are different, projects like the DCP (and many others) would like to see this same outcome achieved for data. A challenge is that so much data out there is "freely available" on a random website but neither expressly permits aggregation nor states any requirements on allowed use. And yet there are almost certainly implicit requirements (no commercial use! no use by organizations my members disapprove of!). These may or may not have legal force, but a well-behaved aggregator certainly ought to respect them, even if the only way to do so without splitting up the "data commons" too much is to omit the data entirely.
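
To make the "omit it if it doesn't expressly permit reuse" rule concrete, here is a minimal sketch in Python of how an aggregator might apply it. Everything in it is an assumption made for illustration: the `Dataset` record, the `partition_by_license` helper, and the particular allow-list of license identifiers are hypothetical, and are not taken from the DCP or any real aggregator.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical allow-list: license identifiers this sketch treats as
# expressly permitting collection, mixing, and redistribution.
# A real project would choose and maintain its own list.
PERMITTED_LICENSES = {"CC0-1.0", "PDDL-1.0", "ODC-By-1.0"}


@dataclass
class Dataset:
    name: str
    license: Optional[str] = None  # None means the source states no license


def partition_by_license(
    datasets: List[Dataset],
) -> Tuple[List[Dataset], List[Dataset]]:
    """Split datasets into (included, omitted).

    A dataset is included only if it carries a license on the allow-list.
    Data with no stated license, or with terms we do not recognize, is
    omitted rather than guessed at.
    """
    included, omitted = [], []
    for dataset in datasets:
        if dataset.license in PERMITTED_LICENSES:
            included.append(dataset)
        else:
            omitted.append(dataset)
    return included, omitted


if __name__ == "__main__":
    candidates = [
        Dataset("river-gauges", "CC0-1.0"),
        Dataset("scraped-table"),  # "freely available", but no license stated
        Dataset("club-survey", "non-commercial use only"),
    ]
    included, omitted = partition_by_license(candidates)
    print("included:", [d.name for d in included])
    print("omitted:", [d.name for d in omitted])
```

The design choice worth noticing is that the default is to omit: a missing license is treated the same as an unrecognized one, which mirrors the "we only want stuff that expressly permits such activity" outcome above.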
