Thursday, April 23, 2009

Big Data

Out of curiosity, I've been looking at how some of the "big" sources of open data out there distribute their data. Wikipedia is perhaps the most famous. All the data on Wikipedia and related sites is available for bulk download. For example, the English section of Wikipedia is available here:
In other words, with a click or two you can end up with an XML file holding the basic content of all English Wikipedia pages. There are other XML and SQL files for other bits and pieces.

The DMOZ open directory (like Yahoo's directory, but volunteer created and under a free license) is downloadable in RDF format at http://rdf.dmoz.org/.

Of course, there's lots more data out there, but this does give a sense of one way in which "Big Data" may be distributed. What I like:
  • It is really easy to get the free data, just like it is easy to get free software.
  • The data is in a good format to use, just like free software source code.
  • Rights to the data are granted in a clear and free license, just like free software.
What I don't like:
  • There's no equivalent of "patches" in software. Let me explain. If you improve a piece of code someone else wrote, you can automatically generate the "difference" between the original and your revised version and send that difference (called a "patch") to the original author. The author can then evaluate it and, if they like it, merge it automatically with their code (even if they've made their own non-overlapping changes in the meantime). That's patching in software. Now what happens if you improve pages you downloaded from Wikipedia? I guess you go to the site and type your changes in by hand; I see no way to submit something like a patch. And without a patching mechanism, there's no basis for distributed development of the data, as happens with free software.
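For readers who haven't seen one, here is a rough sketch of what generating a "difference" looks like, using Python's standard difflib module. The two versions of the text are invented for illustration and don't come from any real page.

```python
import difflib

# Two versions of the same text, as lists of lines (invented sample content).
original = [
    "Example City is a town in Example County.\n",
    "It has about 2 million residents.\n",
]
revised = [
    "Example City is a town in Example County.\n",
    "It has about 2.1 million residents.\n",
]

# unified_diff yields the lines of a patch in the familiar "unified" format.
patch_lines = list(
    difflib.unified_diff(original, revised, fromfile="original", tofile="revised")
)
print("".join(patch_lines))
```

The printed patch marks the removed line with a leading "-" and the added line with a leading "+". Tools such as the Unix patch command can apply output in this format to the original author's copy.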
There are distributed databases out there. CouchDB is interesting, for example. But it would also be interesting to have a procedure for patching and merging data that operated on an external representation rather than on live databases.
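To make the idea concrete, here is a minimal sketch of what patching and merging an external representation might look like. It assumes the data is a table of rows identified by a primary key; the names make_patch and apply_patch and the sample rows are hypothetical, not taken from any existing tool.

```python
from typing import Dict, Optional, Tuple

Row = Dict[str, str]
Table = Dict[str, Row]  # primary key -> row
# A patch records, per changed key, the row as it was and as it should become.
Patch = Dict[str, Tuple[Optional[Row], Optional[Row]]]


def make_patch(base: Table, revised: Table) -> Patch:
    """Record every row that was added, removed, or changed."""
    patch: Patch = {}
    for key in base.keys() | revised.keys():
        old, new = base.get(key), revised.get(key)
        if old != new:
            patch[key] = (old, new)
    return patch


def apply_patch(target: Table, patch: Patch) -> Table:
    """Apply a patch to a table that may have changed since the patch was made."""
    result = dict(target)
    for key, (old, new) in patch.items():
        current = target.get(key)
        if current == new:
            continue  # the same change was already made on the other side
        if current != old:
            raise ValueError(f"conflict on row {key!r}")
        if new is None:
            result.pop(key, None)
        else:
            result[key] = new
    return result


# Invented sample data: I edit row "1" while the original author adds row "3".
base = {"1": {"name": "alpha", "size": "10"}, "2": {"name": "beta", "size": "20"}}
mine = {"1": {"name": "alpha", "size": "11"}, "2": {"name": "beta", "size": "20"}}
theirs = dict(base, **{"3": {"name": "gamma", "size": "30"}})

patch = make_patch(base, mine)
merged = apply_patch(theirs, patch)
print(merged)
```

Because the two sets of changes touch different rows, the patch applies cleanly to the author's newer copy. If both sides had changed the same row in different ways, apply_patch would raise an error rather than guess, which is roughly how source code merges treat conflicts.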

Update: Nat Torkington has a post called Truly Open Data asking similar questions.

Update (2): I've been working on storing our own data in Fossil, a distributed version control system. The trick, as I see it, is to bridge the gap between git/bzr/hg/fossil/... and programs like Excel, where non-programmers keep their data.

1 comment:

Unknown said...

Take a look at Freebase (www.freebase.com). It's not patching, per se, but provides a platform for distributed development on a centralized data environment.