- All saved versions: http://download.wikimedia.org/enwiki/
- Latest files: http://download.wikimedia.org/enwiki/latest/
- The main file is "enwiki-latest-pages-articles.xml.bz2"
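Because the main dump is a multi-gigabyte bz2-compressed XML file, the practical way to read it is streaming. Here's a minimal Python sketch of that approach; it writes a tiny stand-in dump (the real `enwiki-latest-pages-articles.xml.bz2` also carries XML namespaces and much more per-page metadata, which this toy file omits):

```python
# Sketch: stream page titles out of a bz2-compressed XML dump without
# decompressing the whole file. The tiny dump written below is a stand-in
# for enwiki-latest-pages-articles.xml.bz2, which is many gigabytes.
import bz2
import xml.etree.ElementTree as ET

SAMPLE = b"""<mediawiki>
  <page><title>Alpha</title><text>First article.</text></page>
  <page><title>Beta</title><text>Second article.</text></page>
</mediawiki>"""

with open("mini-dump.xml.bz2", "wb") as f:
    f.write(bz2.compress(SAMPLE))

titles = []
with bz2.open("mini-dump.xml.bz2", "rb") as f:
    # iterparse reads incrementally, so memory use stays flat
    # no matter how large the dump is
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == "page":
            titles.append(elem.findtext("title"))
            elem.clear()  # free the parsed subtree as we go

print(titles)  # ['Alpha', 'Beta']
```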
The DMOZ open directory (like Yahoo's directory, but volunteer-created and under a free license) is downloadable in RDF format at http://rdf.dmoz.org/.
Of course, there's lots more data out there, but this does give a sense of one way in which "Big Data" may be distributed. What I like:
- It is really easy to get the free data, just like it is easy to get free software.
- The data is in a good format to use, just like free software source code.
- Rights to the data are granted in a clear and free license, just like free software.
What I don't like:
- There's no equivalent, for data, of software "patches". Let me explain. If you improve a piece of code someone else wrote, you can automatically generate the "difference" between the original and your revised version and send that difference (called a "patch") to the original author, who can evaluate it and, if they like it, merge it automatically into their code (even if they've made their own non-overlapping changes in the meantime). That's patching in software. Now, what happens if you improve pages you downloaded from Wikipedia? As far as I can tell, you go to the site and re-type your changes by hand - there's no way to submit something like a patch. And without a patching mechanism, there's no basis for distributed development of the data, the way there is with free software.
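To make the software side of that comparison concrete, here's a small sketch of generating a patch with Python's standard-library `difflib`. The article text is invented for illustration; the point is just that the diff is produced mechanically from the two versions:

```python
# Sketch of the software workflow described above: automatically generate
# a "patch" (a unified diff) between an original text and a revised copy.
# This is what tools like diff/patch do for source code; there is no
# comparable channel for submitting changes to a downloaded Wikipedia dump.
import difflib

original = [
    "Mercury is the closest planet to the Sun.\n",
    "It has no moons.\n",
]
revised = [
    "Mercury is the smallest planet and the closest to the Sun.\n",
    "It has no moons.\n",
]

patch = list(difflib.unified_diff(
    original, revised,
    fromfile="article.orig", tofile="article.mine"))
print("".join(patch))
```

The output is an ordinary unified diff: lines removed are prefixed with `-`, lines added with `+`, and unchanged context lines with a space, which is exactly the format the original author could inspect and merge.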
Update: Nat Torkington has a post called Truly Open Data asking similar questions.
Update (2): I've been doing work on storing our own data in fossil, a distributed version control system. The trick, as I see it, is to bridge the gap between git/bzr/hg/fossil/... and programs like Excel that non-programmers keep their data in.
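One way to bridge that gap is to flatten spreadsheet-style data into a canonical, line-oriented form before checking it in, so that the version control system's ordinary diff and merge machinery applies. A minimal sketch, with an invented table (the field names and sort rule are illustrative, not any particular tool's convention):

```python
# A minimal sketch of the "bridge" idea: serialize spreadsheet-style
# records as CSV with a fixed column order and sorted rows, so that a
# line-oriented VCS (fossil, git, hg, ...) produces stable, mergeable
# diffs. The table below is invented for illustration.
import csv
import io

rows = [
    {"country": "Norway", "capital": "Oslo"},
    {"country": "Chile", "capital": "Santiago"},
]

def to_canonical_csv(records):
    """Serialize records deterministically: alphabetical field order,
    rows sorted by their field values, one record per line."""
    fields = sorted(records[0])
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for rec in sorted(records, key=lambda r: [r[f] for f in fields]):
        writer.writerow(rec)
    return buf.getvalue()

print(to_canonical_csv(rows))
```

The determinism is the point: if two people edit the same table in Excel, export through a canonicalizer like this, and commit, their changes show up as small line-level diffs that the VCS can merge, instead of an opaque binary blob.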
1 comment:
Take a look at Freebase (www.freebase.com). It's not patching, per se, but provides a platform for distributed development on a centralized data environment.