Data Commoners: April 2009

Thursday, April 30, 2009

Merging and maintaining bibliographic databases

Marcive is a company that offers "bibliographic services" such as deduplication of records after multiple libraries place their records into a shared catalog, reclassification from Dewey Decimal to Library of Congress classification, enrichment of brief records, etc. Their database cleanup brochure makes fascinating reading (well, if you've been spending way too much time thinking about pooling disparate databases).

There is a definite parallel between maintaining and merging library catalogs, and what the data commons aims to do. Wish we had an equivalent of the Library of Congress...

All fields become optional, all relationships many-to-many

Some wisdom on database app aging:

All Fields Become Optional: As your dataset grows, exceptions creep in. There’s not enough research time to fill in all your company profiles, there’s one guy in Guam when you expected everyone to be in a U.S. state, there’s data missing from the page you’re scraping, you have to pull updates from a new source...
All Relationships Become Many-to-Many: Some guy works in DC but lives in Virginia, so he needs two Locations. A new type of incoming email needs to be shoveled out to different feeds. A state has both a primary and a caucus. Someone eventually realizes categories never really were mutually exclusive...
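To make that concrete for ourselves, here is a minimal sketch of a record type built with both rules in mind from the start. The class and field names (`Person`, `Location`, `categories`) and the sample people are my own invention for illustration, not anything from the quoted post or from our actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Location:
    # Every field is optional: not everyone turns out to be in a U.S. state.
    city: Optional[str] = None
    state: Optional[str] = None


@dataclass
class Person:
    name: Optional[str] = None
    # A list rather than a single value: works in DC, lives in Virginia.
    locations: List[Location] = field(default_factory=list)
    # A set rather than a single value: categories are not mutually exclusive.
    categories: Set[str] = field(default_factory=set)


# Hypothetical sample records.
commuter = Person(
    name="Example Commuter",
    locations=[Location(city="Washington", state="DC"), Location(state="VA")],
    categories={"staff", "volunteer"},
)
islander = Person(locations=[Location(city="Hagåtña")])  # Guam: no state at all
```

Nothing here forces a record to be complete, so the checking has to happen elsewhere (reports of missing fields, say) rather than in the schema itself.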

Important to remember as we try to build a data commons that will last a long, long time.

Use Case: Aggravation

Working on a project today, I came across the type of headache that I want the Data Commons to solve.

Two different ways of writing the company name
Two different ways of writing the same address, and a third address of unknown reliability
A misspelling of the town name
The perennial East Coast problem of having to tell Excel that zip codes starting in "zero" need to be treated as text
No entries in the third column, "Title" -- and the probability that any listed titles could already be out of date
The sense of futility that fixing these problems once will not mean they are fixed forever
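A few of these headaches can at least be dulled with some normalization before comparing records. Below is a rough sketch, assuming the data arrives as CSV; the sample rows, the suffix list, and the abbreviation table are all made up for illustration. It reads ZIP codes as text, restores a dropped leading zero, and builds a comparison key from the name, address, and ZIP. It does nothing about the misspelled town, which would need a reference list of place names or some fuzzy matching.

```python
import csv
import io
import re

# Hypothetical sample rows; the names and addresses are invented.
RAW = """name,address,town,zip,title
"Acme Co-op, Inc.",12 Main Street,Greenfield,01301,
ACME COOP INC,12 Main St.,Greenfeild,1301,
"""

SUFFIXES = {"inc", "llc", "corp"}  # assumed list, far from complete
STREET_ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd"}


def normalize_name(name):
    # Lowercase, join hyphenated words, strip punctuation, drop legal suffixes.
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower().replace("-", ""))
    return " ".join(w for w in cleaned.split() if w not in SUFFIXES)


def normalize_address(address):
    # Lowercase, strip punctuation, and abbreviate common street words.
    words = re.sub(r"[^a-z0-9 ]", "", address.lower()).split()
    return " ".join(STREET_ABBREVIATIONS.get(w, w) for w in words)


def normalize_zip(zip_code):
    # Keep ZIP codes as text; restore a leading zero lost in a spreadsheet.
    digits = zip_code.strip()
    return digits.zfill(5) if digits.isdigit() else digits


def match_key(row):
    # The town is deliberately left out of the key because it may be misspelled.
    return (
        normalize_name(row["name"]),
        normalize_address(row["address"]),
        normalize_zip(row["zip"]),
    )


# csv.DictReader returns every value as a string, so "01301" stays intact.
rows = list(csv.DictReader(io.StringIO(RAW)))
keys = {match_key(row) for row in rows}
print(len(rows), "rows,", len(keys), "distinct key(s)")
```

Of course, this only fixes the problem for one afternoon, which is exactly the futility I mean.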

It's comforting, in a way, to know that I, at least, need the Data Commons to exist to make my life easier.

Open Database License, new draft out

The Open Data Commons (no relation) has a new draft of their Open Database License out (v1.0 release candidate). From its preamble:

The Open Database Licence (ODbL) is a licence agreement intended to allow users to freely share, modify, and use this Database while maintaining this same freedom for others. Many databases are covered by copyright, and therefore this document licenses these rights. Some jurisdictions, mainly in the European Union, have specific rights that cover databases, and so the ODbL addresses these rights, too. Finally, the ODbL is also an agreement in contract for users of this Database to act in certain ways in return for accessing this Database.

The license does a good job of using copyright to maintain freedoms, as free software licenses do. It does not address "privacy rights / data protection rights over information in the contents." I wonder whether such rights could be used, like copyright, as a means to maintain data freedom? In other words, as part of the privacy / data protection terms, agreeing to maintain freedom would be a requirement.

Monday, April 27, 2009

A Shared Directory of Local Food

While browsing the Food Routes website, which is dedicated to promoting local food buying, I ran across this description of their database:

In collaboration with eatwellguide.org, FoodRoutes brings you an online map that can help you find locally-produced food near you. This map combines multiple directories from organizations around the nation into one powerful database. In the directory, you'll find descriptions, phone numbers, addresses, web sites, crop lists, and directions all to make local food purchasing that much easier.

I would love to know how they combine all those directories and keep them up to date, and whether the Data Commons Project can help them do that better.

Thursday, April 23, 2009

An Example of a Shared Repository

I just ran across the Fedora Commons, "the home of the unique Fedora open source software, a robust integrated repository-centered platform that enables the storage, access and management of virtually any kind of digital content." The Commons is a non-profit organization "whose purpose is to provide sustainable open-source technologies to help individuals and organizations create, manage, publish, share, and preserve digital content upon which we form our intellectual, scientific, and cultural heritage." It continues the mission of the Fedora Project, which evolved from the Flexible Extensible Digital Object Repository Architecture (Fedora) developed by researchers at Cornell Computing and Information Science. It looks like what they do for libraries, etc. is very similar to what we want to do with directories. Paul has some specific ideas about why the way in which they do it is not the way in which we want to do it, but I'll let him explain.

I'm interested in their funding sources. Two years ago, they got a 4-year, $4.9 million grant from the Gordon and Betty Moore Foundation. "The Gordon and Betty Moore Foundation, established in 2000, seeks to advance environmental conservation and cutting-edge scientific research around the world and improve the quality of life in the San Francisco Bay Area. The Foundation’s Science Program seeks to make a significant impact on the development of provocative, transformative scientific research, and increase knowledge in emerging fields." Could we learn something from Fedora's application for the grant?

Big Data

Out of curiosity, I've been looking at how some of the "big" sources of open data out there distribute their data. Wikipedia is perhaps the most famous. All the data on Wikipedia and related sites is available for bulk download. For example, the English section of Wikipedia is available here:

All saved versions: http://download.wikimedia.org/enwiki/
Latest files: http://download.wikimedia.org/enwiki/latest/
The main file is "enwiki-latest-pages-articles.xml.bz2"

In other words, with a click or two you can end up with an XML file holding the basic content of all English Wikipedia pages. There are other XML and SQL files for other bits and pieces.
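The file is large, so you would not want to load it into memory in one go. Here is a minimal sketch of reading page titles from a compressed dump as a stream, using only the Python standard library. The tiny sample document and its namespace URL are stand-ins I made up so the example runs without downloading anything; the real dump has more elements per page and its own namespace.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    # Elements in a namespaced document have tags like "{namespace}title".
    return tag.rsplit("}", 1)[-1]


def iter_titles(stream):
    """Yield page titles from a MediaWiki-style XML export, page by page."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) == "page":
            for child in elem:
                if local_name(child.tag) == "title":
                    yield child.text
            elem.clear()  # discard the finished page's content


# Hypothetical stand-in for the real dump (the namespace URL is invented).
SAMPLE = b"""<mediawiki xmlns="http://example.org/export">
  <page><title>Cooperative</title><revision><text>stub</text></revision></page>
  <page><title>Credit union</title><revision><text>stub</text></revision></page>
</mediawiki>"""

compressed = io.BytesIO(bz2.compress(SAMPLE))
with bz2.open(compressed, "rb") as stream:
    titles = list(iter_titles(stream))
print(titles)
```

For the real thing, the same function should work on `bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb")`, assuming the file has been downloaded to the current directory.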

The DMOZ open directory (like Yahoo's directory, but volunteer created and under a free license) is downloadable in RDF format at http://rdf.dmoz.org/.

Of course, there's lots more data out there, but this does give a sense of one way in which "Big Data" may be distributed. What I like:

It is really easy to get the free data, just like it is easy to get free software.
The data is in a good format to use, just like free software source code.
Rights to the data are granted in a clear and free license, just like free software.

What I don't like:

There's no equivalent of "patches" in software. Let me explain. If you improve a piece of code someone else wrote, you can automatically generate the "difference" between the original and your revised version and send that difference (called a "patch") to the original author, who can then evaluate it and, if they like it, merge it automatically with their code (even if they've made their own non-overlapping changes in the meantime). That's patching in software. Now what happens if you improve pages you downloaded from Wikipedia? I guess you go to the site and try typing them in - I see no way to submit something like a patch. And without a patching mechanism, there's no basis for distributed development of the data, as happens with free software.

There are distributed databases out there. CouchDB is interesting, for example. But it would also be interesting to have a procedure for patching and merging data that operated on an external representation rather than on live databases.
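To show the kind of thing I have in mind, here is a toy sketch of patching for data rather than code. It assumes each table is a dictionary of records keyed by an ID, with each record a dictionary of fields, and it treats `None` as "field absent". The function names and sample records are hypothetical; this is not how any existing tool works, just the idea in miniature.

```python
import copy


def diff_records(base, revised):
    """Return a patch: a list of (record_id, field, old_value, new_value)."""
    patch = []
    for record_id in sorted(set(base) | set(revised)):
        old = base.get(record_id, {})
        new = revised.get(record_id, {})
        for field in sorted(set(old) | set(new)):
            if old.get(field) != new.get(field):
                patch.append((record_id, field, old.get(field), new.get(field)))
    return patch


def apply_patch(target, patch):
    """Apply a patch to a copy of target; return (result, conflicts)."""
    result = copy.deepcopy(target)
    conflicts = []
    for record_id, field, old_value, new_value in patch:
        record = result.setdefault(record_id, {})
        current = record.get(field)
        if current == old_value:
            # The target still holds the value the patch was made against.
            if new_value is None:
                record.pop(field, None)
            else:
                record[field] = new_value
        elif current != new_value:
            # The target changed this field in a different way: report it.
            conflicts.append((record_id, field, current, new_value))
    return result, conflicts


# Hypothetical records: I fix a typo while upstream updates the phone number.
base = {"coop-1": {"name": "Acme Coop", "town": "Greenfeild", "phone": "555-0100"}}
mine = {"coop-1": {"name": "Acme Coop", "town": "Greenfield", "phone": "555-0100"}}
upstream = {"coop-1": {"name": "Acme Coop", "town": "Greenfeild", "phone": "555-0199"}}

patch = diff_records(base, mine)
merged, conflicts = apply_patch(upstream, patch)
print(patch)
print(merged, conflicts)
```

Because the patch records the old value along with the new one, non-overlapping changes merge cleanly and overlapping ones are reported as conflicts instead of silently overwriting someone else's work.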

Update: Nat Torkington has a post called Truly Open Data asking similar questions.

Update (2): I've been doing work on storing our own data in fossil, a distributed version control system. The trick, as I see it, is to bridge the gap between git/bzr/hg/fossil/... and programs like Excel that non-programmers keep their data in.

Wednesday, April 22, 2009

Products and Services

In a previous post, I mentioned that we are looking for "early adopters" to provide feedback on our proposed products and services. Here is a quick description of what the Data Commons Project (DCP) is and what these products and services are:
The DCP has grown out of previous projects, including the Grassroots Economic Organizing "Economy of Hope" directory and the Regional Index of Cooperation (www.find.coop). The idea initially was simply to build an open, shared, comprehensive and accurate catalogue of the (small-c) cooperative economy -- including coops and credit unions, but also land trusts, local currencies, employee-owned companies, community-supported agriculture, and so on. This effort is linking up with similar efforts worldwide of cataloguing the cooperative/solidarity/social economy. (There are different terms, none of which is completely satisfactory -- we've been playing with using the term "rooted" economy. More on that later.)
What we would like to build is:
1. a big repository of data: a database with names of companies, organizations, and individuals, contact info, descriptions of their products/services, etc.
2. a website that is one of many clients* of the big repository, that would display the info, permit searches, display results on an interactive map, maybe provide more value-added reports for a fee, etc.
3. a set of tools & protocols that allow fast, efficient, relatively easy merging and cleaning of data to and from the repository.

* Other clients would be members of the Data Commons Cooperative, and as such, subscribers to periodic updates from the repository, as well as suppliers of their own updates back to the repository (somewhat in the style of the AP news story cooperative, where members both contribute and use stories, and also non-members can sign up to just be users for a fee).

The idea here is that if an organization changes its contact info, say, or has an announcement, then it would be great if that change were picked up in one place and broadcast out to all the places that have an interest in it, instead of piecemeal as each place goes through and updates their database, or that organization having to somehow contact everyone to tell them about their changes.

For example, a new worker coop formed in Western Mass. would be of interest to at least the Valley Alliance of Worker Coops, the ECWD, the USFWC, NCBA, NASCO, CFNE, SEN, GEO, MASSEIO, any industry or sector-specific networks that it might belong to, etc. etc., not to mention potential clients or suppliers. So, without getting too long-winded about all the possibilities, that's what we'd like to create, in a way that actually creates value for our users and would be financially sustainable.

Early Adopters

We're reaching out to organizations to become "early adopters" of our services (along with the organizations already on board: CDI, USFWC, NASCO, and GEO). This would mean investing about 5 hours/month, as an organization, for a year. Different people could take turns talking with us. Mainly what we would want is feedback, information, ideas, and concerns on:

What you are doing now with information that could be made easier if the maintenance were shared with other organizations
What technology you are using, and how best to interface with it to make your lives easier and save you time, money and aggravation
What you think of the ideas we have on what we could do and how to do it
New ideas about data use and presentation that would enable novel business opportunities, and so on

Since we are gearing up to become a consumer-owned coop, this early "sweat equity" investment would count for most or all of the equity investment to join as a member of the coop. And the better the feedback we get, the better we can make our products and services, and the more valuable membership becomes.

Monday, April 13, 2009

Open Everything NYC

Open Everything NYC is coming up on April 18, 2009. From http://openeverything.net:

Open Everything is a global conversation about the art, science and spirit of 'open'. It gathers people using openness to create and improve software, education, media, philanthropy, architecture, neighbourhoods, workplaces and the society we live in: everything. It's about thinking, doing and being open.

From johndbritton:

Open Everything NYC will take place on Saturday 18 April 2009 at the UNICEF headquarters in the United Nations Plaza, NYC. The event will run the full day, registration will open at 8:00AM and things will be in full swing by 9:00AM.

The event will be 100% free and open to the public on a first come first serve basis, online pre-registration is required. The main hall can hold up to 250 guests.

The event will consist of two keynote presentations (one opening & one closing) each of about 1 hour in duration. In the time between the two keynotes attendees will be in control of the program (Barcamp style). There will be a number of conference rooms available for individuals to hold talks & discussions on topics they see fit. Past events have included topics such as Open Publishing, Open Education, Government Transparency, Open Access, Open Research Data, Creative Commons, Open Hardware, and more.

Good to see conversations like this happening! It seems like the theme of openness is cropping up more and more in every field, but the opportunities for communicating between those fields are few and far between.

Tuesday, April 7, 2009

International Mapping & Database Projects

We've got a lot to learn from directory efforts in other countries. Two examples stand out, in particular: Brazil and Quebec.

The Brazilian solidarity economy directory (called "Solidarius") has been in development since 2005 and, after two phases of participatory "mapping" of enterprises, now lists over 22,000 initiatives and is developing powerful information tech features to increase the usefulness of its database to grassroots economic movements.

Here's the link to Solidarius.

This directory includes basic and advanced search features, a virtual marketplace for solidarity economy products and services, educational and informational resources, a network-building facilitation feature, and an integrated social currency that facilitates exchange among solidarity economy consumers and producers.

The software (which we need to learn more about in terms of technical specifications) is, as far as I understand, open source.

The Quebec database is another informative project, though my lack of French makes it difficult to fully explore. Here's the link. One aspect of this project is that it is a "portal" rather than simply a directory. They intend to create a kind of "one stop shopping" for information on the "social economy." The directory, then, is placed in the context of news, job offers, events postings, and an online commerce feature.

The Brazilian database developers are currently working with those in Quebec on a system through which both databases would "talk to each other." I don't know the details of this project.

Mapping Solidarity Economy Networks

This is the first in a series of posts that will add elements of past DCP research and brainstorm ideas to this blog.

Today: the concept of "mapping networks." One of the potentially powerful applications of a comprehensive relational database of cooperative/solidarity economy initiatives is that we could begin to "map" the concrete economic relationships between them--supply chains, product distribution routes and markets. A dynamic analysis of these relationships could allow SE enterprises to visually understand new possibilities for building economic relationships across sectors and geographical regions. We would be able to understand where our relational strengths lie, be able to map the patterns of the networks to better understand their topology, and see where the "holes" are that could be filled.

This is a way of thinking that works effectively for some capitalist firms and production networks, so why not for solidarity networks?

Here are some links:

Monday, April 6, 2009

First Post: We got some money!

Many thanks to Equal Exchange, the worker-owned cooperative fair trade company, for giving the Data Commons Project our first big grant to fund development of the Data Commons Cooperative. They give some charitable contributions to further the creation of a cooperative economy. And we want to help build the infrastructure through which people can participate in cooperative, solidarity, sustainable, and fair trade exchanges. So it's a good match. And now we are going to buckle down and get to work!

We also don't want to forget all of the individual and organizational donors who gave to the DCP during our fundraising campaign last year. Thanks for believing in us. We are on the job.