Thursday, October 15, 2009

Solidarity Economy Maps

The DCP has recently made contact with a few groups affiliated with the Solidarity Economy Network that are working on mapping the solidarity economy in their locations. These are greater Philadelphia, New York (with an emphasis on Brooklyn), and Monmouth County, New Jersey. Each group is at a different stage, but they are all eager to figure out how to make their efforts usable and shareable with others. Hence their interest in the Data Commons! Stay tuned -- one day these efforts will grow into what Ethan was describing in an earlier post on mapping solidarity economy networks.

Monday, June 29, 2009

Open Database License, version 1 is out

The Open Data Commons project (no relation) has released version 1.0 of its Open Database License (previously mentioned here). It is great to see someone with the legal savvy to put this together.

This license is an "Attribution and Share-Alike" license, doing for data/databases what Creative Commons has done for media.

Thursday, June 4, 2009

DCP work in the latest version of Gnumeric

A code contribution from a DCP developer (previously mentioned here) to the Gnumeric spreadsheet package has been included in their latest official release, gnumeric-1.9.8. The patch adds support for merging sheets from separate files into a single spreadsheet. It is intended to aid the automatic generation of complex spreadsheet views of online data for easy download.
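For the curious, here is roughly what using the feature looks like from the command line. This is a sketch from memory -- the option is spelled --merge-to in recent gnumeric releases, I believe, and the file names are just placeholders -- so check ssconvert --help on your own system:

# merge several single-sheet files into one Excel workbook,
# one sheet per input file (option name may vary by gnumeric version)
ssconvert --merge-to=combined.xls members.csv loans.csv balances.csv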

Thursday, May 21, 2009

data.gov goes live

The US Federal Government is putting some more of its data out there, at data.gov. Their full list of raw data at the time of writing comes to 47 "catalogs" in formats including XML, CSV, KML, ESRI, and a few RSS feeds. It is not very extensive yet compared to existing sources, but it's a good start. The site is clearly separated into the raw data catalog on one side, and the tools they have for exploring the data on the other. This means that third parties could come along and make better tools with access to the same data, benefiting everyone. We're following a similar model: the data commons "repository" holds all the data, and our own online tools and API will not be privileged in any way. If someone can come along and visualize our data better than we can, then that would be great!

Monday, May 18, 2009

Why and How: Making Data Open

Our similarly-named-but-unrelated-peers, the Open Data Commons, have posted a guide to making your data open, briefly explaining why you might want to do it, and how to do it right.

Here's their take on why open data is important:
[Open data is] crucial because open data is so much easier to break-up and recombine, to use and reuse.
And their take on the need for clear licensing:
Licensing is important because it removes uncertainty. Without a license you don’t know where you, as a user, stand: when are you allowed to use this data? Are you allowed to give to others? To distribute your own changes, etc?

Tuesday, May 12, 2009

Generating spreadsheets online

One of the advantages we hope the data commons will bring to data sharers is that we'll handle all the tedious format conversion issues that often raise the cost of collaboration.

Smaller organizations often prefer to work with spreadsheets rather than databases, so we are working to support export and import to spreadsheet formats, such as that of Excel.

The free and open source spreadsheet program gnumeric comes with a command-line utility, ssconvert ("SpreadSheet convert"), which is almost ideal for automating such conversions. For example, it can take the primitive CSV (comma-separated value) format, which is easy to generate from any source, and convert it to all the various Excel formats (and lots of other formats too).
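A minimal conversion is a one-liner; ssconvert guesses the output format from the file extension (the file names here are just examples):

# convert a CSV export into an Excel workbook;
# run "ssconvert --help" for the full list of options and formats
ssconvert listings.csv listings.xls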

A patch from a data commoner (myself) to support merging multiple workbooks was recently accepted by the developers of gnumeric. This is typical of how the free and open source software community works: someone benefiting from a public good extends it to meet a need they have, then contributes the extension for the benefit of all. This model is at the heart of what the data commons project wants to bring to the cooperative economy.

Monday, May 11, 2009

A site to watch: Farm 2 Local

I just ran across Farm2Local, another budding start-up that aims to make it easier for farmers to find buyers and consumers to find fresh, local produce. Perhaps we can learn from them.

Saturday, May 9, 2009

Data Genius

"Don't be data rich and knowledge poor." Now that's a tagline I find intriguing. Living Naturally is one of NCGA's partner organizations, a "supplier of software and solutions to retailers in the natural products industry, including store automation." The sophistication of the data environment for some types of enterprises, including food co-ops, is a little intimidating for me. How can we make sure that what the Data Commons provides is of sufficient value and can integrate into the rest of their operations seamlessly?

Tuesday, May 5, 2009

Featured Directory: The .Coop Directory

Cooperatives are one of the few types of businesses to have their own Top-Level Domain suffix, .coop. The organization that grants these domain names has been improving its directory of listed organizations, as a service to domain holders and in an effort to popularize the .coop suffix.
Features that they are advertising:
  • .coop domain holders can Claim and Customize their listings
  • listings are geo-tagged, so you can search by geography
  • multimedia: you can add a photo, video and logo to your listing; you can also upgrade (pay?) to add more photos and videos, or a custom map or directory
  • sharing widgets to point people to the directory, such as via Facebook and Myspace
These seem like fairly good ideas. It will be fun to see if they want to become a Data Sharing Organization someday.

UPDATE: The .coop Directory uses a company's software to display the directory listings on a map. We're a little disappointed at the closed and proprietary nature of the software. Coops can do better!

Retailer-Supplier Data-sharing

I don't know if the Data Commons will ever get this specific. Here's a story we were alerted to about how a retailer (Food Lion) is working with many of its suppliers to exchange real-time (or nearly real-time) data on sales, inventory, etc. This helps give a clearer idea of what the stores have, what customers are buying, when a product is out of stock, and so on. This kind of data may not be our primary focus at the moment, but who knows where we'll end up?

Saturday, May 2, 2009

SQLFairy schema translation

The SQLFairy project, which has one of the most ... unusual logos around, produces a nifty command-line tool called sqlt. This tool can map the structure of a database from one format to another, including SQL databases, Excel spreadsheets, comma-separated tables, etc. Here's the full list of formats it can read from ("Parsers") and write to ("Producers"):
Parsers: Access, DB2, DB2-Grammar, DBI, DBI-DB2, DBI-MySQL, DBI-Oracle, DBI-PostgreSQL, DBI-SQLServer, DBI-SQLite, DBI-Sybase, DBIx-Class, Excel, MySQL, Oracle, PostgreSQL, SQLServer, SQLite, Storable, Sybase, XML, XML-SQLFairy, YAML, xSV

Producers: ClassDBI, DB2, DBIx-Class-File, DiaUml, Diagram, Dumper, GraphViz, HTML, Latex, MySQL, Oracle, POD, PostgreSQL, SQLServer, SQLite, Storable, Sybase, TT-Base, TT-Table, TTSchema, XML, XML-SQLFairy, YAML

It translates the structure of data (in SQL: CREATE, ALTER) and not the data itself (in SQL: INSERT, UPDATE, DELETE). On Ubuntu/Debian systems, this tool is available in the "sqlfairy" package. Let's try a quick test on the following Excel spreadsheet (well, actually an OpenOffice spreadsheet, but saved in Excel format):

[screenshot: a small spreadsheet with columns Account, First_Name, Last_Name, and Balance]

If we save this as sqlfairy.xls and run sqlt like this:
sqlt --from Excel sqlfairy.xls --to MySQL
we get:
--
-- Created by SQL::Translator::Producer::MySQL
-- Created on Sat May 2 14:30:57 2009
--
SET foreign_key_checks=0;

--
-- Table: `Accounts`
--
CREATE TABLE `Accounts` (
`Account` integer(3) NOT NULL DEFAULT '',
`First_Name` char(4) DEFAULT '',
`Last_Name` char(5) DEFAULT '',
`Balance` integer(4) DEFAULT '',
PRIMARY KEY (`Account`)
);

SET foreign_key_checks=1;
Not a bad start. If you're translating from a MySQL database, you can either connect to it live using the DBI parser, or dump your database first like this:
mysqldump --user=USER --password=PASS DATABASE_NAME --lock-tables=false --no-data > dump.sql
And then convert it like this (here we convert to sqlfairy's own xml format):
sqlt --from MySQL dump.sql -t XML-SQLFairy > dump.xml
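The same invocation pattern should work for any of the producers listed above. For instance (untested here, but following the same style as the commands above), you could turn the dumped schema into browsable documentation:

sqlt --from MySQL dump.sql --to HTML > schema.html
sqlt --from MySQL dump.sql --to POD > schema.pod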

Thursday, April 30, 2009

Merging and maintaining bibliographic databases

Marcive is a company that offers "bibliographic services" such as deduplication of records after multiple libraries place their records into a shared catalog, reclassification from Dewey Decimal to Library of Congress classification, enrichment of brief records, etc. Their database cleanup brochure makes fascinating reading (well, if you've been spending way too much time thinking about pooling disparate databases).

There is a definite parallel between maintaining and merging library catalogs, and what the data commons aims to do. Wish we had an equivalent of the Library of Congress...

All fields become optional, all relationships many-to-many

Some wisdom on database app aging:
  1. All Fields Become Optional: As your dataset grows, exceptions creep in. There’s not enough research time to fill in all your company profiles, there’s one guy in Guam when you expected everyone to be in a U.S. state, there’s data missing from the page you’re scraping, you have to pull updates from a new source...
  2. All Relationships Become Many-to-Many: Some guy works in DC but lives in Virginia, so he needs two Locations. A new type of incoming email needs to be shoveled out to different feeds. A state has both a primary and a caucus. Someone eventually realizes categories never really were mutually exclusive...
Important to remember as we try to build a data commons that will last a long, long time.
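To make the second point concrete, here is a tiny hypothetical sketch (the table names are invented, and sqlite3 is used just for illustration): instead of a single location column on each person record, a join table lets one person have any number of locations, and one location any number of people.

sqlite3 directory.db <<'SQL'
-- hypothetical schema for illustration only
CREATE TABLE person   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE location (id INTEGER PRIMARY KEY, city TEXT, state TEXT);
-- the join table is what makes the relationship many-to-many
CREATE TABLE person_location (
  person_id   INTEGER REFERENCES person(id),
  location_id INTEGER REFERENCES location(id)
);
SQL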

Use Case: Aggravation

Working on a project today, I came across the type of headache that I want the Data Commons to solve.
  • Two different ways of writing the company name
  • Two different ways of writing the same address, and a third address of unknown reliability
  • A misspelling of the town name
  • The perennial East Coast problem of having to tell Excel that zip codes starting in "zero" need to be treated as text
  • No entries in the third column, "Title" -- and the probability that any listed titles could already be out of date
  • The sense of futility that fixing these problems once will not mean they are fixed forever

It's comforting, in a way, to know that at least I need the Data Commons to exist to make my life easier.

Open Database License, new draft out

The Open Data Commons (no relation) has a new draft of their Open Database License out (v1.0 release candidate). From its preamble:

The Open Database Licence (ODbL) is a licence agreement intended to allow users to freely share, modify, and use this Database while maintaining this same freedom for others. Many databases are covered by copyright, and therefore this document licenses these rights. Some jurisdictions, mainly in the European Union, have specific rights that cover databases, and so the ODbL addresses these rights, too. Finally, the ODbL is also an agreement in contract for users of this Database to act in certain ways in return for accessing this Database.

The license does a good job of using copyright to maintain freedoms, as free software licenses do. It does not address "privacy rights / data protection rights over information in the contents." I wonder whether such rights could be used, like copyright, as a means to maintain data freedom? In other words, as part of the privacy / data protection terms, agreeing to maintain freedom would be a requirement.

Monday, April 27, 2009

A Shared Directory of Local Food


While browsing the Food Routes website, which is dedicated to promoting local food buying, I ran across this description of their database:
In collaboration with eatwellguide.org, FoodRoutes brings you an online map that can help you find locally-produced food near you. This map combines multiple directories from organizations around the nation into one powerful database. In the directory, you'll find descriptions, phone numbers, addresses, web sites, crop lists, and directions all to make local food purchasing that much easier.
I would love to know how they combine all those directories and keep them up to date, and whether the Data Commons Project can help them do that better.

Thursday, April 23, 2009

An Example of a Shared Repository

I just ran across the Fedora Commons, "the home of the unique Fedora open source software, a robust integrated repository-centered platform that enables the storage, access and management of virtually any kind of digital content." The Commons is a non-profit organization "whose purpose is to provide sustainable open-source technologies to help individuals and organizations create, manage, publish, share, and preserve digital content upon which we form our intellectual, scientific, and cultural heritage." It continues the mission of the Fedora Project, which evolved from the Flexible Extensible Digital Object Repository Architecture (Fedora) developed by researchers at Cornell Computing and Information Science. It looks like what they do for libraries, etc. is very similar to what we want to do with directories. Paul has some specific ideas about why the way in which they do it is not the way in which we want to do it, but I'll let him explain.

I'm interested in their funding sources. Two years ago, they got a 4-year, $4.9 million grant from the Gordon and Betty Moore Foundation.
"The Gordon and Betty Moore Foundation, established in 2000, seeks to advance environmental conservation and cutting-edge scientific research around the world and improve the quality of life in the San Francisco Bay Area. The Foundation’s Science Program seeks to make a significant impact on the development of provocative, transformative scientific research, and increase knowledge in emerging fields." Could we learn something from Fedora's application for the grant?

Big Data

Out of curiosity, I've been looking at how some of the "big" sources of open data out there distribute their data. Wikipedia is perhaps the most famous. All the data on Wikipedia and related sites is available for bulk download. For example, the English section of Wikipedia is available here:
In other words, with a click or two you can end up with an XML file holding the basic content of all English Wikipedia pages. There are other XML and SQL files for other bits and pieces.

The DMOZ open directory (like Yahoo's directory, but volunteer created and under a free license) is downloadable in RDF format at http://rdf.dmoz.org/.

Of course, there's lots more data out there, but this does give a sense of one way in which "Big Data" may be distributed. What I like:
  • It is really easy to get the free data, just like it is easy to get free software.
  • The data is in a good format to use, just like free software source code.
  • Rights to the data are granted in a clear and free license, just like free software.
What I don't like:
  • There's no equivalent of software "patches." Let me explain. If you improve a piece of code someone else wrote, you can automatically generate the "difference" between the original and your revised version and send that difference (called a "patch") to the original author, who can then evaluate it and, if they like it, merge it automatically with their code (even if they've made their own non-overlapping changes in the meantime). That's patching in software. Now what happens if you improve pages you downloaded from Wikipedia? I guess you go to the site and try typing them in - there's no way I can see to submit something like a patch. And without a patching mechanism, there's no basis for distributed development of the data, the way it happens with free software. (A rough sketch of the software workflow appears below.)
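To make the analogy concrete, here is roughly what the software workflow looks like using the standard diff and patch tools on a plain-text data file (the file names are invented for illustration):

# the contributor records the difference between the original and their edits
diff -u coops-orig.csv coops-edited.csv > coops.patch

# the maintainer reviews the patch, then applies it to their own copy,
# which may have picked up other, non-overlapping edits in the meantime
patch coops-orig.csv < coops.patch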
There are distributed databases out there. CouchDB is interesting, for example. But it would also be interesting to have a procedure for patching and merging data that operated on an external representation rather than on live databases.

Update: Nat Torkington has a post called Truly Open Data asking similar questions.

Update (2): I've been doing work on storing our own data in fossil, a distributed version control system. The trick, as I see it, is to bridge the gap between git/bzr/hg/fossil/... and programs like Excel that non-programmers keep their data in.
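As a first experiment, this amounts to nothing fancier than keeping exported spreadsheets (as CSV) under ordinary version control. Here is a minimal sketch with fossil, with repository and file names as placeholders (check fossil help for the exact commands in your version):

# create a repository and open a working checkout
fossil init datacommons.fossil
mkdir data && cd data
fossil open ../datacommons.fossil

# drop in a CSV export from a spreadsheet, then record it
fossil add members.csv
fossil commit -m "first import of the members directory"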

Wednesday, April 22, 2009

Products and Services

In a previous post, I mentioned that we are looking for "early adopters" to provide feedback on our proposed products and services. Here is a quick description of what the Data Commons Project is and what these products and services are:
The DCP has grown out of previous projects, including the Grassroots Economic Organizing "Economy of Hope" directory and the Regional Index of Cooperation (www.find.coop). The idea initially was simply to build an open, shared, comprehensive, and accurate catalogue of the (small-c) cooperative economy -- including coops and credit unions, but also land trusts, local currencies, employee-owned companies, community-supported agriculture, and so on. This effort is linking up with similar efforts worldwide to catalogue the cooperative/solidarity/social economy. (There are different terms, none of which is completely satisfactory -- we've been playing with using the term "rooted" economy. More on that later.)
What we would like to build is:
1. a big repository of data: a database with names of companies, organizations, and individuals, contact info, descriptions of their products/services, etc.
2. a website that is one of many clients* of the big repository, which would display the info, permit searches, display results on an interactive map, and maybe provide more value-added reports for a fee, etc.
3. a set of tools & protocols that allow fast, efficient, relatively easy merging and cleaning of data to and from the repository.

* Other clients would be members of the Data Commons Cooperative, and as such, subscribers to periodic updates from the repository, as well as suppliers of their own updates back to the repository (somewhat in the style of the AP news story cooperative, where members both contribute and use stories, and also non-members can sign up to just be users for a fee).

The idea here is that if an organization changes its contact info, say, or has an announcement, it would be great if that change were picked up in one place and broadcast out to all the places that have an interest in it, instead of each place going through and updating its database piecemeal, or that organization somehow having to contact everyone to tell them about the change.

For example, a new worker coop formed in Western Mass. would be of interest to at least the Valley Alliance of Worker Coops, the ECWD, the USFWC, NCBA, NASCO, CFNE, SEN, GEO, MASSEIO, any industry or sector-specific networks that it might belong to, etc. etc., not to mention potential clients or suppliers. So, without getting too long-winded about all the possibilities, that's what we'd like to create, in a way that actually creates value for our users and would be financially sustainable.

Early Adopters

We're reaching out to organizations to become "early adopters" of our services (along with the organizations already on board: CDI, USFWC, NASCO, and GEO). This would mean investing about 5 hours/month, as an organization, for a year. Different people could rotate the responsibility of talking with us. Mainly what we would want is feedback, information, ideas, and concerns:
  • what you are doing now with information that could be made easier if the maintenance were shared with other organizations
  • what technology you are using, and how best to interface with it to make your lives easier and save you time, money, and aggravation
  • what you think of the ideas we have on what we could do and how to do it
  • new ideas about data use and presentation that would enable novel business opportunities
and so on.

Since we are gearing up to become a consumer-owned coop, this early "sweat equity" investment would count for most or all of the equity investment to join as a member of the coop. And the better feedback we get, the better we make our products and services, the more valuable being a member is.

Monday, April 13, 2009

Open Everything NYC

Open Everything NYC is coming up on April 18, 2009. From http://openeverything.net:

Open Everything is a global conversation about the art, science and spirit of 'open'. It gathers people using openness to create and improve software, education, media, philanthropy, architecture, neighbourhoods, workplaces and the society we live in: everything. It's about thinking, doing and being open.

From johndbritton:

Open Everything NYC will take place on Saturday 18 April 2009 at the UNICEF headquarters in the United Nations Plaza, NYC. The event will run the full day, registration will open at 8:00AM and things will be in full swing by 9:00AM.

The event will be 100% free and open to the public on a first come first serve basis, online pre-registration is required. The main hall can hold up to 250 guests.

The event will consist of two keynote presentations (one opening & one closing) each of about 1 hour in duration. In the time between the two keynotes attendees will be in control of the program (Barcamp style). There will be a number of conference rooms available for individuals to hold talks & discussions on topics they see fit. Past events have included topics such as Open Publishing, Open Education, Government Transparency, Open Access, Open Research Data, Creative Commons, Open Hardware, and more.

Good to see conversations like this happening! It seems like the theme of openness is cropping up more and more in every field, but the opportunities for communicating between those fields are few and far between.

Tuesday, April 7, 2009

International Mapping & Database Projects

We've got a lot to learn from directory efforts in other countries. Two examples stand out, in particular: Brazil and Quebec.

The Brazilian solidarity economy directory (called "Solidarius") has been in development since 2005 and, after two phases of participatory "mapping" of enterprises, now lists over 22,000 initiatives and is developing powerful information tech features to increase the usefulness of its database to grassroots economic movements.

Here's the link to Solidarius.

This directory includes basic and advanced search features, a virtual marketplace for solidarity economy products and services, educational and informational resources, a network-building facilitation feature, and an integrated social currency that facilitates exchange among solidarity economy consumers and producers.

The software (which we need to learn more about in terms of technical specifications) is, as far as I understand, open source.

The Quebec database is another informative project, though my lack of French makes it difficult to fully explore. Here's the link. One aspect of this project is that it is a "portal" rather than simply a directory. They intend to create a kind of "one stop shopping" for information on the "social economy." The directory, then, is placed in the context of news, job offers, events postings, and an online commerce feature.

The Brazilian database developers are currently working with those in Quebec on a system through which both databases would "talk to each other." I don't know the details of this project.

Mapping Solidarity Economy Networks

This is the first in a series of posts that will add elements of past DCP research and brainstorm ideas to this blog.

Today: the concept of "mapping networks." One of the potentially powerful applications of a comprehensive relational database of cooperative/solidarity economy initiatives is that we could begin to "map" the concrete economic relationships between them -- supply chains, product distribution routes, and markets. A dynamic analysis of these relationships could allow SE enterprises to visually understand new possibilities for building economic relationships across sectors and geographical regions. We would be able to understand where our relational strengths lie, map the patterns of the networks to better understand their topology, and see where the "holes" are that could be filled.

This is a way of thinking that works effectively for some capitalist firms and production networks, so why not for solidarity networks?
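As a toy illustration of the simplest form such a map could take, here is a hypothetical GraphViz sketch -- the enterprise names and relationships are invented, and a real map would of course be generated from the database rather than written by hand:

cat > network.dot <<'EOF'
digraph solidarity_economy {
  // invented enterprises; edges are supply relationships
  "Valley Organic Farm Coop" -> "Riverside Food Coop"  [label="produce"];
  "Valley Organic Farm Coop" -> "Worker-Owned Bakery"  [label="grain"];
  "Worker-Owned Bakery"      -> "Riverside Food Coop"  [label="bread"];
  "Community Credit Union"   -> "Worker-Owned Bakery"  [label="financing"];
}
EOF
dot -Tpng network.dot -o network.png   # requires the GraphViz "dot" tool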

Here are some links:

Monday, April 6, 2009

First Post: We got some money!

Many thanks to Equal Exchange, the worker-owned cooperative fair trade company, for giving the Data Commons Project our first big grant to fund development of the Data Commons Cooperative. They make charitable contributions to further the creation of a cooperative economy. And we want to help build the infrastructure through which people can participate in cooperative, solidarity, sustainable, and fair trade exchanges. So it's a good match. Now we are going to buckle down and get to work!

We also don't want to forget all of the individual and organizational donors who gave to the DCP during our fundraising campaign last year. Thanks for believing in us. We are on the job.