I’ve just had a very interesting experience that sheds light on some important issues regarding copyright, online data and crowdsourced media such as wikis. I thought I’d share the story to spark a debate on these issues.
For a couple of years I’ve worked on and off on a simple web-based system for maintaining and presenting a database of inflections of Icelandic words: Beygingarlýsing íslensks nútímamáls, or “BÍN” for short. The data is available online, but the maintenance system is used by an employee of the official Icelandic language institute, Stofnun Árna Magnússonar í íslenskum fræðum. She has been gathering this data and deriving its underlying structure for years, over a period of roughly a decade. As you can imagine, BÍN is an invaluable resource for a variety of uses, ranging from foreigners learning Icelandic to the implementation of various language technology projects.
Now before I go any further I think it’s important to say that I’m a big supporter of open data. In fact, one of the few things I’ve ever gotten involved in actively lobbying for is open access to data in the public sector (article unfortunately in Icelandic).
Back to the story. A couple of days ago I got a call from the aforementioned BÍN administrator. She’d gotten a tip that someone was systematically copying data from BÍN into the Icelandic Wiktionary and asked me to look into it.
I started going through the web server log files – and sure enough – comparing the log files to the new entries page on Wiktionary, the pattern was obvious: a search for a word in BÍN, and 2-3 minutes later a new entry in Wiktionary for that same word. That is consistent with someone copying the data by hand. The pattern went back a few days at least, and probably a lot longer.
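For the curious, the comparison described above can be sketched in a few lines of Python. This is only an illustration, not the script I actually used: the `Search` and `NewEntry` records, the `find_matches` function, the five-minute window and all of the sample data (including the IP addresses, which come from the reserved documentation ranges) are made up for the example. It assumes the search log and the list of new Wiktionary entries have already been parsed into timestamped records.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Search(NamedTuple):
    """One search request from the (hypothetical) parsed BÍN server log."""
    when: datetime
    ip: str
    word: str


class NewEntry(NamedTuple):
    """One row from the (hypothetical) parsed Wiktionary new entries page."""
    when: datetime
    title: str


def find_matches(searches, entries, max_gap=timedelta(minutes=5)):
    """Pair each new entry with searches for the same word that happened
    at most max_gap before the entry was created."""
    # Index searches by lower-cased word so each entry is a quick lookup.
    by_word = {}
    for s in searches:
        by_word.setdefault(s.word.lower(), []).append(s)

    matches = []
    for e in entries:
        for s in by_word.get(e.title.lower(), []):
            gap = e.when - s.when
            # Keep only searches that precede the entry by a short interval.
            if timedelta(0) <= gap <= max_gap:
                matches.append((s.ip, s.word, gap))
    return matches


# Invented sample data, for illustration only.
searches = [
    Search(datetime(2000, 1, 1, 10, 0), "192.0.2.10", "hestur"),
    Search(datetime(2000, 1, 1, 10, 7), "192.0.2.10", "köttur"),
    Search(datetime(2000, 1, 1, 11, 0), "198.51.100.7", "hundur"),
]
entries = [
    NewEntry(datetime(2000, 1, 1, 10, 3), "hestur"),
    NewEntry(datetime(2000, 1, 1, 10, 9), "köttur"),
    NewEntry(datetime(2000, 1, 1, 15, 0), "hundur"),
]

matches = find_matches(searches, entries)
for ip, word, gap in matches:
    print(f"{ip} searched for '{word}' {gap} before the entry appeared")
```

A handful of matches like this proves nothing on its own, since people look up and write about the same common words all the time. What made the real case stand out was the same addresses showing up again and again with the same short delay.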
In light of this, I blocked access from the IP addresses that these search requests originated from and redirected them to a page that – in no uncertain terms – stated our suspicion of abuse and listed our email addresses so that they could contact us to discuss the matter.
Now – BÍN is clearly marked as copyrighted material – and as the rights holder of the content, the institute has the full right to control access to and usage of its data. The inflections of a single word are obviously not intellectual property, but any significant part of a collection of nearly 260,000 such words definitely is.
As I said before, I’ve personally been advocating for open access to all public sector data, but I also know that this is a complicated issue – one that goes far beyond the opinions of the people working with individual data sets. This institute – for example – must obey the rules set for it by the Ministry of Education, and changing those rules is something that must be taken up on an entirely different level.
The Wiktionary users in question have since contacted us and stated that they were not copying the content, merely referencing it when proofreading their own information. I have no reason to doubt that, but the usage pattern was indistinguishable from a manual copying process, leading to the suspicion and the blocking of their addresses.
We’ve since exchanged several emails, and hopefully we’ll find a way for all parties to work together. It would be fantastic if the enthusiasm and great work being put into building the Wiktionary could be joined with the professional experience and scientific approach of the language institute to build a common source with clear and open access.
At the end of the day, open access to fundamental data like this will spur innovation and general prosperity, but as this story shows, it will not happen without mutual respect and consensus on the right way to move forward.
Updated Apr. 24: Discussion about this incident is also taking place here and here (both are at least partly in Icelandic).