langtech

Adventures in copyright: Open access, data and wikis

I’ve just had a very interesting experience that sheds light on some important issues regarding copyright, online data and crowdsourced media such as wikis. I thought I’d share the story to spark a debate on these issues.

For a couple of years I’ve worked on and off on a simple web-based system for maintaining and presenting a database of inflections of Icelandic words: Beygingarlýsing íslensks nútímamáls or “BÍN” for short. The data is available online, but the maintenance system is used by an employee of the official Icelandic language institute: Stofnun Árna Magnússonar í íslenskum fræðum. She has been gathering this data and deriving the underlying structure for a decade or more. As you can imagine, BÍN is an invaluable resource for a variety of uses, ranging from foreigners learning Icelandic to the implementation of various language technology projects.

Now before I go any further I think it’s important to say that I’m a big supporter of open data. In fact, one of the few things I’ve ever gotten involved in actively lobbying for is open access to data in the public sector (article unfortunately in Icelandic).

Back to the story. A couple of days ago I got a call from the aforementioned BÍN administrator. She’d gotten a tip that someone was systematically copying data from BÍN into the Icelandic Wiktionary and asked me to look into it.

I started going through the web server log files – and sure enough – comparing the log files to the new entries page on Wiktionary, the pattern was obvious: A search for a word in BÍN and 2-3 minutes later a new entry in Wiktionary with that same word. A pattern consistent with someone copying the data by hand. This pattern went back a few days at least. Probably a lot longer.
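For the technically curious, this kind of comparison can be sketched in a few lines of Python. This is only an illustration of the idea, not the script that was actually used: the function name, the five-minute window, the words and the timestamps are all made up for the example.

```python
from datetime import datetime, timedelta

def correlate(searches, new_entries, window=timedelta(minutes=5)):
    """Pair each new wiki entry with an earlier search for the same word.

    searches and new_entries are lists of (word, timestamp) tuples.
    A pair counts as a match when the entry was created no more than
    `window` after the search. The window size is an assumption.
    """
    matches = []
    for word, created in new_entries:
        for searched_word, searched in searches:
            delay = created - searched
            if searched_word == word and timedelta(0) <= delay <= window:
                matches.append((word, searched, created))
                break  # one matching search per entry is enough
    return matches

# Hypothetical sample data: searches parsed from a web server log ...
searches = [
    ("hestur", datetime(2000, 1, 1, 10, 0)),
    ("kona", datetime(2000, 1, 1, 10, 10)),
    ("bók", datetime(2000, 1, 1, 11, 0)),
]
# ... and entries taken from a "new entries" listing.
new_entries = [
    ("hestur", datetime(2000, 1, 1, 10, 2)),
    ("kona", datetime(2000, 1, 1, 10, 13)),
    ("fjall", datetime(2000, 1, 1, 10, 30)),
]

matches = correlate(searches, new_entries)
for word, searched, created in matches:
    print(word, "searched", searched.time(), "created", created.time())
```

A handful of matches like these proves nothing on its own. It is the long, regular run of them that makes the pattern stand out.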

In light of this I blocked access from the IP addresses that these search requests originated from and redirected them to a page that – in no uncertain terms – stated our suspicion of abuse and listed our email addresses in order for them to contact us for discussion.

Now – BÍN is clearly marked as copyrighted material – and as the rights holder of the content, the institute has the full right to control access to and usage of its data. The inflections of a single word are obviously not intellectual property, but any significant part of a collection of nearly 260,000 such words definitely is.

As I said before, I’ve personally been advocating for open access to all public sector data, but I also know that this is a complicated issue – one that goes far beyond the opinions of the people working with individual data sets. This institute – for example – must obey the rules set for it by the Ministry of Education, and changing those rules is something that must be taken up on an entirely different level.

The Wiktionary users in question have since contacted us and stated that they were not copying the content, merely referencing it when proofreading their own information. I have no reason to doubt that, but the usage pattern was indistinguishable from a manual copying process, leading to the suspicion and the blocking of their addresses.

We’ve since exchanged several emails and hopefully we’ll find a way for all parties to work together. It would be fantastic if the enthusiasm and great work that is being put into building the Wiktionary could be joined with the professional experience and scientific approach exercised by the language institute to build a common source with clear and open access.

At the end of the day, open access to fundamental data like this will spur innovation and general prosperity, but as this story shows, this is not something that will happen without mutual respect and consensus on the right way to move forward.

Updated Apr. 24: Discussion about this incident is also taking place here and here (both are at least partly in Icelandic).

Spurl launches an Icelandic search engine

Originally posted on the Spurl.net forums on November 1st 2005

Spurl.net just launched an Icelandic search engine with the leading local portal / news site mbl.is. Although Iceland is small, mbl.is is a nice, mid-size portal with some 210,000 unique visitors per week – so this has significant exposure.

The search engine is called Embla – it’s a pun on the portal’s name (e mbl a) and the name of the first woman created by the Gods in Norse Mythology.

An Icelandic search engine may seem a little off topic for many of you Spurl.net users, but it is relevant for several reasons:

  • We have been busy for some 4 months now, building the next version of Zniff. Embla uses this new code and this code will make its way into both Spurl.net (for searching your own spurls) and Zniff (for searching the rest of the Web) in the next weeks. The engine is now reliable, scalable, redundant and compliant with a lot of other .com buzzwords – with more relevant search results and subsecond response times on most queries.
  • This commercial arrangement with mbl.is (hopefully the first of many similar) gives us nice financial footing for further development of Spurl.net and related products.
  • Embla uses information from Spurl.net users. Even though only a small portion of our users are Icelandic, we’re using their information as one of the core elements for ranking the search results – and with good results. This strengthens our belief in the “human search engine” concept for other markets as well as on the international playing field.

I plan to write more on the search engine itself on my blog soon, but just wanted to mention briefly that with Embla we’re also breaking new ground in another territory: Embla “knows” Icelandic. Most search engines and technologies are of English origin, and from a search standpoint, English is a very simple language. Most English words have only a couple of word forms (such as “house” and “houses”) and while some search engines use stemming (at least sometimes), it doesn’t matter all that much for English. Many languages – including Icelandic – are far more complicated. Some words can – hypothetically – have more than 100 different word forms. In reality a noun will commonly have about 12-16 unique word forms. Now THIS really matters. Searching for all of the forms sometimes returns 6 to 10 times as many results, and it improves relevance as well.

We have built Embla so that it searches for all forms of the user’s search words. We also offer spelling corrections for Icelandic words, based on the same lexicon. The data for this comes from the Institute of Lexicography, University of Iceland, but the methodology and technology are ours all the way and can (and will) be used for other languages as well. Quite cool stuff actually – as mentioned, I will write more about that on the blog soon.
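To give a feel for the idea, here is a toy sketch in Python of searching for all forms of a word and suggesting spelling corrections from the same lexicon. It is not Embla’s code: the tiny lexicon (only the forms without the definite article), the function names and the similarity cutoff are assumptions made up for this example, and the spelling suggestions lean on Python’s standard difflib module rather than anything we built.

```python
import difflib

# Toy lexicon: lemma -> inflected forms (forms without the article only).
# A real lexicon would hold hundreds of thousands of entries.
LEXICON = {
    "hestur": ["hestur", "hest", "hesti", "hests", "hestar", "hesta", "hestum"],
    "hús": ["hús", "húsi", "húss", "húsum", "húsa"],
}

# Reverse index: any word form -> the lemmas it can belong to.
FORM_TO_LEMMAS = {}
for lemma, forms in LEXICON.items():
    for form in forms:
        FORM_TO_LEMMAS.setdefault(form, set()).add(lemma)

def expand(word):
    """Return all known forms of every lemma that `word` is a form of.

    Unknown words are returned unchanged, as a one-item list.
    """
    lemmas = FORM_TO_LEMMAS.get(word.lower())
    if not lemmas:
        return [word]
    forms = set()
    for lemma in lemmas:
        forms.update(LEXICON[lemma])
    return sorted(forms)

def suggest(word, n=3, cutoff=0.7):
    """Suggest similar known forms for a word missing from the lexicon."""
    if word.lower() in FORM_TO_LEMMAS:
        return []  # the word is known, so there is nothing to correct
    return difflib.get_close_matches(
        word.lower(), list(FORM_TO_LEMMAS), n=n, cutoff=cutoff
    )

# A query for one form becomes a query for all of them.
print(" OR ".join(expand("hesti")))
# A misspelled word gets suggestions drawn from the same lexicon.
print(suggest("hestir"))
```

The point of the reverse index is that the user can type any form of the word, not just the dictionary form, and still get every other form searched for.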