Adventures in copyright: Open access, data and wikis

I’ve just had a very interesting experience that sheds light on some important issues regarding copyright, online data and crowdsourced media such as wikis. I thought I’d share the story to spark a debate on these issues.

For a couple of years I’ve worked on and off on a simple web-based system for maintaining and presenting a database of inflections of Icelandic words: Beygingarlýsing íslensks nútímamáls, or “BÍN” for short. The data is available online, but the maintenance system is used by an employee of the official Icelandic language institute, Stofnun Árna Magnússonar í íslenskum fræðum. She has been gathering this data and deriving the underlying structure for a decade or more. As you can imagine, BÍN is an invaluable resource for a variety of uses, ranging from foreigners learning Icelandic to the implementation of various language technology projects.

Now before I go any further I think it’s important to say that I’m a big supporter of open data. In fact, one of the few things I’ve ever gotten involved in actively lobbying for is open access to data in the public sector (article unfortunately in Icelandic).

Back to the story. A couple of days ago I got a call from the aforementioned BÍN administrator. She’d gotten a tip that someone was systematically copying data from BÍN into the Icelandic Wiktionary and asked me to look into it.

I started going through the web server log files – and sure enough – comparing the log files to the new entries page on Wiktionary, the pattern was obvious: A search for a word in BÍN and 2-3 minutes later a new entry in Wiktionary with that same word. A pattern consistent with someone copying the data by hand. This pattern went back a few days at least. Probably a lot longer.
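For the curious, the comparison I did by eye can be sketched in a few lines of Python. This is only an illustration, not the script I actually used: the function name `find_matches`, the five-minute window, the record layout and all of the sample data (words, times and the documentation-range IP addresses) are made up for the example.

```python
from datetime import datetime, timedelta


def find_matches(searches, new_entries, max_delay=timedelta(minutes=5)):
    """Pair each new wiki entry with earlier searches for the same word.

    searches:    iterable of (timestamp, ip, word) from the web server log
    new_entries: iterable of (timestamp, word) from the wiki's new-entries page
    max_delay:   how long after a search an entry may appear and still count
    """
    # Index the searches by word so each new entry is only compared
    # against searches for that same word.
    by_word = {}
    for searched, ip, word in searches:
        by_word.setdefault(word.lower(), []).append((searched, ip))

    matches = []
    for created, word in new_entries:
        for searched, ip in by_word.get(word.lower(), []):
            delay = created - searched
            # Keep only entries created shortly AFTER the search.
            if timedelta(0) <= delay <= max_delay:
                matches.append((word, ip, searched, created, delay))
    return matches


# Hypothetical sample data, invented for this sketch.
searches = [
    (datetime(2020, 1, 1, 10, 0, 0), "192.0.2.10", "hestur"),
    (datetime(2020, 1, 1, 10, 30, 0), "192.0.2.10", "köttur"),
    (datetime(2020, 1, 1, 11, 0, 0), "192.0.2.99", "hundur"),
]
new_entries = [
    (datetime(2020, 1, 1, 10, 2, 30), "hestur"),
    (datetime(2020, 1, 1, 10, 33, 0), "köttur"),
    (datetime(2020, 1, 1, 15, 0, 0), "hundur"),  # hours later: not a match
]

matches = find_matches(searches, new_entries)
for word, ip, searched, created, delay in matches:
    print(word, ip, delay)
```

Keep in mind that a match like this only shows that a search and a new entry happened close together in time. It cannot tell copying apart from someone looking a word up to check their own work.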

In light of this I blocked access from the IP addresses that these search requests originated from and redirected them to a page that – in no uncertain terms – stated our suspicion of abuse and listed our email addresses in order for them to contact us for discussion.

Now – BÍN is clearly marked as copyrighted material – and as the rightholder of the content, the institute has the full right to control access to and usage of its data. The inflections of a single word are obviously not intellectual property, but any significant part of a collection of nearly 260,000 such words definitely is.

As I said before, I’ve personally been advocating for open access to all public sector data, but I also know that this is a complicated issue – far beyond the opinion of the people working with individual data sets. This institute – for example – must obey the rules set for it by the Ministry of Education, and changing those rules is something that must be taken up at an entirely different level.

The Wiktionary users in question have since contacted us and stated that they were not copying the content, merely referencing it when proofreading their own information. I have no reason to doubt that, but the usage pattern was indistinguishable from a manual copying process, leading to the suspicion and the blocking of their addresses.

We’ve since then exchanged several emails and hopefully we’ll find a way for all parties to work together. It would be fantastic if the enthusiasm and great work that is being put into building the Wiktionary could be joined with the professional experience and scientific approach exercised by the language institute to build a common source with clear and open access.

At the end of the day, open access to fundamental data like this will spur innovation and general prosperity, but as this story shows, this is not something that will happen without mutual respect and consensus on the right way to move forward.

Updated Apr. 24: Discussion about this incident is also taking place here and here (both are at least partly in Icelandic).


  1. “At the end of the day, open access to fundamental data like this will spur innovation and general prosperity”

    If this is true, and I agree that it is, why did you shut down the users and why did you accuse them “in no uncertain terms” of “abuse”? Looking through the Wiktionary logs it looks like we’re talking about a small handful of articles, maybe something like a couple a day. And they seem to explicitly link to Orðabók Háskólans. Here’s an example from April 10:

    Maybe I’m missing something. Is this one of the “abusive” entries you are talking about?

  2. I believe you’re missing some of the main points in my blog entry.

    “I” shut down the access because the rightholders of the content believed that abuse was taking place. They are fully within their rights to do so.

    “No uncertain terms” referred to the fact that the text I put up on the blocking page was maybe a little harsh – leaving little room for doubt that there was in fact abuse going on. Which I’ve later come to understand may not necessarily be the case.

    The whole point of the entry – however – was that if we are to fight for open access to public sector data, we must do so at the right places. In this case not even the rightholders – let alone me – are at liberty to give open access to the data, even if we were so inclined. The data is under control of the Ministry of Education. These issues are more complicated than might meet the eye at first.

    The best way to ever get these issues solved is to advocate for a big policy change and understanding at the government level. This will take time and requires patience, but any run-ins – such as incidents of abuse or copying of data (again not referring to this case, but generally) will only make the cause more difficult.

    As for the example, you’re completely missing the fact that I’m no longer convinced that there was any abuse at all. If BÍN was used for reference, I think it should nevertheless be referred to in this article (and others) according to Wiktionary’s own guidelines.

  3. Well, thank you for clarifying. Reading your original post the impression I got was: “We thwarted some silly people who were abusing our stuff!” Now it seems that what you really mean was: “We overreacted to people who were doing something completely reasonable which they should definitely be allowed to do.”

    Yes, certainly, lobbying is needed. I’m less sure than you about this, though:

    “This will take time and requires patience, but any run-ins – such as incidents of abuse or copying of data (again not referring to this case, but generally) will only make the cause more difficult.”

    To demonstrate the utility of releasing data which the public has paid for into the public domain it may be necessary to show that the public is already using this data to good effect. This may include cases where the legal situation is unclear. What you’re saying is: “If the public were using the resources in question less then the powers that be would be more inclined to allow the public to use them.” Perhaps this is correct but it seems somewhat counterintuitive to me.

  4. What I really meant is probably somewhere between your two interpretations.

    Reading the first one into my text above implies a prejudgment of my opinions on the matter. I tried my best to be clear that I respect both the work and the opinions of those working on Wiktionary.

    In fact I think you may be underestimating how much I am on “your side” on this issue, and how much work is to be done to convince those that really decide on these issues that open access to public sector data is the way forward.

    On the last point we simply disagree on how to best change public policy and I don’t want to lead the conversation off topic to discuss that point.

  5. There is a lot of work to be done, we don’t disagree on that.

    Could you elaborate on the relevant Ministry of Education regulations which you refer to above?

  6. Hello there,

    I sent you an email earlier. This isn’t quite related to that, but it is tangentially related to the current conversation. It is interesting to see how the Norwegian government deals with this. The Norsk Ordbank (full-form inflection lists – something like BÍN) for both Nynorsk and Bokmål is available under the GPL:

    As is the Oslo-Bergen POS tagger based on Constraint Grammar:

    This is a much more desirable state of affairs for language technology, and while Norwegian isn’t an exemplar, it is certainly further along the road than Icelandic and doing very well considering the size.


  7. Haukur: I believe that the issue with the Ministry of Education is irrelevant to the bigger picture (general open access to public sector data), but it certainly is relevant to the particular example that spurred this discussion, so I’ll explain as far as my knowledge goes.

    You can read the background of BÍN on the BÍN website (in Icelandic).

    The Ministry of Education paid for the making of the BÍN database. This was a part of a special effort on language technology back in 2002. As such, the Ministry was the rightholder. In an agreement with the Icelandic Dictionary (now a part of the Árnastofnun institute) in 2005, the institute became the rightholder. This agreement is what defines the access to the data. As I understand it, part of that agreement is what led to the current licensing model for BÍN.

    Francis: Thanks for the pointers. Very interesting. The Norwegians seem much more forward looking than e.g. the Danish model which I think the Icelandic one was based on. But as said before there are usually more complications behind all of these than meet the eye at first.

  8. This discussion is fascinating, and very, very timely.

    Francis’ comments on the (enviable) state of affairs in Norway are nicely complemented by Hjalli’s earlier blog post about efforts in the UK to bring open access to British government data. And, isn’t the US far ahead in this, too?

    A similar effort is still very much at the grassroots level here in Iceland, and it is important to understand who the players are – policies are set at the ministerial level, within the confines of copyright law as it stands, not by individual researchers or even research institutions. If we want more open data access we need to clarify the resulting benefits for *everyone* – business, academia and the general public alike.

    Otherwise we’re just stuck with the current situation of copyright disbelievers who own very little data and are happy to share it, vs. copyright true believers who own most or all of the data but see little apparent benefit in a more open environment.

    I’m convinced that the owners of the data stand to gain more than anyone else from opening their data sets, just like the British and the Norwegians seem to have concluded. But, exactly what are those benefits? The crux of the matter is showing that the pros outweigh the cons.

    Incidentally… AFAIK the maintainers of the BÍN data set are forced to react to patterns such as those evidenced in the BÍN and Wiktionary logs; there is no choice in the matter. They’re charged with this as a part of their job descriptions and would be delinquent if they did not react. In fact, Hjalli had no choice in the matter either – he could have dragged his feet, but only somewhat pointlessly. Eventually a letter from the lawyers at the Ministry of Education would have forced the issue.

  9. The Icelandic government has a history of funding various good projects and then implementing restrictions on usage which limit the realized value to a fraction of the potential.
    The point of government funded projects is to implement business cases which are not feasible for the private sector but are beneficial from a gross national standpoint.
    I therefore reason that if the government does finance projects that are feasible from a gross national standpoint, they should be made publicly available to realize the maximum benefit. Access should only be limited if there is a need to prevent abuse or to pay for maintenance.

  10. Stefán: Kevin Scannell has a nice paper discussing the benefits for all parties, “Implementing NLP Projects for Non-Central Languages: Instructions for Funding Bodies, Strategies for Developers”. If you search for the title, you should be able to find a link to a PDF.

  11. I once wondered whether Iceland followed many civil law jurisdictions and the United States of America in placing government-published works in the public domain, but apparently it doesn’t.

    Iceland Copyrights Law on WIPO

    Hjalli, do you think something can be done to open the information?

  12. Jarl: I fundamentally agree with your position on this. I think the lack of access to public sector data in Iceland is not so much a strategy as a lack of one.

    I’m pretty sure that if it is decently presented, we can change that. There may be some gray areas or nuances, but the big picture is pretty straightforward once you start thinking about it.

    28481k: I’m pretty convinced that we can bring about a policy about open access to public sector data. All in all I would like to keep the conversation on the high level and not about the BÍN dataset as such, even though it was what sparked the discussion.

  13. Hi Hjalli,

    Interesting discussion – and one that’s long overdue! As it sounds like you’re interested in taking this discussion to the people who can actually make changes, I’d like to draw your attention to a mini-rant I wrote: (sorry, Icelandic)

    No surprises there, probably, but it gives (IMHO) a very clear example of how limited access to government data is tangibly preventing innovation.
