
The budget bill in pictures

The budget bill for 2009 was presented today. The bill itself is a massive tome that few people have at hand. You can read through it on the Ministry of Finance's budget website (Fjárlagavefur), but it is not exactly accessible, and it is hard to get a sense of the bigger picture.

DataMarket responded quickly and processed the data into a more manageable form. The result is a website where you can conveniently see what the government plans to spend our money on in the coming year.

The front page shows an overview of the ministries, ordered by expenditure. Clicking the bar for a ministry reveals the breakdown of its spending, and so on down the hierarchy. The data only goes three levels deep, and you often wish you could dig further, but the budget bill simply doesn't go any deeper. The next level would be the operating plans of individual institutions, and those are not available at this point.

I claim that it has never been as easy to interpret, scrutinize and critique a budget bill in Iceland as with this simple tool. Note that if you want to start a discussion about a particular ministry or line item, each one has its own URL, which you can link to directly from a blog or send in an email or over MSN.

Here, as an example, is the projected spending of the Unemployment Insurance Fund (Atvinnuleysistryggingasjóður) in the coming year.

Enjoy!

– – –

P.S. Anyone interested in getting the underlying data in a form that allows further processing (e.g. in Excel) is encouraged to get in touch.

Data visualization: Population trends in Iceland

As the name suggests, DataMarket is all about gathering and distributing data of every kind. I have therefore spent much of the past few weeks immersed in all sorts of data matters and in the possibilities that good, accessible data opens up.

Visualization is one of the things that can give data far greater weight and, when done well, bring out facts that would otherwise stay hidden in the soup of numbers.

I did a little experiment with population data from Statistics Iceland (Hagstofan). Based on figures for the age and sex distribution from 1841 to the present day, I built an interactive animation that shows how the population pyramid (also known as an age pyramid) evolves over the period. You can see the result by clicking the image below.

Click the image to play

 

Green shows those aged 18 and younger, red those aged 67 and older, and yellow everyone in between.

The animation highlights several interesting aspects of Iceland's population history:

  • Infant mortality: In the earliest years it is sad to see how the youngest cohorts, especially the very youngest, fail to move up the pyramid. This tells its own story about infant mortality, children's living conditions and the "health care" of that era.
  • "Baby boom": Up to the end of the Second World War the nation grows and ages fairly steadily. Births actually dip somewhat in the early war years, but then comes an outright explosion, the "baby boom", which clearly has its counterpart here. This boom has been attributed to growing prosperity, a better health care system and general optimism in the wake of the war. From around 1960 the cohort sizes even out again with the arrival of contraceptives and more planned childbearing than had been customary until then.
  • Foreign labour: The last story in the data is the boom of recent years. If you step through the years 2005-2008 (that's what the arrow keys are for), you can see a clear increase in the 20-50 age group, especially on the male side. Age groups cannot, by their nature, grow for natural reasons (nobody is born 25 years old), so this increase comes from net immigration. This is most likely part of the foreign workforce that has come here during the economic boom and construction projects of recent years.

There is surely more to be read from this animation, but I'll leave further analysis to you 🙂

A few words about the technology

The animation was made in a delightful tool called Processing, which makes this kind of work relatively simple. Processing can output both still images and video, but to get interactivity you export a so-called Java applet. Personally I would have preferred to see this as Flash, since support for it is more widespread and its browser implementation is in many ways nicer than Java's (faster startup, less flicker, and so on), but you can't have everything.
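For the curious, here is a minimal sketch of the general idea, in Python with matplotlib rather than the Processing/Java setup described above. The CSV file name and column layout are assumptions for illustration, not the actual data export from Hagstofan.

```python
# A minimal sketch (not the original Processing code): drawing one frame of a
# population pyramid with matplotlib, assuming a hypothetical CSV with columns
# year, age, males, females.
import csv
from collections import defaultdict
import matplotlib.pyplot as plt

def load_year(path, year):
    males, females = defaultdict(int), defaultdict(int)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if int(row["year"]) == year:
                age = int(row["age"])
                males[age] = int(row["males"])
                females[age] = int(row["females"])
    return males, females

def color_for(age):
    # Same colour coding as the animation: green <= 18, red >= 67, yellow between.
    return "green" if age <= 18 else "red" if age >= 67 else "gold"

def draw_pyramid(path, year):
    males, females = load_year(path, year)
    ages = sorted(set(males) | set(females))
    colors = [color_for(a) for a in ages]
    # Males to the left (negative widths), females to the right.
    plt.barh(ages, [-males[a] for a in ages], color=colors)
    plt.barh(ages, [females[a] for a in ages], color=colors)
    plt.xlabel("Population (males left, females right)")
    plt.ylabel("Age")
    plt.title(f"Population pyramid, {year}")
    plt.show()

# draw_pyramid("mannfjoldi.csv", 1900)  # hypothetical file name
```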

All ideas, suggestions and opinions are most welcome.

Open data: a new website

A few of us are launching a new website today: opingogn.net

It was Borgar who got the idea for this initiative last spring, after I gave a talk on access to public data at a lunch meeting hosted by Sjá and Marimo. Borgar set up the site and hosts it, while Már and I have tried to pitch in as much as we can.

The site is a wiki, so we encourage as many people as possible to help out.

I won't spend many words on the purpose of the site here; it is meant to explain itself. Regular readers of this blog have probably also had enough of the message by now:

Updated 11:41: Eyjan covered Opin gögn earlier this morning.

The Case for Open Access to Public Sector Data

This article will be published tomorrow in The Reykjavík Grapevine.

Government institutions and other public organizations gather a lot of data. Some of them – like the Statistics Office – have it as their main purpose, others as a part of their function, and yet others almost as a by-product of their day-to-day operations.

In this case I’m mainly talking about structured data, i.e. statistics, databases, indexed registries and the like – in short, anything that could logically be represented in table format. This includes a variety of data, ranging from the national budget and population statistics to dictionary words, weather observations and geographical coordinates of street addresses – to name just a few examples.

In these public data collections lies tremendous value. Data that has been collected with taxpayers’ money for decades, or in a few cases even centuries (population statistics, for example), is a treasure trove of economic and social value. Yet the state of public data is such that only a fraction of this value is being realized.

The reason is that accessing this data is often very hard. First of all, it’s often hard even to find out what exists: the sources are scattered, there is no central registry of existing data sets, and many agencies don’t even publish information about the data they have.

More worrying is that access to these data sets is made difficult by a number of restrictions: some are accidental, others stem from a lack of funding to make the data more accessible, and some are even deliberate. These restrictions include license fees, proprietary or inadequate formats and unjustified legal complications.

I’d like to argue that any data gathered by a government organization should be made openly accessible online. Open access means the absence of all legal, technical and discriminatory restrictions on the use or redistribution of the data. A formal definition of open access can be found at opendefinition.org.

The only exception to this rule should be when other interests – most importantly privacy issues – warrant access limitations.

There are a number of reasons for this. First of all, we (the taxpayers) have already paid for it, so it’s only logical that we should be able to use the product we bought in any way we please. If gathering the relevant data and selling it can be a profitable business on its own, it should be done in the private sector, not by the government. Secondly, it gives the public insight into the work done by public organizations, in much the same way as Freedom of Information laws have done, mainly through media access to public sector documents and other information.

The most important argument, however, is that open access really pays off. Opening access, and thereby getting the data into the hands of businesses, scientists, students and creative individuals, will spur innovation and release value far beyond anything a government organization could ever think of or would ever spend its limited resources on.

Some of these might be silly online games with little monetary value but plenty of entertainment. Others might be new scientific discoveries made when data from apparently unrelated sources is mixed. And yet others might be rich visualizations that give new insights into some of the fundamental workings of society, showing where there is need for attention and room for improvement.

A recent study on the state of public sector data in the UK concluded that the lack of open access is costing the nation about 1 billion pounds annually in lost opportunities and reduced competition in various areas. Scaled per capita, a billion pounds in the UK corresponds to roughly 750 million ISK for Iceland, and that is without adjusting for Iceland’s higher GDP per capita and, arguably, some gains that are fixed per nation.
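For reference, here is the rough arithmetic behind that figure, using my own approximate 2008 population and exchange rate numbers (assumptions for illustration, not figures from the study):

```python
# Back-of-the-envelope check of the per-capita scaling (assumed 2008 figures).
uk_loss_gbp   = 1_000_000_000   # estimated annual UK loss, GBP
uk_population = 61_000_000      # approx. UK population, 2008
is_population = 315_000         # approx. Iceland population, 2008
isk_per_gbp   = 145             # rough 2008 exchange rate

iceland_loss_gbp = uk_loss_gbp * is_population / uk_population  # ~5.2 million GBP
iceland_loss_isk = iceland_loss_gbp * isk_per_gbp                # ~750 million ISK
print(f"{iceland_loss_isk / 1e6:.0f} million ISK")
```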

Surely this is a huge opportunity for something that requires only a thoughtful policy change and a small budget adjustment to enable the institutions to make the needed changes and carry on their great work of gathering valuable data.

– – –

See also the article from The Guardian that helped spark a similar debate in the UK: Give us back our crown jewels

Adventures in copyright: Open access, data and wikis

I’ve just had a very interesting experience that sheds light on some important issues regarding copyright, online data and crowdsourced media such as wikis. I thought I’d share the story to spark a debate on these issues.

For a couple of years I’ve worked on and off on a simple web-based system for maintaining and presenting a database of inflections of Icelandic words: Beygingarlýsing íslensks nútímamáls, or “BÍN” for short. The data is available online, but the maintenance system is used by an employee of the official Icelandic language institute, Stofnun Árna Magnússonar í íslenskum fræðum. She has been gathering this data and deriving its underlying structure for years, over a period spanning a decade or more. As you can imagine, BÍN is an invaluable resource for a variety of purposes, from foreigners learning Icelandic to the implementation of various language technology projects.

Now before I go any further I think it’s important to say that I’m a big supporter of open data. In fact, one of the few things I’ve ever gotten involved in actively lobbying for is open access to data in the public sector (article unfortunately in Icelandic).

Back to the story. A couple of days ago I got a call from the aforementioned BÍN administrator. She’d gotten a tip that someone was systematically copying data from BÍN into the Icelandic Wiktionary and asked me to look into it.

I started going through the web server log files, and sure enough: comparing them to the new entries page on Wiktionary, the pattern was obvious. A search for a word in BÍN, and 2-3 minutes later a new entry in Wiktionary for that same word; a pattern consistent with someone copying the data by hand. It went back at least a few days, probably a lot longer.
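As an illustration of the kind of comparison involved, here is a minimal Python sketch. The timestamps, words and data structures are entirely made up, and this is not the actual script I used; it only shows the idea of lining up BÍN searches against new Wiktionary entries within a short time window.

```python
# Toy sketch: for each word searched in BÍN, look for a new Wiktionary entry
# with the same title created within a few minutes afterwards.
from datetime import datetime, timedelta

# Hypothetical extract of the BÍN access log: (timestamp, searched word).
bin_searches = [
    (datetime(2008, 4, 20, 14, 1), "hestur"),
    (datetime(2008, 4, 20, 14, 6), "köttur"),
]

# Hypothetical (timestamp, title) pairs from Wiktionary's new entries page.
wiktionary_entries = [
    (datetime(2008, 4, 20, 14, 3), "hestur"),
    (datetime(2008, 4, 20, 14, 9), "köttur"),
]

WINDOW = timedelta(minutes=5)

matches = [
    (word, search_time, entry_time)
    for search_time, word in bin_searches
    for entry_time, title in wiktionary_entries
    if title == word and timedelta(0) <= entry_time - search_time <= WINDOW
]

for word, searched, created in matches:
    print(f"{word}: searched {searched:%H:%M}, entry created {created:%H:%M}")
```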

In light of this, I blocked access from the IP addresses these search requests originated from and redirected them to a page that stated, in no uncertain terms, our suspicion of abuse and listed our email addresses so they could contact us to discuss it.

Now, BÍN is clearly marked as copyrighted material, and as the rights holder of the content the institute has every right to control access to and usage of its data. The inflections of a single word are obviously not intellectual property, but any significant part of a collection of nearly 260,000 such words definitely is.

As I said before, I’ve personally been advocating open access to all public sector data, but I also know that this is a complicated issue, one that goes far beyond the opinions of the people working with individual data sets. This institute, for example, must obey the rules set for it by the Ministry of Education, and changing those rules is something that must be taken up at an entirely different level.

The Wiktionary users in question have since contacted us and stated that they were not copying the content, merely referencing it when proofreading their own information. I have no reason to doubt that, but the usage pattern was indistinguishable from a manual copying process, leading to the suspicion and the blocking of their addresses.

We’ve since exchanged several emails and hopefully we’ll find a way for all parties to work together. It would be fantastic if the enthusiasm and great work being put into the Wiktionary could be combined with the professional experience and scientific approach of the language institute to build a common source with clear and open access.

At the end of the day, open access to fundamental data like this will spur innovation and general prosperity, but as this story shows, it is not something that will happen without mutual respect and consensus on the right way to move forward.

Updated Apr. 24: Discussion about this incident is also taking place here and here (both are at least partly in Icelandic).

Firehose aimed at a teacup

Dogbert to Dilbert: Information is gushing toward your brain like a firehose aimed at a teacup.

Every company, organization and individual is continuously gathering and creating all kinds of data. Most of this data collection happens in separate silos, with very limited connections between the different data collections. This is true even for data sources within the same organization. That is a shame, really, because the value of data rises in proportion to the ways it can be interlinked and connected: a network effect similar to the one that determines the value of social networks, telecommunication systems or even financial markets themselves.

A lack of common definitions, data schemas and metadata currently makes these connections quite hard to make. This is the very problem the semantic web promises to solve. However, a lot of this data is already finding its way onto the internet in one form or another, and those who make the effort to identify and collect the right bits can gain insights that give them a competitive advantage in their markets. At, and around, the Money:Tech conference I was fortunate enough to attend last week, several examples were given:

  • Stock traders are monitoring Amazon’s lists of top-selling electronics and using them as indicators of chip makers’ performance in the market. This is done by breaking down the supply chain for each of the top-selling devices and thereby establishing who is benefiting from, say, the stellar sales of iPods.
  • Insurance companies, utilities (energy sector) and stock traders (again) are constantly analyzing weather data to predict things like insurance claims, electricity demand and retail sales patterns.
  • By monitoring publicly available sales data in the real estate market in “real time”, companies like Altos have been able to accurately predict housing price indexes up to a month before the official government numbers are published. Similar approaches might make it possible to predict other major economic indicators, for example by matching the number of job listings on online job boards to changes in the unemployment rate, or by using online retail prices to predict inflation (a toy sketch of this idea follows below).
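To make that last idea concrete, here is a toy least-squares sketch in Python. All of the numbers are made up; it only illustrates how a listings count could be related to an unemployment rate, nothing about how Altos or anyone else actually does it.

```python
# Toy illustration: fit a least-squares line relating online job listings to
# the unemployment rate, then guess the rate from a new listings count.
import numpy as np

job_listings      = np.array([5200, 4800, 4100, 3300, 2500], dtype=float)
unemployment_rate = np.array([1.0, 1.2, 1.6, 2.3, 3.1])  # percent (made up)

# Fit unemployment ≈ a * listings + b
A = np.vstack([job_listings, np.ones_like(job_listings)]).T
(a, b), *_ = np.linalg.lstsq(A, unemployment_rate, rcond=None)

new_listings = 2000.0
print(f"Predicted unemployment: {a * new_listings + b:.1f}%")
```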

But it’s not only data gathered from the internet that’s interesting. Far more data lies buried deep in companies’ databases and is therefore (usually for a good reason) not publicly available.

Take the area of telecommunications. Mining their data in new ways could help telcos and ISPs get into areas currently dominated by other players. Take the sizzling-hot social networking area as an example: call data records, cell phone contact lists and email sending patterns are the quintessential social network information. Whom do I call, and how frequently? Who calls me? Who is the real hub of information flow within my company? All of this can be read pretty directly from the data a telco is already gathering. Every customer’s social network can be accurately drawn, including the strength, the “direction” and, to some degree, the nature of each relationship. Obviously this would have to be transparent to the customer and used only on an opt-in basis, but in terms of data accuracy it is a “Facebook killer” from day one.
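As a rough illustration, here is a toy Python sketch that builds such a “who calls whom” graph from made-up call records (not a real telco schema) and ranks the hubs by weighted in-degree.

```python
# Toy sketch: a directed, weighted call graph built from made-up call records.
import networkx as nx

call_records = [
    ("alice", "bob"), ("alice", "bob"), ("bob", "carol"),
    ("carol", "alice"), ("dave", "alice"), ("dave", "bob"),
]

G = nx.DiGraph()
for caller, callee in call_records:
    if G.has_edge(caller, callee):
        G[caller][callee]["weight"] += 1   # call frequency = strength of the tie
    else:
        G.add_edge(caller, callee, weight=1)

# Who is the hub of information flow? Weighted in-degree ~ "who gets called most".
hubs = sorted(G.in_degree(weight="weight"), key=lambda kv: kv[1], reverse=True)
print(hubs)
```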

This is clearly something Google would do if they were a telco (and I’m pretty sure that Android plays to this to some extent).

Another interesting aspect of this whole data collection business is the value that data gathered by one company has to other organizations. Again, the telco is my example. A telco has, by default, information about the rough whereabouts of every mobile phone customer. Plot these on a time axis, remove all personally identifying information, and you have a perfect view of the flow of traffic and people through a city, including seasonal and periodic changes in traffic patterns. This is of limited value to the telco itself, but imagine the value to city planners, or to businesses deciding where to build a service station or what the opening hours of their high-street store should be. Add a little target-group analysis on top of this and the results are almost scary.
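A minimal sketch of that aggregation, assuming hypothetical, already anonymized pings and a crude latitude/longitude grid:

```python
# Toy sketch: count phones per coarse grid cell per hour to see how people
# flow through a city over the day. The pings are made up and carry no IDs.
from collections import Counter

# (hour of day, latitude, longitude)
pings = [
    (8, 64.145, -21.94), (8, 64.146, -21.95), (9, 64.128, -21.90),
    (17, 64.145, -21.94), (17, 64.128, -21.91),
]

def cell(lat, lon, size=0.01):
    # Snap coordinates to a coarse grid so individuals cannot be singled out.
    return (round(lat / size) * size, round(lon / size) * size)

traffic = Counter((hour, cell(lat, lon)) for hour, lat, lon in pings)
for (hour, c), count in sorted(traffic.items()):
    print(f"{hour:02d}:00  cell {c}  {count} phones")
```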

We are probably at the very beginning of realizing the potential in the business of data exchange and data markets, but I’ll go as far as predicting that in the coming years we’ll see the rise of a new industry focusing solely on enabling and commercializing this kind of trade.

The world needs more people like him

This is a video of a presentation by Hans Rosling of Gapminder, the Swedish data visualization organization whose Trendalyzer software Google recently bought.

Rosling has a great vision: To make the world a better place by improving access to and presentation of data about the world, such as statistics about poor countries.

He makes the point very well in the presentation, which, incidentally, is funny as hell and at times as exciting as a good sports broadcast, just a lot more important.

When you next think about sitting down to watch a brain-numbing sitcom, watch this instead. It’ll make you feel better.