Author: Hjalmar Gislason

About Hjalmar Gislason

Founder and CEO of GRID (https://grid.is/). Curious about data, technology, media, the universe and everything. Founder of 5 software companies.

The Case for Open Access to Public Sector Data

This article will be published tomorrow in The Reykjavík Grapevine.

Government institutions and other public organizations gather a lot of data. Some of them – like the Statistics Office – have it as their main purpose, others as a part of their function, and yet others almost as a by-product of their day-to-day operations.

In this case I’m mainly talking about structured data, i.e. statistics, databases, indexed registries and the like – in short, anything that could logically be represented in table format. This includes a variety of data, ranging from the national budget and population statistics to dictionary words, weather observations and geographical coordinates of street addresses – to name just a few examples.

In these public data collections lies tremendous value. The data, collected with taxpayers’ money over decades – or in a few cases even centuries (like population statistics) – is a treasure trove of economic and social value. Yet the state of public data is such that only a fraction of this value is being realized.

The reason is that accessing this data is often very hard. First of all, it’s often hard to even find out what exists: the sources are scattered, there is no central registry of existing data sets, and many agencies don’t even publish information about the data they have.

More worrying is that access to these data sets is made difficult by a number of restrictions – some accidental, others due to a lack of funding to make the data more accessible, and some even deliberate. These restrictions include license fees, proprietary or inadequate formats and unjustified legal complications.

I’d like to argue that any data gathered by a government organization should be made openly accessible online. Open access means the absence of all legal, technical and discriminatory restrictions on the use or redistribution of data. A formal definition of open access can be found at opendefinition.org.

The only exception to this rule should be when other interests – most importantly privacy issues – warrant access limitations.

There are a number of reasons for this. First of all, we (the taxpayers) have already paid for it, so it’s only logical that we can use the product we bought in any way we please. If gathering the relevant data and selling it can be a profitable business on its own, it should be done in the private sector, not by the government. Secondly, it gives the public insight into the work done by public organizations, in a similar way as Freedom of Information laws have done – mainly through media access to public sector documents and other information.

The most important argument – however – is that open access really pays off. Opening access, and thereby getting the data into the hands of businesses, scientists, students and creative individuals, will spur innovation and release value far beyond anything a government organization could ever think of or would ever spend its limited resources on.

Some of these might be silly online games with little monetary value, yet highly entertaining. Others might be new scientific discoveries made when data from apparently unrelated sources is combined. And yet others might be rich visualizations that give new insights into some of the fundamental workings of society – showing where there’s need for attention and room for improvement.

A recent study on the state of public sector data in the UK concluded that the lack of open access is costing the nation about 1 billion pounds annually in lost opportunities and lack of competition in various areas. Scaled per capita, a billion pounds in the UK corresponds to roughly 750 million ISK for Iceland – and that’s without adjusting for Iceland’s higher GDP and arguably some fixed gains per nation.
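
For a rough sense of how that per-capita scaling works, here is a back-of-the-envelope calculation. The population figures and the ISK/GBP exchange rate below are approximate assumptions on my part, not numbers taken from the UK study itself:

    # Back-of-the-envelope scaling of the UK estimate to Iceland's population.
    # All inputs are rough assumptions (2008-ish figures), not data from the study.
    uk_loss_gbp = 1_000_000_000   # ~1 billion pounds per year (the study's estimate)
    uk_population = 61_000_000    # assumed UK population
    is_population = 310_000       # assumed Icelandic population
    isk_per_gbp = 150             # assumed exchange rate, ISK per pound

    per_capita_gbp = uk_loss_gbp / uk_population
    iceland_gbp = per_capita_gbp * is_population
    iceland_isk = iceland_gbp * isk_per_gbp
    print(f"{iceland_isk / 1e6:.0f} million ISK per year")  # roughly in line with the 750 million ISK figure above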

Surely a huge opportunity for something that requires only a thoughtful policy change and a small budget adjustment to enable the institutions to make the needed changes and continue their great work of gathering valuable data.

– – –

See also an article from The Guardian that helped spark a similar debate in the UK: Give us back our crown jewels

Enter the cloud – but how deep?

The new company is officially founded and has got a name: DataMarket. You’ll have to stay tuned to hear about what we actually do, but the name provides a hint 😉

In the last couple of weeks I’ve spent quite some time thinking about the big picture of DataMarket’s technical setup, and it has led me to revisit an old favorite subject, namely cloud computing.

Cloud computing fits DataMarket’s strategy perfectly: focus on the technical and business vision in-house while outsourcing everything else. In other words, focus on the things that will give us a competitive advantage, but leave everything else to those who know it best. Setting up and managing hardware and operating systems is surely not what we’ll be best at, so that task is best left to someone else.

This could be accomplished simply by using a typical hosting or colocation service, but cloud computing gives us one important extra benefit: it allows us to scale to meet the wild success we’re expecting, yet be economical if things grow in a somewhat milder manner.

Cloud computing will therefore no doubt play a big role in DataMarket’s setup. But there are different flavors of “the cloud”. The next question is: which flavor is right for this purpose?

I’ve been fortunate enough this year to hang out with people involved in two of the major cloud computing efforts out there: Force.com and Amazon Web Services (AWS). Additionally, I’ve investigated Google AppEngine to some extent, and I saw a great presentation on Sun’s Network.com at Web 2.0 Expo.

These efforts can roughly be put in two categories:

  1. Infrastructure as a Service (IaaS): AWS and Network.com
  2. Platform as a Service (PaaS): Google AppEngine and Force.com

IaaS puts at your disposal the sheer power of thousands of servers that you can deploy, configure and set to crunch on whatever you throw at them. Granted, the two efforts mentioned above are quite different beasts and not really interchangeable for all tasks: Network.com is geared towards big calculation jobs, such as rendering 3D movies or simulating nuclear chain reactions, whereas AWS is suitable for pretty much anything, but clearly geared towards running web applications or web services of some sort.

PaaS gives you a more restricted set of tools to work with. The provider has selected the technologies you’re able to run and given you libraries to access the underlying infrastructure. In the case of Force.com, your programming language is “almost Java”, i.e. Java with proprietary limitations on what you’re able to do, but you also get a powerful API that allows you to make use of data and services available on Salesforce.com. AppEngine uses Python and allows you to run almost any Python library. In a similar fashion to Force.com, AppEngine gives you API access to many of Google’s great technological assets, such as the Datastore and the User service.

In short, the PaaS approach gives you less control over the technologies and the details of how your application works, but gives you things like scalability and data redundancy “for free”. The IaaS approach gives you a lot more control, but you have to think about lower levels of the technical implementation than on PaaS. An example: on AWS you make an API call to fire up a new server instance and tell it what to do – on AppEngine, you don’t even know how many server instances you’re running, just that the platform will scale to make sure there are enough to meet your requirements.
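
To make the difference concrete, here is a minimal sketch of what “making an API call to fire up a new server instance” looks like on AWS, written with the boto3 Python SDK. The AMI ID and instance type below are placeholders, not anything specific to DataMarket’s setup:

    # Minimal sketch: launching (and later terminating) an EC2 instance on AWS.
    # The image ID and instance type are placeholders.
    import boto3

    ec2 = boto3.resource("ec2", region_name="us-east-1")

    instances = ec2.create_instances(
        ImageId="ami-12345678",      # placeholder AMI
        InstanceType="t3.micro",     # placeholder instance type
        MinCount=1,
        MaxCount=1,
    )
    instance = instances[0]
    instance.wait_until_running()    # block until the instance is up
    instance.reload()                # refresh attributes such as the public DNS name
    print(instance.id, instance.public_dns_name)

    # On a PaaS like AppEngine there is no equivalent call:
    # the platform decides how many instances run on your behalf.
    instance.terminate()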

So, which flavor is the right one?

The thinking behind PaaS is the right one. It takes care of more of the technical details and leaves the developers to do what they’re best at – develop the software. However, there is a big catch. If you decide to go with one of these platforms, you’re pretty much stuck there forever. There is no (remotely simple) way to take your application off Force.com or AppEngine and start running it on another cloud or hosting service. You might find this OK, but what if the company you’re betting your future on becomes evil? Or changes its business strategy? Or doesn’t live up to your expectations? Or wants to acquire you? Or – worse yet – doesn’t want someone else to acquire you?

You don’t really have an alternative – and you’re betting your life’s work on this thing.

Sure – if I were writing a Salesforce application or something with deep Salesforce integration, I’d go with Force.com. Likewise, if I were writing a pure web-only application, especially one with some connection to – say – Google Docs or Google Search, I’d be very tempted to use AppEngine. But neither is the case.

So the plan for DataMarket is to write an application that is ready to go on AWS (or another cloud computing service offering similar flexibility), but to stay as independent of their setup as possible. Not that I expect I’ll have any reason to leave, but there is always the black swan, and when it appears you’d better have a plan. This line of thinking even makes me skeptical of using their – otherwise promising – SimpleDB, unless someone comes up with an abstraction layer that allows SimpleDB queries to be run against a traditional RDBMS or another data storage solution that can be set up outside the Amazon cloud.
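
As an illustration of the kind of abstraction layer I have in mind, here is a minimal sketch in Python: the application talks only to a small storage interface, so whether the backend is SimpleDB, a traditional RDBMS or something else becomes a deployment detail. The interface and class names are hypothetical, just for illustration:

    # Minimal sketch of a storage abstraction: the application depends only on
    # the Store interface, so a SimpleDB-backed implementation could later be
    # swapped for this SQLite/RDBMS one (or vice versa) without touching app code.
    import sqlite3
    from abc import ABC, abstractmethod

    class Store(ABC):
        @abstractmethod
        def put(self, key: str, value: str) -> None: ...

        @abstractmethod
        def get(self, key: str) -> str | None: ...

    class SqliteStore(Store):
        def __init__(self, path: str = ":memory:"):
            self.conn = sqlite3.connect(path)
            self.conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")

        def put(self, key: str, value: str) -> None:
            self.conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))
            self.conn.commit()

        def get(self, key: str) -> str | None:
            row = self.conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
            return row[0] if row else None

    # A SimpleDbStore implementing the same two methods would be interchangeable;
    # the rest of the application never imports the cloud SDK directly.
    store: Store = SqliteStore()
    store.put("dataset:example", "hello")
    print(store.get("dataset:example"))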

Yesterday I raised this issue with Simone Brunozzi, AWS’ Evangelist for Europe. His – very political – answer was that this was a much requested feature and they “tended to implement those”, without giving any details or promises. I’ll be keeping an eye out for that one…

So, to sum up: when I’ve preached the merits of Software as a Service (think Salesforce or Google Docs) in the past, people have raised similar questions about trusting a third party with their critical data or business services. To soothe them, I’ve used banks as an analogy: 150 years ago everybody knew that the best place to keep their money was under their mattress – or, even better, in their personal safe. Gradually we learned that the money was actually safer in institutions that specialized in storing money, namely banks. And what was more – they paid you interest. The same is happening now with data. Until recently, everybody knew that the best place for their data was on their own equipment in the cleaning closet next to the CEO’s office. But now we have SaaS companies that specialize in storing our data, and they even pay interest in the form of the valuable services they provide on top of it.

So, your money is better off in the bank than under your mattress, and your data is better off at the SaaS provider than on your own equipment. Importantly, however, at the bank you have to trust that your money is always available for withdrawal at your convenience, and in the same way your data must be “available for withdrawal” from the SaaS provider – meaning that you can download it and use it in a meaningful way outside the service it was created on. That’s why data portability is such an important issue.

So, the verdict on the most suitable cloud service for DataMarket: my application is best off at AWS, as long as it’s written in a way that lets me “withdraw” it and set it up elsewhere. Using AppEngine or Force.com would be more like depositing my money in a bank that promptly converts it all into a currency that nobody else accepts.

I doubt such a bank would do well as a business!

Banking Math 101

On the 1st of April I founded a bank. It is an international financial institution, and for that reason the capital was of course in a foreign currency: 1,000 euros in five crisp 200-euro notes.

Due to a lack of initiative the lending business has yet to get going, and the notes are still on my nightstand – and now the first quarterly report is approaching. I understand the banks’ research departments are waiting with bated breath.

I report, of course, in krónur; like the other banking institutions, I wasn’t given permission to do otherwise – that is apparently dangerous for the national economy.

The outlook is simply glorious. The 1,000 euros I bought on the 1st of April for 120,340 krónur now turn out to be worth 131,810 krónur. A 9.5% return in 3 months. That’s equivalent to nearly 44% annualized – and they say there’s a recession!

Just now a friend of mine called from France. He asked about the banking business. Since he obviously doesn’t understand values in Icelandic krónur, I converted the profit – 11,470 krónur – into euros and could proudly tell him that the quarter’s profit was 87 euros, and that if things kept going this way I would have doubled the capital in just about two years. He congratulated me on having found my calling in life – you can always count on these Icelanders when it comes to banking.
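
For the curious, the figures above check out; here is the arithmetic in a few lines of Python (the amounts are the ones given in the post):

    # The quarter's "profit" is entirely an exchange-rate effect.
    eur = 1_000
    bought_isk = 120_340          # value of 1,000 EUR on April 1st
    now_isk = 131_810             # value of the same 1,000 EUR three months later

    quarterly = now_isk / bought_isk - 1
    annualized = (1 + quarterly) ** 4 - 1
    profit_isk = now_isk - bought_isk
    profit_eur = profit_isk / (now_isk / eur)   # convert back at today's rate

    print(f"{quarterly:.1%} per quarter, {annualized:.0%} annualized")  # ~9.5%, ~44%
    print(f"profit: {profit_isk} ISK = {profit_eur:.0f} EUR")           # ~87 EUR
    # ...and yet the nightstand still holds exactly 1,000 euros.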

After we hung up, I glanced at the beautifully flattened notes, still in the paper clip I got at Landsbankinn’s Laugavegur branch. Just to be sure, I took the clip off and counted.

Where are those 87 euros?

Starting up – that would be the fourth

Well, well, well.

I guess it’s some kind of a medical condition, but I’m leaving a great job at Síminn (Iceland Telecom) to start up a new company once again. This will be my fourth start-up, and I’m as excited as ever.

It will be a relatively slow migration, as I’ll be finishing off a few projects at Síminn over the next couple of months while setting up the new company, assembling a core team and refining the strategy for an idea that has been with me for some 18 months now. It has gradually become more and more focused, until I got so obsessed that I simply had to go for it.

Some would claim this is a horrible time to start a company, with a gloomy economic outlook and a lot of turmoil in the world of business and IT.

I – however – see this as an opportunity. Due to these very conditions, highly qualified people are looking for exciting new opportunities. This is especially true here in Iceland, where the financial sector has drained the market of IT talent for the last 3-4 years: the adventurous people who would really rather be working on something new and innovative have been tempted by the lucrative salaries and the “never-ending party” of our booming banks. Now the banks (and others) are scaling down and being a lot more careful, so these people – many of them not necessarily in danger of losing their jobs – might very well want to flex their start-up muscles again. Actually, I know for a fact that this is the case.

Secondly, booms and busts in the economy seem to come at intervals of 6-8 years. It takes at least 3-5 years to build a great company, so those starting now are likely to catch the next upswing without having to run too fast for their own good – as long as they can build sufficient income or find the venture capital to fund their operations in the meantime.

The concept I’m working on – and my situation in general – is such that I believe I can pull this off.

I’m not willing to share publicly – just yet – what the concept is, but I’ll surely blog regular updates as things progress.

Long nights and fun times – here I come 🙂

iPhone, 3G and battery life

The rumours are getting ever louder: the new version of the iPhone is coming out, and it has 3G capabilities.

Many have expressed concerns that this will seriously affect the iPhone’s already relatively short battery life. 3G chipsets certainly drain batteries faster than 2G or 2.5G (i.e. GSM, GPRS and EDGE) chips. Most 3G handsets – regardless of brand – have short battery life compared to their lower-generation counterparts, and most 3G handset owners are familiar with the heat they radiate, especially during heavy data – or even just voice – usage.

There has been talk of new OLED displays, which would decrease the power drawn by the other main power hog – the display – and thereby leave some room for the added power consumption of the 3G chipset. And obviously the new iPhone’s battery won’t be inferior to the current model’s.

Before the first version of the iPhone came out, I made a few predictions that turned out to be pretty accurate, so I’ll give it another shot. During a chat with my colleague – Chad Nordby – last week, we came up with another way to dramatically increase the battery life: turn 3G on only when high bandwidth is needed.

The phone would stay on the 2G network during normal operation. There is no need to drain the battery on UMTS (i.e. 3G) communications while idly waiting for a call. When the user activates the device, 3G could be turned on and ready for high-speed browsing – much the same way as is currently done for the WiFi radio. The same might also work for background communication: 3G could be turned on to fetch large payloads, such as a big email attachment, while minor status updates and mail checks stay on the EDGE network.
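
Sketched as code, the policy we had in mind is roughly the following. This is purely illustrative logic in Python – not how any actual baseband firmware works – and the threshold is made up:

    # Illustrative sketch of the proposed radio policy: stay on 2G by default,
    # switch the 3G radio on only when high bandwidth is actually needed.
    # The threshold below is a made-up number for illustration.
    LARGE_TRANSFER_BYTES = 256 * 1024   # e.g. a big email attachment

    def choose_radio(user_active: bool, pending_bytes: int) -> str:
        if user_active:
            return "3G"   # user has woken the device: enable UMTS for high-speed browsing
        if pending_bytes >= LARGE_TRANSFER_BYTES:
            return "3G"   # large background payload: worth the extra power
        return "2G"       # idle waiting, small mail checks: stay on GSM/EDGE

    # Example: an idle phone checking mail vs. the user opening the browser
    print(choose_radio(user_active=False, pending_bytes=4_096))   # -> 2G
    print(choose_radio(user_active=True,  pending_bytes=0))       # -> 3G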

The handover technology from 2G to 3G (or technically from GSM to UMTS) networks is already there and works quite well. When you drive out of 3G coverage while talking on the phone, even the call itself is handed from one technology to the other instantly without being dropped.

So there’s nothing stopping them, and since such a small fraction of time is spent using data services anyway (compared to idle time), this would fully address the battery concerns. So my prediction is that the new iPhone will work this way. Obviously any handset manufacturer could use this method, but Apple is probably the only one crazy enough to actually be thinking along these lines.

The sheep might catch on later 🙂

hjalli.com at a new address

I have vanished into the cloud.

I’ve decided to stop running my own web server and blog software and move it instead to hosted care at WordPress.com. That way I no longer have to take care of updating the blog software, maintaining the spam protection and making sure no unscrupulous parties exploit the latest security holes in the systems involved. That’s certainly worth 700 krónur a year 🙂

At the same time I also moved the hjalli.com email over to Google Apps – there I can use all the Google services, such as GMail, Google Calendar and Google Docs, on my own domain. Not bad.

Until now this has been hosted by the wizards at Basis, and I thank them for excellent service. I will keep my sandbox (the experimental server where I tinker with programming, like the How far… service) with them.

Otherwise, this post is mostly a test post to see whether everything is in order. Some old URLs will break in the migration, but the RSS feeds should remain unchanged. If you come across something I’ve broken in the commotion, tips are much appreciated.

Two entertaining technology stories

These days I’m lucky enough to be attending a conference called Web 2.0 Expo in San Francisco. The first day was yesterday, and a lot of interesting things came up. With all due respect to the other speakers, the talk by Clay Shirky, a professor at NYU, stood out. Shirky is a sort of information-technology sociologist, i.e. he spends a lot of time thinking about the changes society is undergoing alongside the development of information technology.

Here are two highly entertaining and remarkable stories he told in his talk yesterday – loosely retold here in my own words:

Shirky was being interviewed by a television journalist who was curious about social media like Wikipedia. Shirky started telling her the story of the entry on Pluto and how the Wikipedia community handled it when Pluto was demoted from planet to a rock in an unusual orbit around the sun. This was quite a big deal; people fought over the changes but eventually reached a conclusion.

The reaction of the journalist, whom Shirky had intended to impress with this story, was a kind of “loser stare”, silence, and then: “Where do people find the time for this sort of thing?”

At that point Shirky apparently snapped and pointed out to her that nobody working in television had any business asking that question. And then he started calculating: the total time that has gone into building Wikipedia – all articles in all languages – is now about 100 million working hours. That is certainly a big number – it is, after all, a magnificent feat in most respects – but put in the context of television viewing it shrinks to almost nothing. Last weekend alone, Americans spent 100 million hours just watching advertisements on television! In a year, Americans watch 200 billion hours of television, and the world as a whole about 1,000 billion hours. That’s 10,000 Wikipedias – per year!

The time the journalist was asking about comes from declining television viewing, and Shirky asked people to imagine what we might expect to see if the current trend of declining TV viewing keeps yielding similar work.

The other story is much sweeter, but it also points to an interesting change.

A father and his four-year-old daughter are watching a Disney cartoon on television. At some point in the film, the daughter feels that rather little is happening. She gets up and crawls behind the television set.

The father expects something amusing, like her checking whether she can reach the cartoon characters that must somehow be inside or behind the box, but instead she starts rummaging through the cables.

The father asks her what she is doing.

“Looking for the mouse” was, of course, the answer he got.

Four-year-olds, you see, know that a screen without a mouse – or at least some way of interacting with it – is a broken screen. This is the generation we’re raising, and it probably says a lot about how the consumption of information and entertainment will change in the years to come.

Just thought I’d share this with you 🙂

Adventures in copyright: Open access, data and wikis

I’ve just had a very interesting experience that sheds light on some important issues regarding copyright, online data and crowdsourced media such as wikis. I thought I’d share the story to spark a debate on these issues.

For a couple of years I’ve worked on and off on a simple web-based system for maintaining and presenting a database of inflections of Icelandic words: Beygingarlýsing íslensks nútímamáls, or “BÍN” for short. The data is available online, but the maintenance system is used by an employee of the official Icelandic language institute, Stofnun Árna Magnússonar í íslenskum fræðum. She has been gathering this data and deriving the underlying structure over a period spanning a decade or more. As you can imagine, BÍN is an invaluable source for a variety of purposes, ranging from foreigners learning Icelandic to the implementation of various language technology projects.

Now before I go any further I think it’s important to say that I’m a big supporter of open data. In fact, one of the few things I’ve ever gotten involved in actively lobbying for is open access to data in the public sector (article unfortunately in Icelandic).

Back to the story. A couple of days ago I got a call from the aforementioned BÍN administrator. She’d gotten a tip that someone was systematically copying data from BÍN into the Icelandic Wiktionary and asked me to look into it.

I started going through the web server log files, and sure enough: comparing the log files to the new-entries page on Wiktionary, the pattern was obvious. A search for a word in BÍN and, 2-3 minutes later, a new entry in Wiktionary for that same word – a pattern consistent with someone copying the data by hand. It went back at least a few days, probably a lot longer.
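
For the curious, the check itself is simple. Below is a hedged sketch of the idea in Python; the file names, the log format and the three-minute window are illustrative assumptions, not the actual BÍN setup:

    # Sketch: flag cases where a BÍN search for a word is followed, within a few
    # minutes, by a new Wiktionary entry for the same word. File names, the input
    # format and the time window are illustrative assumptions.
    import csv
    from datetime import datetime, timedelta

    WINDOW = timedelta(minutes=3)

    def load_events(path):
        # expected columns: word, iso_timestamp  (e.g. "hestur,2008-04-20T14:03:11")
        with open(path, newline="", encoding="utf-8") as f:
            return [(word, datetime.fromisoformat(ts)) for word, ts in csv.reader(f)]

    bin_searches = load_events("bin_searches.csv")           # extracted from the access log
    wiki_entries = dict(load_events("wiktionary_new.csv"))   # word -> entry creation time

    for word, searched_at in bin_searches:
        created_at = wiki_entries.get(word)
        if created_at and timedelta(0) <= created_at - searched_at <= WINDOW:
            print(f"{word}: searched {searched_at:%H:%M}, entry created {created_at:%H:%M}")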

In light of this I blocked access from the IP addresses these search requests originated from and redirected them to a page that – in no uncertain terms – stated our suspicion of abuse and listed our email addresses so they could contact us to discuss it.

Now – BÍN is clearly marked as copyrighted material, and as the rights holder of the content, the institute has every right to control access to and usage of its data. The inflections of a single word are obviously not intellectual property, but any significant part of a collection of nearly 260,000 such words definitely is.

As said before, I’ve personally been advocating open access to all public sector data, but I also know that this is a complicated issue – one that goes far beyond the opinions of the people working with individual data sets. This institute, for example, must obey the rules set for it by the Ministry of Education, and changing those rules is something that must be taken up at an entirely different level.

The Wiktionary users in question have since contacted us and stated that they were not copying the content, merely referencing it when proofreading their own entries. I have no reason to doubt that, but the usage pattern was indistinguishable from a manual copying process, which led to the suspicion and the blocking of their addresses.

We’ve since exchanged several emails, and hopefully we’ll find a way for all parties to work together. It would be fantastic if the enthusiasm and great work being put into building the Wiktionary could be joined with the professional experience and scientific approach of the language institute to build a common source with clear and open access.

At the end of the day, open access to fundamental data like this will spur innovation and general prosperity, but as this story shows, that is not something that will happen without mutual respect and consensus on the right way forward.

Updated Apr. 24: Discussion about this incident is also taking place here and here (both are at least partly in Icelandic).

Recognition for the Africa photos

Best of Uganda and Rwanda: Magga has been hard at work lately putting together a Tabblo with our photos from Uganda and Rwanda.

It comes in two parts:

The best photos went live yesterday, and lo and behold: Tabblo picked them as the “Tabblo of the day” (see the front page – logged-in users)!

So now we’re famous photographers and Magga a famous Tabblo creator…