
Orðið.is: A Weight on the Scales of Open Data

The results of the "Þú átt orðið" prize competition were announced at lunchtime today. The competition was organized by the company Já and the Árni Magnússon Institute for Icelandic Studies.

Background of the competition

The background, in short, is that the development team at Já and its predecessor – the company Spurl, which Já acquired a few years ago* – have for years collaborated with the University of Iceland's Institute of Lexicography (now part of the Árni Magnússon Institute) in the field of language technology. That collaboration has centered on the Database of Modern Icelandic Inflection (Beygingarlýsing íslensks nútímamáls), a dataset containing the inflectional forms of more than 270,000 Icelandic words.

Já has put this dataset to use in various ways, for instance to ensure that searches on Já.is find Icelandic names, street names, places and companies regardless of which inflectional form the query is written in (are you searching for "Laugavegi" or "Laugavegur", "sýslumaður" or "sýslumanninum", and so on), and to generate suggestions when search terms are misspelled: "Your search for 'laugvegur' returned no results. Did you mean 'Laugavegur'?" and the like.

People at Já have therefore long known what kind of value lies in this data, and we were fairly sure that this value would only fully come to light once access to the data was opened up further. That is how the idea arose that Já would sponsor the Árni Magnússon Institute and thereby enable it to drop the fees that had until now been charged for using the data. That came to pass, and to spur inventive individuals into action it was also decided to launch this prize competition.

Open access leads to innovation

First prize in the competition went to the word game Orðavinda.

In short, the experiment exceeded all expectations. Twenty promising projects were submitted by the deadline. The most delightful thing about them was how varied they were: the four prize-winning projects ranged from a new, grammatically interesting approach to word classification to computer games, and from a useful tool for web users to a "starter pack" for developers who want to put this data to other good uses.

And this was hopefully just the beginning. I am convinced that far more people than those who entered the competition will now make use of this data in all kinds of ways, and I actually know of several such projects already under way.

This outcome strengthened my conviction even further about how much value can be unleashed by opening up access to datasets held by public bodies. Treasures like this one lie underused, or even unused, at institutions and companies all over the country, but could become new products, new opportunities and even new knowledge if the Open Data approach were allowed to prevail.**

Hopefully the good parliamentary resolution to that effect, passed before the turn of the year, will soon help get these matters moving here in Iceland.

– – –

* I was a founder and one of the principal owners of Spurl back in the day.
** Strictly speaking, the data in the Inflection Database is not entirely "open" according to the Open Data definition, but it is certainly more open than it was.

White male seeks single search engine

As I store more and more data in different online applications, the need for a search engine that can search across them all becomes apparent.

I have photos on Flickr, blog posts here on my blog, bookmarks in Spurl.net, contribute to several project Wikis, have written articles for a number of different online publications, edit documents in Google Docs & Spreadsheets, enter data into DabbleDB and try to store most other documents in an online folder using WebDAV.

I’m convinced that online apps will replace most if not all desktop applications, but with all my data scattered all over the place, an obvious drawback is the lack of searchability.

Most of the applications mentioned above have RSS feeds, APIs or other relatively open and simple ways to get the data out. So how about one app to “search them all and forever bind them”?

Here’s how it would work: type in a search phrase and get back a list of results linking to the individual entries or items in their respective applications – thumbnails displayed for extra credit.

Start with supporting the most common Web 2.0 services and provide a way to easily add support for additional ones; ensure that I can keep my private data private and – voila – you have at least one paying customer.
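
To make the idea a little more concrete, here is a minimal sketch of how such an aggregator might work; the feed URLs are placeholders and the parsing handles plain RSS 2.0 only:

```python
"""Hypothetical sketch: one search across several web apps via their feeds.

The feed URLs below are placeholders; each real service would expose
bookmarks, posts or documents through RSS/Atom feeds or an API.
"""
import re
import urllib.request
import xml.etree.ElementTree as ET

FEEDS = {
    "blog": "https://example.com/blog/feed.xml",      # placeholder URLs
    "bookmarks": "https://example.com/bookmarks/rss",
    "photos": "https://example.com/photos/rss",
}

def fetch_items(service, url):
    """Download one RSS 2.0 feed and yield (service, title, link) tuples."""
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    for item in tree.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        yield service, title, link

def build_index(feeds):
    """Tiny inverted index over item titles: word -> set of item ids."""
    items, index = [], {}
    for service, url in feeds.items():
        try:
            for record in fetch_items(service, url):
                item_id = len(items)
                items.append(record)
                for word in re.findall(r"\w+", record[1].lower()):
                    index.setdefault(word, set()).add(item_id)
        except (OSError, ET.ParseError) as exc:   # placeholder feeds won't resolve
            print(f"skipping {service}: {exc}")
    return items, index

def search(items, index, query):
    """Items whose titles contain every query word, linking back to the app."""
    word_sets = [index.get(w, set()) for w in query.lower().split()]
    hits = set.intersection(*word_sets) if word_sets else set()
    return [items[i] for i in sorted(hits)]

if __name__ == "__main__":
    items, index = build_index(FEEDS)
    for service, title, link in search(items, index, "iceland photos"):
        print(f"[{service}] {title} -> {link}")
```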

Where do I sign up?

Spurl launches an Icelandic search engine

Originally posted on the Spurl.net forums on November 1st 2005

Spurl.net just launched an Icelandic search engine with the leading local portal / news site mbl.is. Although Iceland is small, mbl.is is a nice, mid-size portal with some 210,000 unique visitors per week – so this has significant exposure.

The search engine is called Embla – it’s a pun on the portal’s name (e mbl a) and the name of the first woman created by the Gods in Norse Mythology.

An Icelandic search engine may seem a little off topic for many of you Spurl.net users, but it is relevant for several reasons:

  • We have been busy for some 4 months now, building the next version of Zniff. Embla uses this new code, and the same code will make its way into both Spurl.net (for searching your own spurls) and Zniff (for searching the rest of the Web) in the coming weeks. The engine is now reliable, scalable, redundant and compliant with a lot of other .com buzzwords – with more relevant search results and sub-second response times on most queries.
  • This commercial arrangement with mbl.is (hopefully the first of many similar) gives us nice financial footing for further development of Spurl.net and related products.
  • Embla uses information from Spurl.net users. Even though only a small portion of our users are Icelandic, we’re using their information as one of the core elements for ranking the search results – and to good effect. This strengthens our belief in the “human search engine” concept, both for other markets and on the international playing field.

I plan to write more on the search engine itself on my blog soon, but just wanted to mention briefly that with Embla we’re also breaking new ground in another territory: Embla “knows” Icelandic. Most search engines and technologies are of English origin, and from a search standpoint, English is a very simple language. Most English words have only a couple of word forms (such as “house” and “houses”) and while some search engines use stemming (at least sometimes), it doesn’t matter all that much for English. Many languages – including Icelandic – are far more complicated. Some words can – hypothetically – have more than 100 different word forms. In reality a noun will commonly have about 12-16 unique word forms. Now THIS really matters. The difference in the number of returned results is sometimes 6 to 10-fold, and it improves relevance as well.

We have built Embla so that it searches for all forms of the user’s search words. We also offer spelling corrections for Icelandic words, based on the same lexicon. The data for this comes from the Institute of Lexicography, University of Iceland, but the methodology and technology is ours all the way and can (and will) be used for other languages as well. Quite cool stuff actually – as mentioned, I will write more about that on the blog soon.
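
For illustration only (this is not Embla’s actual code), here is a toy sketch of the general idea, assuming a lexicon that maps each headword to all of its inflected forms, much as the Institute of Lexicography’s data does:

```python
"""Sketch of inflection-aware search (illustrative, not Embla's actual code).

LEXICON is a hypothetical stand-in for the real data, which maps each
headword to its full set of Icelandic inflected forms.
"""

# headword -> all inflected forms (a tiny illustrative subset)
LEXICON = {
    "hestur": ["hestur", "hest", "hesti", "hests",
               "hestar", "hesta", "hestum"],               # "horse"
    "kona": ["kona", "konu", "konur", "kvenna", "konum"],  # "woman"
}

# reverse lookup: any inflected form -> its headword
FORM_TO_HEADWORD = {form: head for head, forms in LEXICON.items() for form in forms}

def expand_query(query):
    """Replace each query word with the set of all forms sharing its headword."""
    expanded = []
    for word in query.lower().split():
        head = FORM_TO_HEADWORD.get(word)
        forms = set(LEXICON[head]) if head else {word}  # unknown words pass through
        expanded.append(forms)
    return expanded

def matches(document_words, expanded_query):
    """True if, for every query word, some inflected form occurs in the document."""
    doc = set(document_words)
    return all(forms & doc for forms in expanded_query)

# Searching for "hesti" (dative singular) also finds a page that says "hestar".
doc = "margir hestar eru í haganum".split()
print(matches(doc, expand_query("hesti")))   # True
```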

Google and user driven search indexes

Needless to say, I am a firm believer that the next big steps in web search will come from involving the users more in the ranking and indexing process.

The best search engines on the web have always been built around human information. In the early days, Yahoo! was the king, first based on Jerry Yang’s bookmark collection and later on herds of human editors categorizing the web, building on top of Jerry’s collection. The web’s exponential growth soon broke this model, and for a year or so the best ways to search the web involved trying to find one or two useful results among full pages of porn and casino links on AltaVista or HotBot.

Then Google came along and changed everything with their genius use of the PageRank algorithm, assuming that a link from one web page to another was a vote for the quality and relevance of the content of the linked page. You know the story. Today, all the big engines use variations of this very same core methodology, obviously with a lot of other things attached to fight spam and otherwise improve the rankings.

Next steps: From page content to user behavior

But the next step is due and the big guys in the search world are really starting to realize this. A lot of the recent innovations coming from the search engines – especially from Google – are targeted directly at gathering more data about user behavior that they can in turn use to improve web search.

Among the behavior information that could be used for such improvements are:

  • Bookmarks: A bookmark is probably the most direct way a user can say “I think this page is of importance”. Additionally, users usually categorize things that they bookmark, and some browsers and services – such as my very own Spurl.net – make bookmarks more useful by allowing users to add various kinds of meta data about the pages they are bookmarking. More on the use of that in my recent “Coming to terms with tags” post.
  • History: What pages has the user visited? The mere fact that a user visited a page does not say anything about it other than that the page owner somehow managed to lure the user there (usually with good intentions). The fact that a page is frequently visited (or even visited at all) does, however, tell the search engine that this page is worth visiting to index. This ensures that fresh and new content gets discovered quickly and that successful black hat SEO tricks (in other words, “search engine spam”) don’t go unnoticed for long.
  • Navigational behavior: Things such as:
    • which links users click
    • which links are never clicked
    • how far down users scroll on a page
    • how long a user stays on a page
    • whether users mainly navigate within the site or visit external links
    • etc.

All of these things help the search engine determine the importance, quality and freshness of the page, or even of individual parts of the page.

All of these things and a lot more are mentioned in one way or another in Google’s recent patent application. Check this great analysis of the patent for details. Here’s a snip about user data:

  • traffic to a document is recorded and monitored for changes (possibly through the toolbar, or desktop searches of cache and history files) (sections 34, 35)
  • user behavior on websites is monitored and recorded for changes (click-throughs, back button, etc.) (sections 36, 37)
  • user behavior is monitored through bookmarks, cache, favorites and temp files (possibly through the Google toolbar or desktop search) (section 46)
  • bookmarks and favorites are monitored for both additions and deletions (sections 0114, 0115)
  • user behavior for documents is monitored for trend changes (section 47)
  • the time a user spends on a website may be used to indicate a document’s quality or freshness (section 0094)
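
Purely as an illustration (nothing here is taken from the patent), a rough sketch of how behavior signals like these might be folded into a page score; the signal names and weights are my own invention:

```python
"""Illustrative only: blend user-behavior signals into a single page score."""
import math
from dataclasses import dataclass

@dataclass
class BehaviorSignals:
    bookmarks: int             # users who bookmarked the page
    visits: int                # visits observed via toolbar / proxy / logs
    avg_seconds_on_page: float
    clickthrough_rate: float   # 0..1: clicks on this result / impressions
    back_button_rate: float    # 0..1: users who immediately bounced back

def behavior_score(s: BehaviorSignals) -> float:
    """Invented weighting: reward engagement, punish quick bounces."""
    score = 0.0
    score += 3.0 * math.log1p(s.bookmarks)        # bookmarks are strong votes
    score += 1.0 * math.log1p(s.visits)           # traffic hints at freshness/worth
    score += 0.5 * math.log1p(s.avg_seconds_on_page)
    score += 2.0 * s.clickthrough_rate
    score -= 2.0 * s.back_button_rate             # pogo-sticking is a bad sign
    return score

# A page users bookmark and dwell on outranks one they bounce straight off.
good = BehaviorSignals(bookmarks=40, visits=5000, avg_seconds_on_page=90,
                       clickthrough_rate=0.30, back_button_rate=0.05)
spammy = BehaviorSignals(bookmarks=0, visits=5000, avg_seconds_on_page=4,
                         clickthrough_rate=0.02, back_button_rate=0.70)
print(behavior_score(good) > behavior_score(spammy))   # True
```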

Google’s ways of gathering data

And how exactly will Google gather this data? Simple: by giving users useful tools such as the Toolbar, Deskbar and Desktop Search that improve their online experience and, as a side effect, provide Google with useful information such as the above.

I’ve dug through the privacy policies for these tools (Toolbar, Desktop Search, Deskbar) and it’s not easy to see exactly what data is sent to Google in all cases. The Toolbar does send them the URL of every visited page in order to provide the PageRank and a few other services. It is opt-out but on by default, so they have the browser history of almost every user that has the toolbar set up. Other than this, details are vague. Here are two snips from the Desktop Search privacy FAQ:

“So that we can continuously improve Google Desktop Search, the application sends non-personal information about things like the application’s performance and reliability to Google. You can choose to turn this feature off in your Desktop Preferences.”

“If you send us non-personal information about your Google Desktop Search use, we may be able to make Google services work better by associating this information with other Google services you use and vice versa. This association is based on your Google cookie. You can opt out of sending such non-personal information to Google during the installation process or from the application preferences at any time.”

It sounds like error and crash reporting, but it does not rule out that other things – like bookmarks, time spent on pages, etc. – are also sent. None of the above-mentioned products provides a definitive, detailed list of the things that are sent to Google.

Next up: Google delivering the Web

And then came the Web Accelerator. Others have written about the privacy, security and sometimes actually damaging behavior of the Web Accelerator, but as this blog post points out, the real reason they are providing it is to monitor online behavior by becoming a proxy service for as many internet users as possible.

I’m actually a bit surprised by the Web Accelerator, for several reasons. Google is historically known for the excellent quality of their products, but this one seems to have been publicly released way too early, judging from the privacy, security and bug reports coming in from all over the place. Secondly, the value is questionable. I’m all for saving time, but if a couple of seconds per hour open up a range of privacy and security issues, I’d rather just live a few days shorter (if I save 2 seconds every hour for the next 50 years, that amounts to about 10 days) 🙂

The third and biggest surprise comes from the fact that I thought Google had already bought into this kind of information with their fiber investments and the acquisition of Urchin. Yes, I’m pretty sure that the fiber investment was made to be able to monitor web traffic over their fat pipes for the reasons detailed above – even as a “dumb” bandwidth provider. Similar things are already done by companies such as WebSideStory and Hitwise that buy traffic information from big ISPs.

Urchin’s statistics tools can be used for the same purpose in various ways – I’m pretty sure Google will find a way to make it worth an administrator’s trouble to share his Urchin reports with Google for some value-add. So why bother with the Web Accelerator proxy play?

Google already knows the content of every web page on the visible web, now they want to know how we’re all using it.

Good intentions, bad communication

Don’t get me wrong. I’m sure that gathering all this data will dramatically improve web search, and I marvel at the clever technologies and methods in use. What bothers me a bit is that Google is not coming clean about how they will use all this data, or even about the fact that – and the extent to which – they are gathering it.

I can see only two possible reasons for this lack of communication. The first is that they don’t want to tell their competitors what they are doing. That can’t be the case – Yahoo, MSN and Ask Jeeves surely have whole divisions that do nothing but reverse engineer Google products, analyze their patent filings and so on, so keeping quiet for that reason would just be naive. The second reason I can think of is that they are afraid users would not be quite as willing to share this information if it became clearly visible how massive these data gathering efforts have become.

I’m not an extremely privacy-concerned person myself, but I respect, understand and support the opinions of those who are. The amount of user data being gathered will gradually cause more people to wonder what would happen if Google turned evil.

– – –

Hjalmar Gislason is the founder of Spurl.net. The above reflects some of the things his company is doing with the Spurl.net bookmarking system and the Zniff search engine.

Coming to terms with tags: folksonomies, tagging systems and human information

Over the past few years we’ve seen a big movement from hierarchical categories to flat search. Web navigation and email offer prime examples: Yahoo’s Directories gave way to Google’s search, and Outlook’s folders are giving way to the search-based Gmail. It’s far more efficient to come up with and type in a few relevant terms for the page or subject you’re looking for than it is to navigate a hierarchy – a hierarchy that may even have been built by somebody else, somebody who probably has a different mindset and therefore categorizes things differently than you would.

Lately, tags have become the hot topic when discussing information organization (or – some might say – the lack thereof). But tagging and flat search are really just two sides of the same coin.

The main reason tagging is so useful is that it resembles and improves search. It assigns relevant terms to web pages, images or email messages, making it easier to find them by typing in one or more of those relevant terms or selecting them from a list. Furthermore, tag-based systems assign YOUR terms to a resource – based on your mindset, not somebody else’s – making them extra useful for managing one’s own information. The more terms you assign to the resource initially, the more likely it is that one of them comes to mind when you’re looking it up again later on.

But you also want to find new information, not only what you’ve already found or seen. And again, tagging comes to the rescue. If somebody else has found a term relevant for a page, chances are you will too. The more that person’s mindset (read: use of tags, range of interests and other behavior) resembles yours, the more likely their terms are to be useful to you. Furthermore, the more people agree on assigning the same tag to a resource, the more relevant we can assume that tag to be for the resource.
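
A toy sketch of that idea, with made-up data structures: a resource’s score for a tag adds up the taggers’ votes, each weighted by how similar that tagger is to the searcher:

```python
"""Sketch: score resources for a tag, weighting taggers by similarity to me."""
from collections import defaultdict

# user -> {resource -> set of tags}   (hypothetical sample data)
TAGGINGS = {
    "me":    {"url1": {"search", "tags"}},
    "alice": {"url1": {"search", "folksonomy"}, "url2": {"search", "spam"}},
    "bob":   {"url2": {"casino"}},
}

def similarity(user_a, user_b):
    """Jaccard overlap of the two users' tag vocabularies."""
    tags_a = {t for tags in TAGGINGS[user_a].values() for t in tags}
    tags_b = {t for tags in TAGGINGS[user_b].values() for t in tags}
    if not tags_a or not tags_b:
        return 0.0
    return len(tags_a & tags_b) / len(tags_a | tags_b)

def resources_for_tag(me, tag):
    """Resources carrying `tag`, scored by agreement and tagger similarity."""
    scores = defaultdict(float)
    for user, taggings in TAGGINGS.items():
        weight = 1.0 if user == me else similarity(me, user)
        for resource, tags in taggings.items():
            if tag in tags:
                scores[resource] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(resources_for_tag("me", "search"))   # url1 ranks above url2
```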

Tags don’t tell the whole story

Tags are wonderful, but they won’t cure cancer all by themselves.

When tagging a resource – especially in these early days of tagging systems – it is quite likely that you are the first person to tag it. Think as hard as you can and you will still not come up with all the terms that you would nevertheless agree are highly relevant to the subject – terms you might even think of later when you want to look it up again. It is therefore important to use other methods as well, where possible, to help assign relevant terms to the resource.

Firstly, there are other ways than tagging to help users identify relevant terms. By allowing them to highlight the snips from the text that they find most interesting, you can give extra weight to that text. Descriptions of the resource, written in free form, are another way. Here you can see an example of a page that has been highlighted using tags, snips and other information from Spurl.net users (another example here).

Secondly, you have the author’s own information, and that should (usually) account for something. In the case of a web page you have the page text and its markup. The markup identifies things that the page author found to be important, among them:

  • The page title
  • Headings
  • Bolded and otherwise emphasized text
  • Meta keywords, descriptions and other meta information
  • Position on the page (text near the top is usually more relevant than text near the bottom)

Analyzing all this text information (both the user’s and the author’s) in relation to expected word frequency identifies more relevant terms than the tags alone.
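
As a small illustration of that kind of analysis, here is a sketch with made-up field weights and a toy background-frequency table standing in for proper corpus statistics:

```python
"""Sketch: pick relevant terms by weighting page fields and comparing the
observed term frequency against an expected (background) frequency."""
import math
import re
from collections import Counter

# Made-up weights per field (author markup plus user-supplied text).
FIELD_WEIGHTS = {"title": 5.0, "headings": 3.0, "emphasis": 2.0,
                 "user_snips": 4.0, "user_description": 3.0, "body": 1.0}

# Toy background frequencies; a real system would use corpus statistics.
EXPECTED = {"the": 0.06, "and": 0.03, "search": 0.001, "tags": 0.0005}
DEFAULT_EXPECTED = 0.0001

def relevant_terms(fields, top_n=5):
    """Top terms scored by weighted frequency relative to expected frequency."""
    weighted, total = Counter(), 0.0
    for field, text in fields.items():
        w = FIELD_WEIGHTS.get(field, 1.0)
        for word in re.findall(r"\w+", text.lower()):
            weighted[word] += w
            total += w
    scores = {
        term: (count / total) * math.log((count / total) /
                                         EXPECTED.get(term, DEFAULT_EXPECTED))
        for term, count in weighted.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

page = {
    "title": "Coming to terms with tags",
    "headings": "Tags and search",
    "user_snips": "tagging resembles and improves search",
    "body": "the discussion about tags and the future of search and the web",
}
print(relevant_terms(page))   # 'tags' and 'search' come out on top
```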

And there are even more ways to get relevant terms: link analysis (words that appear in and around links to the document, along with the relevancy of the linking documents), and so on – basically all the methods that Google-like engines already use when indexing pages with spiders and other automated means.

Relevant terms are a consensus

That said, relevant terms for a resource must be seen as a consensus. A consensus between the user himself, other users, the page author and the “rest of the web”.

If all of these players are roughly in agreement, all is well and the term relevancy can be trusted. If the information from the page’s author does not agree at all with the information from the users, it is likely that the author is trying to trick somebody – most likely an innocent search engine bot that wanders by – and the author’s information should be given less weight than otherwise.

When searching for something, you don’t want to end up empty handed just because nobody had thought of tagging a resource with the terms that you entered. The relevant terms must be seen as layers of information, where your tags and other terms that you have assigned play a central role, then comes information from other users – giving extra weight to those users that are “similar” to you (based on comparing previous behavior), then the page author and finally the “rest of the web”.
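
Here is a toy sketch of that layering, with invented weights; each layer contributes to a term’s relevancy, and the author layer is discounted when it finds no support among the user layers:

```python
"""Sketch: relevancy of a term for a resource as a layered consensus."""

# Invented layer weights: my own terms count most, the wider web least.
LAYER_WEIGHTS = {"me": 4.0, "similar_users": 3.0, "other_users": 2.0,
                 "author": 1.5, "rest_of_web": 1.0}

def term_relevancy(term, layers):
    """`layers` maps layer name -> set of terms that layer assigned."""
    user_layers = ("me", "similar_users", "other_users")
    users_agree = any(term in layers.get(l, set()) for l in user_layers)

    score = 0.0
    for layer, weight in LAYER_WEIGHTS.items():
        if term not in layers.get(layer, set()):
            continue
        # If the author's terms find no support among users, trust them less.
        if layer == "author" and not users_agree:
            weight *= 0.2
        score += weight
    return score

layers = {
    "me": {"folksonomy", "tags"},
    "similar_users": {"tags", "search"},
    "other_users": {"tags"},
    "author": {"tags", "viagra"},        # one honest term, one spammy one
    "rest_of_web": {"search"},
}
print(term_relevancy("tags", layers))     # 10.5: full consensus
print(term_relevancy("viagra", layers))   # 0.3: author-only, heavily discounted
```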

Trust systems

In addition, you need trust systems to prevent the search from being “gamed”. Trust should be assigned to all players in the chain: users, authors and linking web pages. If a long-time, consistent and otherwise well-behaving user assigns a tag to a page, we automatically give it high weight. If a new user includes a lot of irrelevant information on his first post, his trust is ruined and will take a long time to be regained.

Users gradually build trust within the system and if they start to misbehave, it affects everything that they’ve done before. Same goes for web page authors and linking pages. If your information is in no way in line with the rest of the consensus, your information is simply ignored. Providing relevant information is the only way to improve your ranking. (more in this post on Spurl.net’s user forums).
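
A small sketch of such a trust score, with invented parameters: trust is earned slowly through consistent good behavior, a single spammy action costs a lot, and a contribution’s weight in the consensus is simply its author’s current trust:

```python
"""Sketch: a per-user trust score that is slow to earn and quick to lose."""

class UserTrust:
    def __init__(self):
        self.trust = 0.1          # new users start with very little influence

    def record_good_action(self):
        """Consistent, relevant contributions nudge trust upward."""
        self.trust = min(1.0, self.trust + 0.02)

    def record_spammy_action(self):
        """Irrelevant or gamed contributions cost far more than they earn."""
        self.trust = max(0.0, self.trust * 0.25)

    def contribution_weight(self):
        """How much this user's tags count in the consensus."""
        return self.trust

veteran, newbie = UserTrust(), UserTrust()
for _ in range(40):               # a long history of well-behaved tagging
    veteran.record_good_action()
newbie.record_spammy_action()     # first post is full of irrelevant tags

print(round(veteran.contribution_weight(), 2))   # 0.9
print(round(newbie.contribution_weight(), 3))    # 0.025
```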

On the author layer, this could even resurrect the long-dead keyword and description meta-tags, which are – after all – a good idea that might have worked had they not been so blatantly misused.

Coming to terms with tags

The “folksonomy” discussion, so far, has been largely centered on user assigned tags and tagging systems, but these are just a part of a larger thing.

Relevant terms – as explained above – are really what we are after, and it just so happens that tags are the most direct way to assign new relevant terms to a resource. But they are far from the only one. Tagging is not “the new search”; it is one of many ways to improve current search methods. A far bigger impact is coming from the fact that we’re starting to see human information – tags being one of many possible sources – brought into the search play.

– – –

Hjalmar Gislason is the founder of Spurl.net. The above reflects some of the things his company is doing with the Spurl.net bookmarking system and the Zniff search engine.

Is search the “next spam”?

I was a panelist at a local conference on Internet marketing here in Iceland last Friday.

As you can imagine, a lot of the time was spent talking about marketing through search engines: both how to use paid-for placement (i.e. ads) and how to optimize pages to rank better in the natural search results.

The latter, usually called “search engine optimization” or just SEO, has a somewhat lousy reputation. More or less everybody who owns a domain has gotten an email titled something like “Top 10 ranking guaranteed” from people who claim that they can work some magic on your site (or off it) that will get you to the top of the natural search engine results for the terms you want to lead people to your site.

Needless to say, these people are frauds. If the spam isn’t enough of a giveaway, there are two additional reasons. First of all, all the major search engines guard their algorithms fiercely and change them frequently, exactly to fight off people who try to game their systems in this way. Secondly, they will probably choose an oddball phrase (see the story of Nigritude Ultramarine) that nobody else is using and have you put that on your site. Of course it will get you a top 10 ranking ‘coz you’re the only one using it.

Additionally, some of these “respectable professionals” run so-called link farms, which are in fact huge infinite loops of pages linking to each other in order to give the impression that any one of these pages is highly popular because it has so many incoming links. The link farm pages (here is an OLD blog entry I did on that) are usually machine generated and make little or no sense to humans, but may trick the search engines – for a while. When the search engine engineers find out, however, they will get in a foul mood and are likely to punish or even ban all the pages involved from their search engines, and then you’re better off without them in the first place!

There is a range of other tricks that bad SEOs use, but the results from using them are usually the same: It will rank you high for a while, and then have the opposite effect.

On the other hand, there is a series of things you should do to make it easier for search engines to understand what your site is all about, without playing dirty. Most of them are exactly the same things that make the page more usable for humans: descriptive page titles, informative link texts, proper use of the terms you want to lead people to your site and, last but not least, good and relevant information on your subject. If you have a highly usable and informative site, people will link to you as an information source, and that will give you quality incoming links from respectable sites – one of the most important things for ranking well, especially at Google.

Enough about that – others have written much better introductions to search engine marketing, and a lot of respectable people make a living out of giving healthy tips and consulting on these issues.

The main point is that in the search engine world, there is a constant fight between the bad SEOs and the engineers at the search engine companies.

During dinner after the conference, I had a very interesting chat with one of the speakers, who expressed his concern that “search was the next spam” – meaning not that people will try to game the search engines (they have done so since way back in the 90s), but that they will get so good at it that it will start to hurt the search business in the same way as email spam has hurt email.

As mentioned before, there are two ways to get exposure in search engine results: paid-for placement and natural search results. With paid search estimated to be a 2.2 billion dollar industry last year, there is a lot of incentive for a lot of clever (yet maybe a little immoral) engineers to try their best to rank high in the natural results. The natural results are a lot more likely to get clicked and are therefore more valuable than the paid ones – anybody want to estimate a number?

And the evidence is out there. Even Google sometimes gives me results that are obviously there for the “wrong reason”. Other engines do it more frequently. And even with the “non-spam” results there has been talk recently about the majority of search results being commercial – even in fields where you would expect to get both commercial and non-commercial results.

The paid-for placement results are of course commercial by nature – no problem there, as long as they are clearly marked as such. The natural results, on the other hand, should reflect the best information on the web – commercial or not. The problem is that commercial sites are usually made by professionals, and by now most of them pay at least some attention to optimization that will help them get a higher ranking in the search engines – not necessarily the bad tricks, just the good ones. And therefore the majority of the natural results is, and will be, commercial too.

Search engine indexes, populated by machine bots – engineering marvels as they are – simply cannot make the needed distinction here. The ones that can tell what people will find to be good results are – you guessed it – other people.

Human information will become increasingly important in the fight against spam and in keeping search engine results free of commercial bias. Human information was what created Yahoo! in the beginning. This was also the brilliance of Google’s PageRank and link analysis when they began: they were tapping into human information – links that people (webmasters) created were treated as a “vote” for, and as meta information about, the pages they linked to. This is what search engines saw as a major asset in the blogging community, and this is why humanly created indexes like the ones now constantly growing at bookmarking services such as my Spurl.net, del.icio.us and LookSmart’s Furl will become major assets in the search industry in the coming months and years. And with a decent reputation / trust system (think Slashdot), it will be relatively easy to keep the spammers out – at least for a while.