
hjalli.com – Hjálmar Gíslason

Technology and other wonders / Tækni og fleiri undur veraldar


Google and user driven search indexes

May 9, 2005 by Hjalmar Gislason

Needless to say, I am a firm believer that the next big steps in web search will come from involving the users more in the ranking and indexing process.

The best search engines on the web have always been built around human information. In the early days, Yahoo! was the king, first based on Jerry Yang’s bookmark collection and later on herds of human editors who categorized the web, building on top of Jerry’s collection. The web’s exponential growth soon broke this model, and for a year or so the best way to search the web involved trying to find one or two useful results among full pages of porn and casino links on AltaVista or HotBot.

Then Google came along and changed everything with their ingenious use of the PageRank algorithm, which treats a link from one web page to another as a vote for the quality and relevance of the content of the linked page. You know the story. Today, all the big engines use variations of this very same core methodology, obviously with a lot of other things attached to fight spam and otherwise improve the rankings.
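For readers who have never seen the "link as a vote" idea spelled out, here is a minimal sketch of the textbook PageRank iteration. It is not Google's production ranking; the toy link graph, the function name `pagerank` and the damping value of 0.85 (the commonly cited default) are my own assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by treating each outgoing link as a vote for its target.

    `links` maps a page name to the list of pages it links to.
    This is the simplified textbook iteration, not a production ranker.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page keeps a small baseline share regardless of votes.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank everywhere.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by every other page.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
best = max(ranks, key=ranks.get)
print(best, ranks)
```

Note that page "d" links out but nobody links to it, so it ends up with only the baseline share: it casts a vote without receiving any.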

Next steps: From page content to user behavior

But the next step is due and the big guys in the search world are really starting to realize this. A lot of the recent innovations coming from the search engines – especially from Google – are targeted directly at gathering more data about user behavior that they can in turn use to improve web search.

Behavior information that could be used for such improvements includes:

  • Bookmarks: A bookmark is probably the most direct way a user can say “I think this page is of importance”. Additionally, users usually categorize things that they bookmark, and some browsers and services – such as my very own Spurl.net – make the bookmarks more useful by allowing users to add various kinds of metadata about the pages they are bookmarking. More on the use of that in my recent “Coming to terms with tags” post.
  • History: What pages has the user visited? The mere fact that a user visited a page does not say anything about it other than that the page owner somehow managed to lure the user there (usually with good intentions). The fact that a page is frequently visited (or even visited at all) does, however, tell the search engine that this page is worth crawling and indexing. This ensures that fresh and new content gets discovered quickly and that successful black hat SEO tricks (in other words “search engine spam”) don’t go around for long without getting noticed.
  • Navigational behavior: Things such as:
    • which links users click
    • which links are never clicked
    • how far down users scroll on the page
    • how long a user stays on a page
    • whether users mainly navigate within the site or visit external links
    • etc.

All of these things help the search engine determine the importance, quality and freshness of the page or even of individual parts of the page.
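To make the idea concrete, here is a rough sketch of how signals like these could be blended into a single quality score. Everything in it is hypothetical: the `PageStats` fields, the `behavior_score` function, the caps and the weights are made up for illustration and do not describe how Google, Spurl.net or any other engine actually does it.

```python
import math
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical aggregated behavior data for one page."""
    visits: int
    bookmarks: int
    avg_seconds_on_page: float
    avg_scroll_fraction: float  # 0.0 = top only, 1.0 = scrolled to the end


def behavior_score(stats: PageStats) -> float:
    """Blend behavior signals into a 0..1 score; the weights are made up."""
    if stats.visits == 0:
        return 0.0
    # Bookmarks per visit: the most direct "this page matters" signal.
    bookmark_rate = min(stats.bookmarks / stats.visits, 1.0)
    # Average time on page, capped at two minutes so idle tabs cannot dominate.
    dwell = min(stats.avg_seconds_on_page, 120.0) / 120.0
    # Log-damped popularity: visits count, but with diminishing returns.
    popularity = min(math.log10(1 + stats.visits) / 6.0, 1.0)
    return (0.4 * bookmark_rate
            + 0.25 * dwell
            + 0.2 * stats.avg_scroll_fraction
            + 0.15 * popularity)


# A page users are lured to and leave at once, versus a smaller page they use.
lured = PageStats(visits=50_000, bookmarks=5,
                  avg_seconds_on_page=4.0, avg_scroll_fraction=0.05)
useful = PageStats(visits=2_000, bookmarks=300,
                   avg_seconds_on_page=90.0, avg_scroll_fraction=0.7)
print(behavior_score(lured), behavior_score(useful))
```

With these made-up weights the smaller but well-used page outranks the heavily visited one, which is the point made above about visits alone saying little about quality.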

All of these things and a lot more are mentioned in one way or another in Google’s recent patent application. Check this great analysis of the patent for details. Here’s a snip about user data:

  • traffic to a document is recorded and monitored for changes (possibly through toolbar, or desktop searches of cache and history files) (section 34, 35)
  • User behavior on websites is monitored and recorded for changes (click-through, back button, etc.) (section 36, 37)
  • User behavior is monitored through bookmarks, cache, favorites, and temp files (possibly through google toolbar or desktop search) (section 46)
  • Bookmarks and favorites are monitored for both additions and deletions (section 0114, 0115)
  • User behavior for documents is monitored for trend changes (section 47)
  • The time a user spends on a website may be used to indicate a document’s quality or freshness (section 0094)

Google’s ways of gathering data

And how exactly will Google gather this data? Simple: by giving users useful tools such as the Toolbar, Deskbar and Desktop Search that improve their online experience and, as a side effect, provide Google with useful information such as the above.

I’ve dug through the privacy policies for these tools (Toolbar, Desktop Search, Deskbar) and it’s not easy to see exactly what data is sent to Google in all cases. The Toolbar does send them the URL of every visited page in order to provide the PageRank display and a few other services. This is opt-out but on by default, so they have the browser history of almost every user who has the Toolbar set up. Other than this, details are vague. Here are two snips from the Desktop Search Privacy FAQ:

“So that we can continuously improve Google Desktop Search, the application sends non-personal information about things like the application’s performance and reliability to Google. You can choose to turn this feature off in your Desktop Preferences.”

“If you send us non-personal information about your Google Desktop Search use, we may be able to make Google services work better by associating this information with other Google services you use and vice versa. This association is based on your Google cookie. You can opt out of sending such non-personal information to Google during the installation process or from the application preferences at any time.”

It sounds like error and crash reporting, but it does not rule out that other things, like bookmarks, time spent on pages, etc., are sent as well. None of the above-mentioned products provides a definitive, detailed list of the things that are sent to Google.

Next up: Google delivering the Web

And then came the Web Accelerator. Others have written about the privacy and security issues and the sometimes actually damaging behavior of the Web Accelerator, but as this blog post points out, the real reason they are providing it is to monitor online behavior by becoming a proxy service for as many internet users as possible.

I’m actually a bit surprised by the Web Accelerator for several reasons. First, Google is historically known for the excellent quality of their products, but this one seems to have been publicly released way too early, judging from the privacy, security and bug reports coming from all over the place. Secondly, the value is questionable. I’m all for saving time, but if saving a couple of seconds per hour opens up a range of privacy and security issues, I’d rather just live a few days shorter (if I save 2 seconds every hour for the next 50 years, that amounts to about 10 days) :)

The third and biggest surprise comes from the fact that I thought Google had already bought into this kind of information with their fiber investments and the acquisition of Urchin. Yes, I’m pretty sure that the fiber investment was made to be able to monitor web traffic over their fat pipes for the reasons detailed above – even as a “dumb” bandwidth provider. Similar things are already done by companies such as WebSideStory and Hitwise that buy traffic information from big ISPs.

Urchin’s statistics tools can be used for the same purpose in various ways – I’m pretty sure Google will find a way to make it worth an administrator’s trouble to share his Urchin reports with Google for some value-add. So why bother with the Web Accelerator proxy play?

Google already knows the content of every web page on the visible web; now they want to know how we’re all using it.

Good intentions, bad communication

Don’t get me wrong. I’m sure that gathering all this data will dramatically improve web search, and I marvel at the clever technologies and methods in use. What bothers me a bit is that Google is not coming clean about how they will use all this data, or even about the fact that they are gathering it and to what extent.

I can see only two possible reasons for this lack of communication. The first is that they don’t want to tell their competitors what they are doing. This can’t be the case: Yahoo, MSN and Ask Jeeves surely have whole divisions that do nothing but reverse engineer Google products, analyze their patent filings and so on. Believing otherwise would just be naive. The second reason I can think of is that they are afraid that users will not be as willing to share this information if it becomes clearly visible how massive their data gathering efforts have become.

I’m not an extremely privacy-concerned person myself, but I respect, understand and support the opinions of those who are. The amount of user data that is gathered will gradually cause more people to think about what would happen if Google turned evil.

- – -

Hjalmar Gislason is the founder of Spurl.net. The above reflects some of the things his company is doing with the Spurl.net bookmarking system and the Zniff search engine.


Posted in english, search | 2 Comments

2 Responses

  1. on June 3, 2005 at 08:08 Jarry

    It might slow the connection down quite a bit if everyone starts tracking everything.


  2. on June 21, 2005 at 05:39 atmor

    Sorry for my links




  • Hjálmar Gíslason


    A technology enthusiast and general nerd living in Iceland. Founder of four tech-companies. Currently working on DataMarket.
