Needless to say, I am a firm believer that the next big steps in web search will come from involving the users more in the ranking and indexing process.

The best search engines on the web have always been built around human information. In the early days, Yahoo! was the king: first based on Jerry Yang's bookmark collection, and later on herds of human editors categorizing the web on top of Jerry's original collection. The web's exponential growth soon overwhelmed this model, and for a year or so the best way to search the web involved trying to find one or two useful results among full pages of porn and casino links on AltaVista or HotBot.

Then Google came along and changed everything with its ingenious use of the PageRank algorithm, which treats a link from one web page to another as a vote for the quality and relevance of the linked page's content. You know the story. Today, all the big engines use variations of this same core methodology, obviously with a lot of other things attached to fight spam and otherwise improve the rankings.
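The voting idea can be sketched with a toy power-iteration implementation. This is a simplified illustration of the published PageRank concept, not Google's actual code; the three-page link graph and the damping factor of 0.85 are just the textbook setup.

```python
# Minimal power-iteration sketch of the PageRank idea:
# each page's score is distributed as "votes" along its outgoing links.
links = {
    "a": ["b", "c"],   # page "a" links to "b" and "c"
    "b": ["c"],
    "c": ["a"],
}
damping = 0.85
rank = {page: 1.0 / len(links) for page in links}

for _ in range(50):  # iterate until the scores stabilize
    new_rank = {page: (1 - damping) / len(links) for page in links}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank

# Pages with more (and better-ranked) incoming links end up with higher scores.
print(sorted(rank, key=rank.get, reverse=True))
```

Here page "c" comes out on top: it collects votes from both "a" and "b", while "b" is only endorsed by half of "a"'s vote.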

Next steps: From page content to user behavior

But the next step is due and the big guys in the search world are really starting to realize this. A lot of the recent innovations coming from the search engines – especially from Google – are targeted directly at gathering more data about user behavior that they can in turn use to improve web search.

Among the behavior information that could be used for such improvements are:

which links users click

how long they spend on each page

which pages they bookmark

All of these things help the search engine determine the importance, quality and freshness of a page, or even of individual parts of the page.
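As a purely hypothetical illustration of how such signals could feed back into ranking, one could blend a page's content-based score with its observed click-through rate. Everything below — the field names, the weights, the numbers — is invented for the sketch; nothing here is Google's actual formula.

```python
# Hypothetical re-ranking: blend a content relevance score with
# the observed click-through rate (CTR). All numbers are made up.
results = [
    {"url": "example.com/a", "content_score": 0.9, "clicks": 10, "impressions": 100},
    {"url": "example.com/b", "content_score": 0.7, "clicks": 60, "impressions": 100},
]

def blended_score(result, ctr_weight=0.5):
    """Weighted mix of what the page says and how users behave toward it."""
    ctr = result["clicks"] / result["impressions"]
    return (1 - ctr_weight) * result["content_score"] + ctr_weight * ctr

# The page users actually click can outrank the one with the better content score.
ranked = sorted(results, key=blended_score, reverse=True)
```

In this toy example the heavily clicked page "b" overtakes "a" despite a lower content score, which is exactly the kind of correction behavioral data makes possible.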

All of these things and a lot more are mentioned in one way or another in Google's recent patent application; check this great analysis of the patent for details on the user data it describes.

Google’s ways of gathering data

And how exactly will Google gather this data? Simple: by giving users useful tools such as the Toolbar, Deskbar and Desktop Search that improve their online experience and, as a side effect, provide Google with useful information such as the above.

I’ve dug through the privacy policies for these tools (Toolbar, Desktop Search, Deskbar), and it’s not easy to see exactly what data is sent to Google in all cases. The Toolbar does send them the URL of every visited page in order to provide the PageRank display and a few other services. The feature is opt-out but enabled by default, so Google has the browsing history of almost every user that has the Toolbar set up. Beyond this, the details are vague. Here are two snips from the Desktop Search Privacy FAQ:

“So that we can continuously improve Google Desktop Search, the application sends non-personal information about things like the application’s performance and reliability to Google. You can choose to turn this feature off in your Desktop Preferences.”

“If you send us non-personal information about your Google Desktop Search use, we may be able to make Google services work better by associating this information with other Google services you use and vice versa. This association is based on your Google cookie. You can opt out of sending such non-personal information to Google during the installation process or from the application preferences at any time.”

It sounds like error and crash reporting, but it does not rule out that other things, such as bookmarks or time spent on pages, are sent as well. None of the above-mentioned products provides a complete, detailed list of what is sent to Google.

Next up: Google delivering the Web

And then came the Web Accelerator. Others have written about the privacy, security and sometimes actually damaging behavior of the Web Accelerator, but as this blog post points out, the real reason Google is providing it is to monitor online behavior by becoming a proxy service for as many internet users as possible.

I’m actually a bit surprised by the Web Accelerator, for several reasons. First, Google is historically known for the excellent quality of its products, but this one seems to have been publicly released way too early, judging from the privacy, security and bug reports coming in from all over the place. Second, the value is questionable. I’m all for saving time, but if a couple of seconds per hour open up a range of privacy and security issues, I’d rather just live a few days shorter (if I save 2 seconds every hour for the next 50 years, that amounts to about 10 days).

🙂
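The back-of-the-envelope figure is easy to verify:

```python
# Sanity check of the "10 days" figure:
# 2 seconds saved per hour, every hour, for 50 years.
seconds_saved = 2 * 24 * 365.25 * 50          # sec/hour * hours/day * days/year * years
days_saved = seconds_saved / (60 * 60 * 24)   # convert seconds back to days
print(round(days_saved, 1))                   # ≈ 10.1 days
```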

The third and biggest surprise comes from the fact that I thought Google had already bought into this kind of information with its fiber investments and the acquisition of Urchin. Yes, I’m pretty sure the fiber investment was made to be able to monitor web traffic over their fat pipes for the reasons detailed above, even as a “dumb” bandwidth provider. Similar things are already done by companies such as WebSideStory and Hitwise that buy traffic information from big ISPs.

Urchin’s statistics tools can be used for the same purpose in various ways – I’m pretty sure Google will find a way to make it worth an administrator’s trouble to share his Urchin reports with Google for some value-add. So why bother with the Web Accelerator proxy play?

Google already knows the content of every web page on the visible web; now they want to know how we’re all using it.

Good intentions, bad communication

Don’t get me wrong: I’m sure that gathering all this data will dramatically improve web search, and I marvel at the clever technologies and methods in use. What bothers me a bit is that Google is not coming clean about how they will use all this data, or even about the fact that, and the extent to which, they are gathering it.

I can see only two possible reasons for this lack of communication. The first is that they don’t want to tell their competitors what they are doing. But that can’t be the case: Yahoo!, MSN and Ask Jeeves surely have whole divisions that do nothing but reverse-engineer Google products, analyze their patent filings and so on; believing otherwise would be naive. The second reason I can think of is that they are afraid users will not be as willing to share this information once it becomes clearly visible how massive the data gathering efforts have become.

I’m not an extremely privacy-concerned person myself, but I respect, understand and support the opinions of those who are. The sheer amount of user data being gathered will gradually cause more people to wonder what would happen if Google turned evil.

– – –

Hjalmar Gislason is the founder of Spurl.net. The above reflects some of the things his company is doing with the Spurl.net bookmarking system and the Zniff search engine.