Google is an extremely powerful tool. Don’t worry, I’m not joining the “Google is too powerful” debate; it’s outside Wetware’s scope anyway. But Google is more than just simple text search. One of the brilliant things here is Google’s web APIs. That’s right: Google is allowing us – the nerd herd – to use its powerful search engine and database to make apps of our own.
I wish I had had the time to play around with this, but a lot of people have, with very interesting results. See for example some of the clever Google hacks from Douwe Osinga. One of his projects, Google History, inspired the following idea for an information mining tool using the Google APIs…
If you didn’t go to his site: Douwe’s Google History is a relatively simple application where you enter a notable event in history, e.g. “Man lands on moon”, “Ronald Reagan elected president” or “Elvis Presley dies”. The application then runs a clever series of Google search queries to try to determine the year the event occurred. I suggest you visit the site for the details.
This made me wonder whether a similar approach couldn’t be used to mine the web and make a little more sense of some of the vast information available there.
Here’s one approach: say we want to build a database with information about companies. The information to store includes the company name, board members, executives, products, industry, etc. All of this is information you would normally find on a company’s website, although in very different places and forms from one company website to another. Using methods similar to Google History’s, you could feed a “Google miner” a list of the names of the companies you want to cover, and the miner would go out there and fill in the rest.
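To make the idea concrete, here is a minimal sketch of that mining loop. The `search()` function is an invented stand-in for a real search-API call (the actual Google Web APIs used SOAP); it returns canned (URL, snippet) pairs so the sketch runs on its own, and the company data is made up.

```python
# search() is a stand-in for a real search-API call; it returns
# canned (url, snippet) pairs so the sketch is self-contained.
def search(query):
    canned = {
        '"Acme Corp" executives': [
            ("http://www.acme.com/about",
             "Acme Corp executives: Jane Doe (CEO), John Roe (CFO)"),
        ],
    }
    return canned.get(query, [])

def mine_company(name, fields):
    """Query once per wanted field and keep the raw snippets for later parsing."""
    record = {"name": name}
    for field in fields:
        hits = search('"%s" %s' % (name, field))
        record[field] = [snippet for _url, snippet in hits]
    return record

profile = mine_company("Acme Corp", ["executives"])
```

A real miner would of course still need to parse names and titles out of the snippets; the point here is only the query-per-field loop.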
Several tricks would make it more accurate: giving the miner synonyms for the terms you’re after (“executives” = “management”, “execs”, etc.), ranking information higher when it is found on sites whose URL resembles the company name (e.g. http://www.[companyname].com), providing a list of URLs for sites where the information you’re after is likely to be found, and so on. There are loads of things that could be used to improve this.
The same can be done for almost any category of information you can think of. Examples include people, cars, cities, events, novels, animal species, sports and software products, just to name a few. You could even use it to gather information about blogs, other websites and even web servers. Each of these categories can obviously be broken down into sub-categories; instead of “people” you might be looking for “playwrights”, “actors”, “business people”, “politicians in Norway” or “butterfly collectors”.
The miner could spot conflicting patterns (e.g. different birth dates or other personal details turning up for the same name) and either conclude that there are two people/companies/bloggers with the same name, or flag the discrepancy for the user to decide what to make of it.
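A simple way to implement that check: count the distinct values seen for each field, and flag any field with more than one. The observation data below is invented.

```python
from collections import Counter

def detect_conflicts(observations):
    """observations: list of (field, value) pairs mined for one name.
    Returns the fields whose values disagree, with counts per value --
    a hint that the name may cover two different people/companies."""
    by_field = {}
    for field, value in observations:
        by_field.setdefault(field, Counter())[value] += 1
    return {f: dict(c) for f, c in by_field.items() if len(c) > 1}

obs = [("birth date", "1935-01-08"), ("birth date", "1942-06-18"),
       ("nationality", "US"), ("nationality", "US")]
conflicts = detect_conflicts(obs)
```

Here only the birth date would be flagged; the consistent nationality field passes silently.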
Crawlers and spiders sometimes serve this purpose up to a point, but why go to that trouble when Google has already done the work? What I envision is an application where you feed in an XML schema (it could be wrapped in a user-friendly interface for normal users) describing the information you want to gather, define a few rules (e.g. only search European sites, or only sites where a certain phrase can be found) and tricks (like sites you believe are likely and reliable sources of the wanted information, synonyms for the search terms, likely ranges for numeric values, etc.), and then send the “Google miner” out to gather information on almost anything you like.
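Such a schema might look something like the toy below; every element and attribute name here is invented for illustration, not taken from any real standard.

```python
import xml.etree.ElementTree as ET

# A made-up miner schema: fields to fill, per-field synonyms, and a rule.
SCHEMA = """
<miner-schema category="company">
  <field name="executives">
    <synonym>management</synonym>
    <synonym>execs</synonym>
  </field>
  <field name="industry"/>
  <rule type="site-restrict">.eu</rule>
</miner-schema>
"""

def load_schema(xml_text):
    """Parse the schema into (fields, rules) the miner can act on."""
    root = ET.fromstring(xml_text)
    fields = {f.get("name"): [s.text for s in f.findall("synonym")]
              for f in root.findall("field")}
    rules = [(r.get("type"), r.text) for r in root.findall("rule")]
    return fields, rules

fields, rules = load_schema(SCHEMA)
```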
Many such schemas would be made up on the spot, but for common things users could use ready-made schemas like the ones found in the schema registry at xml.org.
You could also use a miner like this to answer specific questions, like someone’s birth date, saving you the trouble of browsing through the sites in a Google results list to spot it yourself. Methods like putting a certainty rating on results – based on where, and in how many places, a specific piece of information was found – and asking for human help when in trouble (as with the conflicts mentioned above) could make this a useful tool for many things. Among other things, it would enable searches we don’t even attempt today, because we know a typical Google search would only return an overload of potential positives.
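One way to compute such a certainty rating: give each sighting of a value a source weight (say, higher for the official site than for a fan page) and rate each value by its share of the total weight. The weights and dates below are invented.

```python
def certainty(instances):
    """instances: list of (value, source_weight) sightings.
    Returns each value's share of the total weight as its rating."""
    totals = {}
    for value, weight in instances:
        totals[value] = totals.get(value, 0.0) + weight
    grand = sum(totals.values()) or 1.0
    return {v: t / grand for v, t in totals.items()}

ratings = certainty([("1935-01-08", 2.0),   # official site, weighted up
                     ("1935-01-08", 1.0),   # independent page
                     ("1953-01-08", 1.0)])  # lone outlier, likely a typo
```

The majority date ends up with a 0.75 rating and the outlier with 0.25, which is exactly the kind of signal you would hand to a human when no value dominates.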
There are of course shortcomings to the miner described above. Some of them I’ve spotted, others I most certainly haven’t (comments welcomed). But after a suitable period of trial and error to improve on the methods described here, I suspect the biggest shortcoming would be that, as much information as there is on the web, far more of the information we might want to gather is NOT there. Take a random individual in the world, for example, and you may well find nothing about him or her anywhere. Also, as pointed out in a previous Wetware entry, “Gathering Common Sense”, some information is so trivial that nobody ever thought of recording it.
So, is anyone up to the programming task?
(This must be the ultimate “lazy programming”, i.e. just putting the idea out there for someone else to pick it up and do the coding. Please let me know when you’re done 😉)
Yes, it would be great to have a proper hands-on tool to dig into the internet for data. But as for this Google functionality, I would think it would be hard to get anything more than the most obvious data from it (for historical events, company info and most encyclopedic data it would be very difficult to surpass Atomica/Gurunet). But individual data, such as management and employees of companies (and dare I say email addresses), as you point out, might be extracted with more ease than with regular crawlers.
What I would be most interested in would be a data collector/crawler with a proper query interface that utilizes not only the data on web pages but also the info they provide access to, phone directories and such. This would of course drive everybody up the wall, but it would be very cool. (Apparently Rumsfeld also thought this would be cool, but it looks like TIA has been stopped/postponed.)
I don’t think this tool would surpass specialized databases like Atomica/Gurunet, the CIA’s World Factbook for country info, IMDB for movie facts or Amazon for book info. What the Google Miner could do is help us build new and even more specialized databases of that kind, and take a lot of the manual labor out of doing so.
I totally agree on the value of linking to other sources of information as well (e.g. phone directories, as you mention). This was partially what I meant by “sites you believe are likely and reliable sources of the wanted information”.
So I believe we have a new feature request for our application: one of the “tricks” should be to allow the user to define queries against available online (and probably also local) resources.
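That feature could be as simple as a registry where the user plugs in their own lookup functions for external resources; all the names and the toy directory below are made up.

```python
# User-pluggable resource registry (all names here are invented).
RESOURCES = {}

def register_resource(name, func):
    """Let the user hook a lookup function for an external resource."""
    RESOURCES[name] = func

def lookup(resource, query):
    """Run a query against a registered resource."""
    return RESOURCES[resource](query)

# e.g. a toy phone-directory lookup the user might register
register_resource("phone-directory",
                  lambda q: {"John Doe": "555-0100"}.get(q))
```

The miner would then treat these registered lookups as just another class of sources, alongside the web searches.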