Google is an extremely powerful tool. Don’t worry, I’m not joining the “Google is too powerful” debate; it’s outside Wetware’s scope anyway. But Google is more than just simple text search. One of the brilliant things here is Google’s web APIs. That’s right: Google is allowing us – the nerd herd – to use its powerful search engine and database to make apps of our own.
I wish I had had the time to play around with this, but a lot of people have, with very interesting results. See for example some of the clever Google hacks from Douwe Osinga. One of his projects, Google History, inspired the following idea for an information mining tool using the Google APIs…
If you didn’t go to his site: Douwe’s Google History is a relatively simple application where you enter a notable event in history, e.g. “Man lands on moon”, “Ronald Reagan elected president” or “Elvis Presley dies”. The application then runs a clever series of Google search queries to try to determine the year the event occurred. I suggest you visit the site for the details.
This made me wonder whether a similar approach couldn’t be used to mine the web and make a little more sense of some of the vast information available there.
Here’s one approach: say we want to build a database with information about companies. The information to store includes the company name, board members, executives, products, industry, etc. All of this is information you would normally find on a company’s website, although in very different places and forms from one company website to another. Using methods similar to Google History’s, you could feed a “Google miner” a list of the names of the companies you want to cover, and the miner would go out there and fill in the rest.
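To make the idea concrete, here is a minimal sketch of that mining loop. The `search()` function is an invented stand-in for a real search-API call (the actual Google Web APIs used SOAP); it returns canned (URL, snippet) pairs so the sketch runs on its own, and the company data is made up.

```python
# search() is a stand-in for a real search-API call; it returns
# canned (url, snippet) pairs so the sketch is self-contained.
def search(query):
    canned = {
        '"Acme Corp" executives': [
            ("http://www.acme.com/about",
             "Acme Corp executives: Jane Doe (CEO), John Roe (CFO)"),
        ],
    }
    return canned.get(query, [])

def mine_company(name, fields):
    """Query once per wanted field and keep the raw snippets for later parsing."""
    record = {"name": name}
    for field in fields:
        hits = search('"%s" %s' % (name, field))
        record[field] = [snippet for _url, snippet in hits]
    return record

profile = mine_company("Acme Corp", ["executives"])
```

A real miner would of course still need to parse names and titles out of the snippets; the point here is only the query-per-field loop.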
Several tricks would make it more accurate: giving the miner synonyms for the terms you’re after (“executives” = “management”, “execs”, etc.), ranking information higher when it is found on sites whose URL resembles the company name (e.g. http://www.[companyname].com), providing a list of URLs for sites where the information you’re after is likely to be found, and so on. There are loads of things that could be used to improve this.
The same can be done for almost any category of information you can think of. Examples include people, cars, cities, events, novels, animal species, sports and software products, just to name a few. You could even use it to gather information about blogs, other websites and even web servers. Each of these categories can obviously be broken down into sub-categories; instead of “people” you might be looking for “playwrights”, “actors”, “business people”, “politicians in Norway” or “butterfly collectors”.
The miner could spot conflicting patterns (e.g. different birth dates or other personal details turning up for the same name) and either conclude that there are two people/companies/bloggers with the same name, or flag the discrepancy for the user to decide what to make of it.
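A simple way to implement that check: count the distinct values seen for each field, and flag any field with more than one. The observation data below is invented.

```python
from collections import Counter

def detect_conflicts(observations):
    """observations: list of (field, value) pairs mined for one name.
    Returns the fields whose values disagree, with counts per value --
    a hint that the name may cover two different people/companies."""
    by_field = {}
    for field, value in observations:
        by_field.setdefault(field, Counter())[value] += 1
    return {f: dict(c) for f, c in by_field.items() if len(c) > 1}

obs = [("birth date", "1935-01-08"), ("birth date", "1942-06-18"),
       ("nationality", "US"), ("nationality", "US")]
conflicts = detect_conflicts(obs)
```

Here only the birth date would be flagged; the consistent nationality field passes silently.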
Crawlers and spiders sometimes serve this purpose up to a point, but why go to that trouble when Google has already done the work? What I envision is an application where you feed in an XML schema (it could be wrapped in a user-friendly interface for normal users) describing the information you want to gather, define a few rules (e.g. only search European sites, or only sites where a certain phrase can be found) and tricks (like sites you believe are likely and reliable sources of the wanted information, synonyms for the search terms, likely ranges for numeric values, etc.), and then send the “Google miner” out to gather information on almost anything you like.
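Such a schema might look something like the toy below; every element and attribute name here is invented for illustration, not taken from any real standard.

```python
import xml.etree.ElementTree as ET

# A made-up miner schema: fields to fill, per-field synonyms, and a rule.
SCHEMA = """
<miner-schema category="company">
  <field name="executives">
    <synonym>management</synonym>
    <synonym>execs</synonym>
  </field>
  <field name="industry"/>
  <rule type="site-restrict">.eu</rule>
</miner-schema>
"""

def load_schema(xml_text):
    """Parse the schema into (fields, rules) the miner can act on."""
    root = ET.fromstring(xml_text)
    fields = {f.get("name"): [s.text for s in f.findall("synonym")]
              for f in root.findall("field")}
    rules = [(r.get("type"), r.text) for r in root.findall("rule")]
    return fields, rules

fields, rules = load_schema(SCHEMA)
```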
Many such schemas would be made up on the spot, but for common things users could use ready-made schemas like the ones found in the schema registry at xml.org.
You could also use a miner like this to answer specific questions, like someone’s birth date, saving you the trouble of browsing through the sites in a Google results list to spot it yourself. Methods like putting a certainty rating on results – based on where, and in how many places, a specific piece of information was found – and asking for human help when in trouble (as with the conflicts mentioned above) could make this a useful tool for many things. Among other things, it would enable searches we don’t even attempt today, because we know a typical Google search would only return an overload of potential positives.
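One way to compute such a certainty rating: give each sighting of a value a source weight (say, higher for the official site than for a fan page) and rate each value by its share of the total weight. The weights and dates below are invented.

```python
def certainty(instances):
    """instances: list of (value, source_weight) sightings.
    Returns each value's share of the total weight as its rating."""
    totals = {}
    for value, weight in instances:
        totals[value] = totals.get(value, 0.0) + weight
    grand = sum(totals.values()) or 1.0
    return {v: t / grand for v, t in totals.items()}

ratings = certainty([("1935-01-08", 2.0),   # official site, weighted up
                     ("1935-01-08", 1.0),   # independent page
                     ("1953-01-08", 1.0)])  # lone outlier, likely a typo
```

The majority date ends up with a 0.75 rating and the outlier with 0.25, which is exactly the kind of signal you would hand to a human when no value dominates.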
There are of course shortcomings to the miner described above. Some of them I’ve spotted, others I most certainly haven’t (comments welcomed). But after a suitable period of trial and error to improve on the methods described here, I suspect the biggest shortcoming would be that, as much information as there is on the web, far more of the information we might want to gather is NOT there. Take a random individual in the world, for example, and you may well find nothing about him or her anywhere. Also, as pointed out in a previous Wetware entry, “Gathering Common Sense”, some information is so trivial that nobody ever thought of recording it.
So, is anyone up to the programming task?
(This must be the ultimate “lazy programming”, i.e. just putting the idea out there for someone else to pick it up and do the coding. Please let me know when you’re done 😉)
Yes, it would be great to have a proper hands-on tool to dig into the internet for data. But as for this Google functionality, I would think it would be hard to get anything more than the most obvious data from it (for historical events, company info and most encyclopedic data it would be very difficult to surpass Atomica/Gurunet). But individual data, such as management and employees of companies (and dare I say email addresses), as you point out, might be extracted with more ease than with regular crawlers.
What I would be most interested in would be a data collector/crawler with a proper query interface that utilizes not only the data on web pages but also the info they provide access to, phone directories and such. This would of course drive everybody up the wall, but it would be very cool. (Apparently Rumsfeld also thought this would be cool, but it looks like TIA has been stopped/postponed.)
I don’t think this tool would surpass specialized databases like Atomica/Gurunet, the CIA’s World Factbook for country info, IMDB for movie facts or Amazon for book info. What the Google Miner could do is help us build new and even more specialized databases of that kind, and take a lot of the manual labor out of doing so.
I totally agree on the value of linking to other sources of information as well (e.g. phone directories, as you mention). This was partially what I meant by “sites you believe are likely and reliable sources of the wanted information”.
So I believe we have a new feature request for our application: one of the “tricks” should be to allow the user to define queries against available online (and probably also local) resources.
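That feature could be as simple as a registry where the user plugs in their own lookup functions for external resources; all the names and the toy directory below are made up.

```python
# User-pluggable resource registry (all names here are invented).
RESOURCES = {}

def register_resource(name, func):
    """Let the user hook a lookup function for an external resource."""
    RESOURCES[name] = func

def lookup(resource, query):
    """Run a query against a registered resource."""
    return RESOURCES[resource](query)

# e.g. a toy phone-directory lookup the user might register
register_resource("phone-directory",
                  lambda q: {"John Doe": "555-0100"}.get(q))
```

The miner would then treat these registered lookups as just another class of sources, alongside the web searches.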