There is one huge disadvantage to being king of the mountain: Everyone is trying to knock you off. So it's no surprise to see Google's crown jewel -- search -- threatened by technologies aimed at tapping the "Deep Web."
The term refers to mountains of unmined data trapped in the nooks and crannies of databases that are invisible to traditional search engines. Google (Nasdaq: GOOG), for example, uses crawler, or spider, programs to gather information by following hyperlinks -- a method that, sources say, can't produce the same results as penetrating, in-depth database queries.
One of these sources, the startup BrightPlanet, holds that the Deep Web is 550 times the size of the surface Web. In the same vein, Anand Rajaraman, co-founder of vertical search engine firm Kosmix, says, "The crawlable Web is the tip of the iceberg." Accordingly, instead of relying on keywords, Kosmix's site looks more like a Web portal than a typical search engine.
A Deep Web search pulls together audio, text, and video content based on the concept underlying the question being asked. Exploring information beyond text alone can feel more like browsing than searching.
"Most of the search engines help you find a needle in a haystack. What we're trying to do is help you explore the haystack," says Rajaraman. "Sometimes we are not looking for a specific piece of information but something broader... Sometimes it's far more valuable to get the opinion and discussion and commentary around things."
Other projects take similar approaches, searching within a context instead of tracking a specific term. University of Utah professor Juliana Freire, for instance, is working on a beta project called DeepPeep. The goal is to crawl and index every public database on the Web, extracting "far-flung" nuggets of content.
Rather than blindly querying every word in the dictionary, Freire contends, "DeepPeep starts by presenting a small number of sample queries, so we can use that to build up our understanding of the databases and choose which words to search." Once that information is analyzed, the program launches an automated search based on those terms, unearthing as much data as possible.
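The sampling strategy Freire describes can be sketched in miniature. The toy database, seed terms, and helper names below are invented for illustration -- this is not DeepPeep's code, only the general idea of issuing a few queries, mining the results for promising new terms, and repeating:

```python
from collections import Counter

# Toy stand-in for a Deep Web database: records reachable only via
# keyword queries, never by following hyperlinks. (Invented data.)
RECORDS = [
    "solar panel installation cost",
    "solar inverter efficiency ratings",
    "wind turbine blade design",
    "wind farm noise regulations",
    "panel mounting hardware guide",
]

def query(term):
    """Simulate a search-form query: return records containing the term."""
    return [r for r in RECORDS if term in r.split()]

def sample_crawl(seed_terms, rounds=3, queries_per_round=2):
    """Iteratively sample a hidden database: issue a few queries,
    mine the retrieved records for new candidate terms, then query
    the most frequent unseen terms in the next round."""
    seen_records, issued = set(), set()
    candidates = list(seed_terms)
    for _ in range(rounds):
        next_terms = [t for t in candidates if t not in issued][:queries_per_round]
        if not next_terms:
            break
        counts = Counter()
        for term in next_terms:
            issued.add(term)
            for rec in query(term):
                seen_records.add(rec)
                counts.update(rec.split())
        # Prefer terms that appeared often in this round's results.
        candidates = [t for t, _ in counts.most_common() if t not in issued]
    return seen_records, issued
```

Starting from the single seed "solar", the sketch discovers "panel" from the results and follows it to a record the seed alone would never have matched -- mirroring how sampled queries let a crawler expand its picture of a database's vocabulary.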
In a report Freire co-authored last fall, she wrote: "We have designed HIerarchical Form Identification (HIFI), a new method for automatically classifying forms with respect to a database domain that is both scalable and accurate."
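As a rough illustration of what classifying forms "with respect to a database domain" involves -- the vocabularies, labels, and two-stage split below are invented for this sketch, not HIFI's actual features or method -- a classifier might first decide whether a form is searchable at all, then assign a domain by vocabulary overlap:

```python
# Invented domain vocabularies keyed by database domain.
DOMAIN_VOCAB = {
    "airfare": {"departure", "arrival", "passengers", "airline"},
    "jobs": {"salary", "employer", "resume", "position"},
}
# Field labels that hint a form queries a database at all.
SEARCH_HINTS = {"search", "find", "query", "keyword"} | set().union(*DOMAIN_VOCAB.values())

def classify_form(field_labels):
    """Two-stage sketch: filter out non-search forms (logins, signups),
    then pick the domain whose vocabulary best overlaps the form's labels."""
    labels = {label.lower() for label in field_labels}
    # Stage 1: is this a searchable form?
    if not labels & SEARCH_HINTS:
        return "non-searchable"
    # Stage 2: choose the domain with the greatest vocabulary overlap.
    best = max(DOMAIN_VOCAB, key=lambda d: len(labels & DOMAIN_VOCAB[d]))
    return best if labels & DOMAIN_VOCAB[best] else "unknown-domain"
```

A form with "Departure", "Arrival", and "Passengers" fields lands in the airfare domain, while a "Username"/"Password" form is discarded before domain matching -- the kind of filtering a Deep Web crawler needs before it wastes queries on the wrong forms.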
As this technology evolves, perhaps even toward realizing the vision of the Semantic Web, don't expect Google to rest on its laurels. The vendor has resisted a complete overhaul of its search algorithms, though, ostensibly so as not to alienate users by complicating its Web pages.
Google's own Deep Web strategy includes a program that analyzes the content of every database it encounters and builds a predictive model of what each database holds.
There may be roadblocks along the way for the Deep Web: Information can carry intellectual property rights and licensing fees, for instance. Still, as of now, 95 percent of Deep Web content is publicly accessible for free.
Work continues on the Deep Web even as Google announced last July that its index had passed one trillion unique URLs. The impact of the research is still unknown; some think it will be felt not by casual Web users but by the business community.
Any way you look at it, mountains of unmined data lie waiting to be tapped. And Google may not get there fast enough.
— Chris Poley has been a professional trader for over 20 years
This blog is part of Internet Evolution's IT Clan, which addresses the continuing impact of the Internet on enterprise networks, applications, and management.