Next Generation Search Engine, Anyone?

ML Jul 15, 2006

So, Technorati is reporting over 100 mentions/day of the long tail. The deluge of posts from the Chris Anderson book launch has only been going on for a few days, but the HitTail/MyLongTail site is holding its ground fairly well in the Google default results. We started out on pages 7 and 4 for long tail and longtail respectively. We've only been pushed to pages 8 and 5, even with the 400+ new pages that were rolled out over the past few days.

Now, Google results have a time-delay effect: if those blog posts keep getting linked to at a steady rate, their positions in Google default search will rise, so the tide of new content may not sweep us under for another month or two. That's one of the interesting things about Google default search: it's not a news system so much as a non-real-time popularity contest. In fact, one of Google's potential areas for next-generation-style improvement would be to make their search reflect the real-time state of the Internet instead of a time-delayed index.

This could be done in at least three ways that come to mind. The first is content providers notifying Google every time new content is released. That's like the old submit system, and like today's Technorati ping system. But if we were all pinging Google, no doubt the voice of spammers would drown out everyone else's. So, that's no good.
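For the curious, that kind of ping is just a tiny XML-RPC call. Here's a minimal sketch in Python of notifying a ping service, assuming the standard weblogUpdates.ping interface; the endpoint URL and blog details are illustrative only, not anything Google actually accepts:

```python
# Minimal sketch of a weblogUpdates-style ping, as used by Technorati.
# The endpoint URL and blog details below are illustrative assumptions.
import xmlrpc.client

def ping_search_service(endpoint, blog_name, blog_url):
    """Notify a ping endpoint that blog_url has fresh content."""
    server = xmlrpc.client.ServerProxy(endpoint)
    # weblogUpdates.ping takes the weblog's name and its URL.
    return server.weblogUpdates.ping(blog_name, blog_url)

if __name__ == "__main__":
    result = ping_search_service(
        "http://rpc.technorati.com/rpc/ping",   # assumed endpoint
        "MyLongTail",                           # assumed blog name
        "http://www.mylongtail.com/",           # assumed blog URL
    )
    print(result)  # typically a small struct with an error flag and message
```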

The second approach is crawlers so massive, omniscient and all-seeing that they can go out and survey the entire state of the Internet several times a day. Today, GoogleBot picks up only a few thousand pages per site per day, and only from popular (high-PageRank) sites, and only when the content isn't "invisible" to Google. The only problems with this approach are bandwidth, processing power and storage capacity. True, all but bandwidth will be approaching unlimited, but it's still probably not viable for the next evolution in search.
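To put some rough numbers on why, here's a back-of-the-envelope calculation. The page count, page size and crawl frequency are all assumed round figures for illustration, not Google's real ones:

```python
# Back-of-the-envelope estimate of what "re-crawl the whole web several
# times a day" costs in bandwidth alone. All inputs are assumed round
# numbers for illustration, not real figures.
pages_on_web = 10e9          # assume ~10 billion reachable pages
avg_page_bytes = 25 * 1024   # assume ~25 KB of HTML per page
crawls_per_day = 4           # "several times a day"

bytes_per_day = pages_on_web * avg_page_bytes * crawls_per_day
bits_per_second = bytes_per_day * 8 / (24 * 3600)

print(f"{bytes_per_day / 1e15:.1f} PB transferred per day")
print(f"{bits_per_second / 1e9:.0f} Gbit/s of sustained crawl bandwidth")
```

Under those assumptions that's roughly a petabyte a day, or on the order of 95 Gbit/s of sustained crawl traffic, just to fetch the HTML before any processing or storage happens.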

The third approach I'm aware of is small-world theory, where crawlers are sent out in real time to find answers to queries, totally abandoning the "indexing" system. Theoretically, every time you searched, the results could be different, depending on the topology of the Internet at that very second. The problem here is that the interlinking may not be good enough to ferret out the highest quality content every time. There is also a bandwidth issue, because the crawl must happen in real time instead of drawing on time-delayed "canned" indexes.
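As a toy illustration of the idea (and nothing more), a query-time crawl might fan out from a few seed pages for a couple of hops and score whatever it finds against the query. The seeds, depth limit and scoring below are all assumptions:

```python
# Toy sketch of a query-time "small world" crawl: instead of consulting a
# pre-built index, fan out from seed pages at search time and score what
# is found right now. Seeds, depth limit and scoring are all assumptions.
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def realtime_search(query, seeds, max_depth=2, max_pages=50):
    terms = query.lower().split()
    seen, results = set(), []
    queue = deque((url, 0) for url in seeds)
    while queue and len(seen) < max_pages:
        url, depth = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue
        # Crude relevance score: how often the query terms appear on the page.
        text = html.lower()
        score = sum(text.count(t) for t in terms)
        if score:
            results.append((score, url))
        if depth < max_depth:
            for link in re.findall(r'href="(http[^"]+)"', html):
                queue.append((urljoin(url, link), depth + 1))
    return sorted(results, reverse=True)

# Results can differ from one minute to the next, because they reflect the
# live pages rather than a time-delayed index, e.g.:
# print(realtime_search("long tail", ["http://www.technorati.com/"]))
```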

Then of course, there are hybrid approaches combining some or all of the above. And the fact that Google's results do update on a daily basis is evidence that they're using some hybrid of these techniques. I doubt they're pushing the entire massive index out to all their datacenters several times/day. I suspect they're not using small-world-theory crawls (at least not in default search). So chances are, they have a different class of crawlers dedicated to picking up what's new and fresh, and distributing it as incremental updates that modify the last big datacenter push.
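One way to picture that kind of incremental update is a small "fresh" delta index that gets merged with the big base index at query time. This is just a conceptual sketch of the pattern, not a claim about how Google's pipeline actually works:

```python
# Conceptual sketch of a base index plus an incremental "fresh" layer.
# The base index stands in for the last big datacenter push; fresh crawls
# only touch the small delta, which is merged with the base at query time.
from collections import defaultdict

class IncrementalIndex:
    def __init__(self):
        self.base = defaultdict(set)    # term -> doc ids, rebuilt rarely
        self.delta = defaultdict(set)   # term -> doc ids, updated continuously

    def add_fresh(self, doc_id, text):
        """A fresh-crawl document goes only into the small delta layer."""
        for term in text.lower().split():
            self.delta[term].add(doc_id)

    def search(self, term):
        """Query time: merge the stale base results with the fresh delta."""
        term = term.lower()
        return self.base[term] | self.delta[term]

    def merge(self):
        """The occasional big push: fold the delta back into the base."""
        for term, docs in self.delta.items():
            self.base[term] |= docs
        self.delta.clear()
```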

Anyway, that's my rambling on what the future of search may bring. In monitoring the popularity growth of HitTail, I find myself first going to Google and Technorati to watch the results fluctuate. I care most about Google, but appreciate the real-time-ness of Technorati. I hunger to view the shifting state of the Internet itself through Google default results. Shouldn't that be possible today with Ajax methods? Perform a search, then sit back and watch the results change as the Internet itself does?
