Size doesn’t matter

The size of the web is not a very well-behaved quantity to estimate (although I was once asked to run experiments to try). A guy who worked at another search engine used to joke that the web grew by 2 billion documents whenever he plugged in his notebook. (He was running a web server that generated dynamic pages with number-embedded links to other such pages — crawl his “site” and you would eventually get a different document for every int.) Certainly there are orders of magnitude fewer unique documents out there than there are URLs that elicit a status 200 from some webserver.
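To make the notebook joke concrete, here is a minimal sketch of that kind of "site" and of what a crawler sees when it follows the links. Everything in it is a stand-in I made up for illustration (`render_page`, `outgoing`, `crawl`, the `/page/N` URL scheme, the two-links-per-page layout), not the actual server from the anecdote. The point is only that the count of "unique documents" ends up equal to whatever crawl budget you allow.

```python
import hashlib
from collections import deque


def outgoing(n: int) -> list[int]:
    """Hypothetical link rule: page n links to two higher-numbered pages."""
    return [2 * n + 1, 2 * n + 2]


def render_page(n: int) -> str:
    """Generate the page for integer n on the fly; nothing is stored anywhere."""
    anchors = "".join(f'<a href="/page/{m}">page {m}</a>' for m in outgoing(n))
    return f"<html><body><h1>Document {n}</h1>{anchors}</body></html>"


def crawl(budget: int) -> int:
    """Breadth-first crawl from page 0, stopping after `budget` pages.

    Returns the number of distinct documents seen, judged by content hash.
    """
    visited: set[int] = set()
    content_hashes: set[str] = set()
    queue = deque([0])
    while queue and len(visited) < budget:
        n = queue.popleft()
        if n in visited:
            continue
        visited.add(n)
        html = render_page(n)  # stands in for an HTTP GET returning status 200
        content_hashes.add(hashlib.sha256(html.encode()).hexdigest())
        queue.extend(outgoing(n))
    return len(content_hashes)


if __name__ == "__main__":
    for budget in (10, 1_000, 100_000):
        print(budget, crawl(budget))
```

Every page is byte-for-byte distinct, so exact-duplicate detection does not help here. The only thing that stops the crawl is the budget, which is why a per-site limit shows up in any sane definition of web size.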

But let’s suppose, for the sake of argument, that there are 10B distinguishably different webpages out there in the “visible web”, under some definition of “distinguishably different”, and with some limits on what any one wacko site can contribute. Now the question is: how much of that 10B do you need to have in your websearch index?

Before you say “all of it”, consider how full of crap the web is, _and_ how duplicative a lot of it is (even if not byte-for-byte duplicative). Then consider what you have left _after_ you take out the “best” N billion pages (under some definition of best). How incrementally useful or novel is the 5,000,000,001st best HTML document on the planet? My answer is: not at all, if the goal is to answer a question or give the user some helpful information.
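One way to pin down "duplicative, even if not byte-for-byte" is to compare pages by overlapping word sequences (shingles) and measure the overlap with Jaccard similarity. This is a sketch of that general technique under my own assumptions (3-word shingles, whitespace tokenization, the function names), not a description of how any particular search engine does it.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    if len(words) < k:
        # Very short texts get a single shingle holding all of their words.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(doc_a: str, doc_b: str, k: int = 3) -> float:
    """Shingle-based similarity between two documents, between 0.0 and 1.0."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k))


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    edited = "the quick brown fox leaps over the lazy dog"
    unrelated = "search engines like to brag about index size"
    print(similarity(original, original))
    print(similarity(original, edited))
    print(similarity(original, unrelated))
```

Where you put the cutoff between "same page, slightly edited" and "different page" is a judgment call, and that judgment is a large part of what "some definition of distinguishably different" is hiding.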

On the other hand, size matters if the user is looking for some particular URL (not a particular document) and will conclude that the search engine sucks if it can’t be found. And it matters if you search for your own name, and find more mentions of yourself on this search engine than on that one. And finally, of course, it matters because search engine companies are indiscreet and will even go so far as to brag about their index sizes, and users understandably come to believe that size is an important thing.

So if search engine competition is about the survival of the fittest, then large indexes are like large antlers — the selection is strictly sexual. You can’t move as fast, and you have to eat a lot more, but you’ve got a chance to be the sexiest beast out there, for a while.

[These views may or may not be shared by my employer, who will at any rate remain nameless.]

1 thought on “Size doesn’t matter”

  1. This is worse than N Billion web pages cluttering up the WWW:
    *That half of them look like crap*, are poorly designed, and the design was stolen from another site that looks just as bad.
    So, for all the websites out there that the search engines do or do not have listed, *most of them are pitifully awful.*
    [This is not a comment about your site!]
    Try this search in the search engine of choice:* “+great-web-design, %designed in 2004, +color * — I got a result of only 190 sites. I think that says it all…

    BTW, You need to find a way to delete the previous two comments – you got to hate comment-spammers!
