It’s always nice when the borderline between valuable content and webspam content is clear-cut – in that case, the goal of the search engine is straightforwardly to keep the spam out of search results. Unfortunately, some quality issues are a continuum, from the best of the web to the worst.

One of these issues is what to do about “aggregators” – sites and pages that live only to display arrangements of links and bits of content drawn from other sources. The range is continuous, from very tuned and sophisticated content-clustering and layout engines to the worst kinds of scraper spam.

For high-quality aggregation, try Google News, or ScienceBlogs (which JP turned me on to recently). Google News shows a clustering of news stories that is famously untouched by human hands. I don’t know for sure that ScienceBlogs is entirely an aggregation, without any new content, but it looks that way.

Search engines usually want to show content that provides unique value to users – collecting up a bunch of content found elsewhere seems to violate that. On the other hand, are these high-quality sites? Absolutely. And can pure compilation or aggregation add value? Well, I-am-not-a-lawyer, but apparently copyright law gives at least some thin and guarded support for compilation copyrights. And if humans can get credit for assemblage, I think we should extend the courtesy to algos too.

So some high-quality aggregation sites do belong in a websearch index. With that said, you might not always want them to come out on top in a search. If a query matches some snippet of a story on ScienceBlogs, then probably the original story itself would be a better match – but let’s rely on differential relevance algorithms to sort that out.

At the other end of the spectrum, check out this portion of a doc I found while doing an ego search on a websearch engine:

If I am interested in all things Converse (and I am, I am!) then I should really be interested in this doc … but I’m not. There’s no discernible cleverness in the grouping, and no detectable relevance sorting being applied to the search results. This doesn’t belong in a websearch index, as it’s hard to imagine any querier being satisfied by it.

Other variants of this kind of spam technique cross the line between sampling/aggregation and outright content theft. Imagine that you get home one night to find a stranger leaving your house with a sack containing your TV, cell phone, and jewelry. You might misunderstand, until we explain that he’s actually an _aggregator_ – he’s just _aggregating_ your belongings. Yeah, that’s it.

As an in-between case, ask yourself this: if you’re doing a websearch (on Google, Yahoo!, MSN, …) do you want any of the results to be … search-result pages themselves (from Google, Yahoo!, MSN)? That is, if you search for “snorklewacker” on MSN web search, and you click on result #4, do you want to find yourself looking at a websearch results page for “snorklewacker” on Yahoo! Search, which in turn has (as result #3) the Google search results page for “snorklewacker”? It’s easy to construct links that run specific searches and embed them in pages that will be crawled, so this could happen if search engines didn’t take steps to prevent it (even just by excluding their own results pages via robots.txt).

For one thing this seems like a dangerous recursion that threatens to blow the stack of the very Internet itself. (It’s for reasons like this that I generally wear protective goggles when searching at home – safety first.) But mainly, it just doesn’t seem to be getting anywhere. Search engines are aggregators themselves – on-demand aggregators that show you a new page of results based on the terms you typed. What’s the point of fobbing you off on another term-based aggregator?
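The robots.txt step mentioned above can be sketched roughly as follows. This is only an illustration using Python's standard `urllib.robotparser`. The robots.txt contents, the `/search` path, the `ExampleBot` user agent, and the `search.example.com` host are all made-up assumptions, not what any real search engine actually serves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a search engine might publish to keep its own
# result pages out of other crawlers' indexes (assumed for illustration only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def may_crawl(url: str, robots_txt: str = ROBOTS_TXT, agent: str = "ExampleBot") -> bool:
    """Return True if a polite crawler named `agent` may fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # A constructed search link embedded in some crawled page: should be skipped.
    print(may_crawl("https://search.example.com/search?q=snorklewacker"))
    # An ordinary page on the same (hypothetical) host: fine to crawl.
    print(may_crawl("https://search.example.com/about"))
```

A crawler that honors this kind of rule never fetches the other engine's result pages, so the recursion never gets started. It does nothing, of course, about aggregators that don't exclude themselves.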

Blog aggregators, search engines, tag pages – all of these are fine things as starting points, but the potential for tail-chasing is pretty high if they all point to each other. I say the bar for inclusion ought to be pretty high (though as always it’s just MHO, and not to be confused with my employer’s O at all).

7 thoughts on “Aggregation”

  1. Just a little twist on Greg’s question:
    I run a sailing-focused directory and find that many sites that I list are not properly optimised; as a result, people searching for those sites find mine first and then go on to the relevant site.
    Pretty good argument for human-compiled directories, I think.

  2. Interesting post Tim, some very good points brought up.

    To be fair, it is very difficult to decide what is and isn’t a valid aggregator website; the real question is where you draw the line as a search engine that is indexing these websites. After all, if someone scrapes a bunch of SERPs and mixes the order up, is this as good as someone who manually looks for similar sites and puts up a directory? Just because they put more effort into it doesn’t mean it’s a more valuable resource.

    Another point worth considering: just because they organise the data in pretty tables and include some images doesn’t mean they’re offering a better resource than someone who just puts the raw data into a few tables, as you have shown above.

    It’s a tricky one.

  3. Greg, Alan, Esrun –

    Sure, I think of high-quality directories as adding value, and in general you want them at least available to search engines in the index.

    Where such directories should rank for most queries is a different question. In some cases, the ultimate user target _is_ actually a directory or survey. More often, though, the user really ultimately wants a single document as destination. If the search engine can deliver the single target at #1, then that is better than delivering a directory that points to the destination. Minimizing the number of clicks needed for satisfaction is the name of the game.

    To Alan’s point, search engines should deliver directories when the query is too general for a specific document _or_ when (for whatever reason) the engine can’t find the right specific docs. I hadn’t thought of directory SEO as a value-add as Alan points out, but it does make sense.

    To Esrun’s point, I agree that it’s a tricky line to draw – but does anyone think that the “Converse” scraper spam example is on the good side of the line?

  4. Tim,

    I wouldn’t say that the converse example is on the good side but I personally see three sides:


    A good example would be a directory which has re-organised results in some way that would better match the user’s needs, or which provides some further level of service such as screenshots of the pages.

    An average example would be one like the one shown in this converse example: it gathers relevant links from other search engines.

    A bad example would be one which offers very little value, such as those that scrape SERPs but remove links to the actual pages or make it very difficult to click on anything other than an advert.

    I think a lot of it comes down to your own SEO methods; those in the BH industry will obviously have broader opinions of what is acceptable.

  5. With regard to the “Converse” scraper spam example, I think that the rule of thumb here from the point of view of the searcher is: “Did I benefit from my experience with this page?”

    The rule of thumb for publishers is “What is my intent?” and, unfortunately, no algo will be able to honestly fathom intent.

    The page may have the greatest intent in the world but be presented poorly, and this is where the searcher needs to ask, “Did I benefit from my experience with this page?”

    Wouldn’t it be great if people did not feel the need to artificially inflate results?

    I would be interested to see the level and placement of Advertising surrounding the ‘content’ of the ‘Converse results’ page. That would be the kicker for me.
