It starts with a scratchy throat, and (if not treated promptly) progresses to a full-blown belief that content creators everywhere will work together in harmony, and speak with one (meta-)voice. In its origins (in particular, the belief that if we understand what a name/symbol/tag means, then programs will too), it may be related to certain disorders of the AI family.
The afflicted are often unaware of its progress, since when applied to small, cohesive communities of technically informed, well-meaning individuals (early Usenet, the Well, the early Web, current RSS feeds), the beliefs actually make some sense. People will use labels correctly, both because they are acting responsibly and because the whole thing hasn’t spun wildly out of control yet in that really cool way that we all hope it will.
Working to make such cohesion _possible_ is a very fine thing. But assuming that the cohesion will persist is nuts if you’re building a search engine. Although standard web servers serving HTML do have a chance to provide a fair amount of metadata to crawlers, _none_ of it can be relied upon in practice to characterize the actual content being served — not even lastmod time or the language of the page. Crawlers must perform all the due diligence.
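To make the due-diligence point concrete, here is a minimal sketch of a crawler deciding whether a page has changed by hashing the bytes it actually received, rather than trusting whatever lastmod time the server volunteered. The names `FetchedPage`, `content_fingerprint`, and `has_changed` are made up for illustration and are not any real crawler's API; a production system would likely want something more forgiving than an exact hash.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchedPage:
    """Hypothetical record of one fetch; the declared_* fields are claims, not facts."""
    url: str
    body: bytes
    declared_last_modified: Optional[str]  # whatever the server said, unverified
    declared_language: Optional[str]       # likewise only a hint


def content_fingerprint(body: bytes) -> str:
    """Hash the bytes actually served; this is the crawler's own evidence."""
    return hashlib.sha256(body).hexdigest()


def has_changed(page: FetchedPage, previous_fingerprint: Optional[str]) -> bool:
    """Compare content against the last fingerprint, ignoring declared_last_modified."""
    return content_fingerprint(page.body) != previous_fingerprint
```

With this shape, a page that reports a fresh date but serves identical bytes counts as unchanged, and a page that reports a stale date but serves new bytes counts as changed. The same attitude applies to the declared language: keep it as a hint, and check it against the text itself.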
XML offers the promise of more reliable well-formedness, and so certain kinds of parsing issues may be easier (most of the time) if and when all the interesting content is provided that way. But I see no reason to believe yet that the self-description story will be any different this time around, particularly after there starts to be money in this stuff. So do yourself a favor, and ask your doctor about the free SWD screen when you get your next mental checkup.
8 thoughts on “Semantic Web Disease”
Wow, this is the world’s only hit for “semantic web disease”:
Once again, Timboy speaks where no one else dare speak… 😉
Ah, the myths of the semantic web. I see the perspective you are highlighting. Meta HTML tags, volunteered by document authors, cannot be trusted, certainly not blindly. We can expect the same mess from XML “semantic” tags. Yet, I am optimistic: judging, among other things, by the heavily publicized ongoing experiments with the self-healing properties of collaborative systems like Wikipedia, I think there is hope. While not directly relevant to the semantic web itself, Wikipedia shows that trust can be built in large-scale voluntary systems. In the case of structured information, if there is a lot to be gained (in terms of processing automation) from attaching the right meaning or meta-information to data, there is a good chance for shared meaning to converge instead of diverge.
Wikipedia doesn’t make money. That’s why there isn’t any spam yet.
(1) The Semantic Web isn’t really about web pages, or even annotating text. It’s about sharing and combining data (in general) over the web. They just frame discussions in terms of web pages, because that’s something people understand.
(1.5) So, saying that “people won’t enter metadata” isn’t all that relevant, because programs regularly produce hoards of metadata, all the time. (I mean: the data that programs normally keep; much of it is sharable information that can be reused by other programs.) This data doesn’t require people to do anything, really, just programmers.
(2) There are a number of data formats that don’t feature well-defined meanings, and they work out all right. For instance, a lot of programs ask for your “name.” What’s your name? Is that your full name? Last name? First name? Before or after marriage? Or is it a True name, like Ged’s? While two people discuss the questions in depth in the corner, and one person proclaims on a soap box that no program can ever run because we can’t really define things, everyone else just types in their name, however they think of it, and uses the program.
(3) Webs of trust are generally assumed when people talk about semantic web stuff. Either that, or telling your computer, “I approve of this, not that,…” Which is not much different than saying, “I’ll download this program, I don’t think it’s spyware, but not that program, which I don’t trust.”
Here’s a really smart post by Clay Shirky [via Jeremy] on tradeoffs between controlled vocabularies for meta