Financial Daily from THE HINDU group of publications
Monday, Jan 23, 2006
eWorld
More light on dark stuff
D. Murali
WE are all so tired of e-mail spam that Julie Morgenstern's advice, `Never Check E-Mail in the Morning', seems a helpful solution. But how do you handle Web spamming? This is the menace that successfully misleads search engines into "ranking some pages higher than they deserve," write Zoltán Gyöngyi and Hector Garcia-Molina of Stanford University's Computer Science Department in a research paper titled `Web Spam Taxonomy'.

An example that the authors give is of results obtained for the query `Kaiser pharmacy'. The second result they got was a page on the spam Web site techdictionary.com, which contained "only a few lines of useful information (mainly some term definitions, probably copied from a real dictionary)" but consisted of "thousands of pages, each repeating the same content and pointing to dozens of other pages". The authors postulate that all the pages were probably created to boost the rankings of some others, and that none of them seemed particularly useful for anyone looking for pharmacies affiliated with Kaiser-Permanente (which is "America's leading integrated health care organisation," as www.kaiserpermanente.org informs).

When cyber space is unlimited, should Web spam bother us? Yes, aver the authors, because our search results degrade; also, perpetrators of Web spam can gain financially when users pay attention to highly ranked sites. Don't forget that spam inflates the indexes of search engines with `useless pages' and thus increases `the cost of each processed query'. It is, therefore, necessary to combat spam, argue the authors. They lay out "the first comprehensive taxonomy of all important spamming techniques known to date." For, to fight the evil, you must know all the cloaks that the mischief dons.
Definitions and measures
The paper begins with a few definitions and explanations of concepts. "The objective of a search engine is to provide high-quality results by correctly identifying all Web pages that are relevant for a specific query, and presenting the user with some of the most important of those relevant pages," the authors explain. The measure of `relevance' is the `textual similarity between the query and a page', a numeric score that is query-specific. `Importance' is query-independent; for example, "pages with many incoming links are more important". Taking both factors into account, search engines compute `a combined ranking score' for presenting results to a user.

"In evaluating textual relevance, search engines consider where on a web page query terms occur," explain the authors. "Each type of location is called a field. The common text fields for a page `p' are the document body, the title, the meta tags in the HTML header, and page p's URL."

TFIDF or `term frequency-inverse document frequency', a metric used in information retrieval, is fundamental to many algorithms used by search engines to rank Web pages. "Given a specific text field, for each term `t' that is common for the text field and a query, TF(t) is the frequency of that term in the text field," explains the paper. "For instance, if the term `apple' appears 6 times in the document body that is made up of a total of 30 terms, TF(`apple') is 6/30 = 0.2." IDF(t) of a term t is related to the number of documents in the collection that contain `t'. Thus, "if `apple' appears in 4 out of the 40 documents in the collection, its IDF(`apple') score will be 10."

Two well-known algorithms used for computing importance scores are HITS and PageRank. According to HITS, "important hub pages are those that point to many important authority pages, while important authority pages are those pointed to by many hubs."
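The TF and IDF arithmetic quoted above can be reproduced in a few lines. This is only a minimal sketch: the function names are hypothetical, and IDF is taken as the plain ratio of collection size to matching documents because that is what the `apple' example implies (many retrieval systems use a logarithm of that ratio instead).

```python
def term_frequency(term, field_terms):
    """Share of the terms in one text field that are equal to `term`."""
    if not field_terms:
        return 0.0
    return field_terms.count(term) / len(field_terms)


def inverse_document_frequency(term, documents):
    """Collection size divided by the number of documents containing `term`.

    Assumption: a plain ratio, matching the 40 / 4 = 10 example in the text.
    """
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        return 0.0
    return len(documents) / containing


# A document body of 30 terms in which `apple' appears 6 times.
body = ["apple"] * 6 + ["filler"] * 24

# A collection of 40 documents, 4 of which contain `apple'.
collection = [["apple", "pie"]] * 4 + [["other", "words"]] * 36

tf_apple = term_frequency("apple", body)
idf_apple = inverse_document_frequency("apple", collection)
```

With these inputs, `tf_apple` corresponds to 6/30 and `idf_apple` to 40/4, the two figures given in the paper's example.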
PageRank (of Google founders Lawrence Page and Sergey Brin) uses incoming link information to assign global importance scores to all pages on the Web. "It assumes that the number of incoming links to a page is related to that page's popularity among average Web users."

Spamming, a.k.a. spamdexing, is defined as "any deliberate human action that is meant to trigger an unjustifiably favourable relevance or importance for some Web page, considering the page's true value." From a spammer's angle, pages are of three types: inaccessible (out of reach), accessible (such as blog comments), and own (where control over contents is complete; the group of own pages is a spam farm).

The phrase `term spamming' refers to techniques that aim at tailoring the contents of text fields to make spam pages relevant for certain queries. Since spammers can have no real control over the IDF scores of terms, they work on increasing the TF score, and thereby the TFIDF score, which is a product of TF and IDF.

SEOs (search engine optimisers) such as www.seoinco.com help create well-structured, high-quality pages that benefit the Web community. However, the authors rue that most SEOs engage in spamming, justifying it as `ethical' Web page positioning or optimisation.
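To see why term spammers concentrate on TF, consider a rough sketch of the TFIDF product described above. The function name and the figures for the stuffed page are assumptions made for illustration; IDF is again taken as a plain ratio, which may differ from what real search engines compute.

```python
def tfidf(term_count, field_length, docs_in_collection, docs_with_term):
    """TFIDF as the product of TF and IDF for one term in one text field."""
    tf = term_count / field_length                # under the page author's control
    idf = docs_in_collection / docs_with_term     # fixed by the whole collection
    return tf * idf


# An ordinary page: 6 occurrences of the term in a body of 30 terms.
honest_score = tfidf(6, 30, 40, 4)

# The same page after a spammer appends the term 30 more times (hypothetical).
# The IDF inputs are unchanged; only the TF part of the product grows.
stuffed_score = tfidf(6 + 30, 30 + 30, 40, 4)
```

Repetition triples the score here without touching the collection-wide numbers, which is the lever the paper says spammers are left with.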
Boosting and hiding
Techniques discussed in the paper fall under two categories, viz. boosting and hiding. The first covers methods aimed at achieving "high relevance and/or importance for some pages." The second category, `hiding', includes techniques that are aimed not so much at influencing the search engine's ranking algorithms as at hiding "the adopted boosting techniques from the eyes of human Web users." Term spamming is first dissected `based on the text field in which spamming occurs', such as the document body, the title, the meta tags or the URL.
Based on the type of terms added to text fields, the paper categorises term spamming techniques as repetition (of a few specific terms), dumping (often even entire dictionaries!), weaving (spam terms inserted into copied news articles to cheat algorithms that filter out plain repetition), and phrase stitching (gluing together sentences and phrases from different sources).

Boosting is also possible through `link spamming', whereby the spammer adds outgoing links to popular pages or gathers many incoming links to a single target page or a group of pages. For instance, to beat HITS, a spammer may rig the hub scores "by adding outgoing links to a large number of well known, reputable pages, such as www.cnn.com or www.mit.edu," point out the authors. "Obtaining a high authority score is more complicated, as it implies having many incoming links from presumably important hubs."

Instead of manually adding numerous outgoing links, spammers take a shortcut: directory cloning, using sites such as DMOZ Open Directory (dmoz.org), Yahoo! Directory (dir.yahoo.com) and Librarian's Index to the Internet (lii.org). "These directories organise Web content around topics and subtopics, and list relevant sites for each. Spammers then often simply replicate some or all of the pages of a directory, and thus create massive outgoing-link structures quickly." The taxonomy goes on to describe techniques for accumulating incoming links.
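The hub-score trick mentioned above can be illustrated with a toy version of the HITS iteration. This is a simplified sketch rather than the algorithm as any search engine runs it: the graph, the `.example' page names and the iteration count are assumptions, and only www.cnn.com and www.mit.edu come from the paper's example.

```python
def hits(graph, iterations=50):
    """Toy HITS: returns (hub, authority) score dicts for a link graph.

    `graph` maps each page to the list of pages it links to.
    """
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages pointing at it.
        auth = {p: sum(hub[q] for q in pages if p in graph.get(q, ()))
                for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub: sum of the authority scores of the pages it points to.
        hub = {p: sum(auth[t] for t in graph.get(p, ())) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Before: the spam page (hypothetical name) has no outgoing links.
before = {
    "hub-a.example": ["www.cnn.com", "www.mit.edu"],
    "hub-b.example": ["www.cnn.com"],
    "spam.example": [],
}
# After: the spammer adds outgoing links to the reputable pages.
after = dict(before, **{"spam.example": ["www.cnn.com", "www.mit.edu"]})

hub_before, _ = hits(before)
hub_after, _ = hits(after)
```

In this toy graph the spam page's hub score starts at zero and, once it copies the outgoing links of a genuine hub, ends up level with that hub. Its authority score stays at zero because nothing links to it, which matches the paper's remark that authority is harder to fake.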
One learns from the paper that to hide `the telltale signs', like repetition and long lists of links, spammers use many techniques, such as: colour schemes (so the terms are in the same colour as the background); "tiny 1x1-pixel anchor images that are either transparent or background-coloured"; and scripting to hide visual elements, such as "by setting the visible HTML style attribute to false". Cloaking is more sophisticated; "given a URL, spam Web servers return one specific HTML document to a regular Web browser, while they return a different document to a Web crawler."

You may ask if it is wise to reveal spamming secrets. The paper answers the question thus: "Nothing in this paper is secret to the spammers; it is only most of the Web users who are unfamiliar with the techniques presented here." The authors believe that publicising spamming techniques can raise the awareness and interest of the research community. In conclusion, they make a case for `adequate link analysis algorithms' to separate the reputable pages from spam, so that your searches aren't overwhelmed by the chaff of the spam cloud.

Picture by R. K. Mustafah
Copyright © 2006, The
Hindu Business Line. Republication or redissemination of the contents of
this screen are expressly prohibited without the written consent of
The Hindu Business Line