Published on Nov 15, 2015
As more people rely on search engines as the starting point for finding information, ranking among the top few results of popular search engines has become critical for a page. Most search engines use, among other signals, variants of the classic PageRank algorithm, which relies on the link structure of the web to rank pages.
To make their pages rank higher than they deserve, some web designers resort to all sorts of tricks to mislead search engines, manipulating the links (link-spam) and the content (term-spam) of their pages and of the web; the result has come to be called web-spam. The continuing clash between search-engine algorithm designers and web spammers has turned the web into the battleground known as the Adversarial Web.
Our main focus in this report is link-spam. We look at the different methods of combating it, examine optimal link-spam structures, and test them using Java code. We also implement popular ranking algorithms and test their efficacy on a web-graph made available by Webaroo.
Introduction to Combating Link Spam
As a first step in gearing up for counter-measures, it is prudent to understand the spammers’ ‘arsenal’. This section describes attempts to organize web-spamming techniques into a taxonomy and briefly reviews published statistics about web-spam. There have been discussions in the literature and on the web, but we draw heavily from. We use two terms: importance, the ranking of a page in general, and relevance, the ranking of a page with respect to a specific query.
To delve into link spamming, let’s categorize pages according to how spammers can manipulate them to influence results:
a. Inaccessible pages: Spammers cannot modify these pages. However, they can point to them.
b. Accessible pages: These pages don’t belong to the spammer, but the spammer can modify their content in a limited manner. Typical examples are wikis and blog comments.
c. Own pages: These pages belong to the spammer, who wants to boost the ranking of one or more of them: the target pages, t. Own pages are capped by a budget (e.g. web hosting costs).
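The effect of these three page categories on a link-based ranking can be illustrated with a small experiment. The following is a minimal sketch, not the report's own code: the class name `PageRankSketch`, the six-page graph, the damping factor 0.85, and the iteration count are all assumptions chosen for illustration. It compares the PageRank of a target page before and after a spammer adds one link from an accessible page and a pair of own boosting pages.

```java
import java.util.Arrays;

/** Minimal PageRank sketch on a tiny hand-made graph (illustrative only). */
class PageRankSketch {

    /**
     * Power iteration for PageRank.
     * outLinks[i] lists the pages that page i links to.
     * Pages without outlinks spread their score uniformly over all pages.
     */
    static double[] pageRank(int[][] outLinks, double damping, int iterations) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double danglingMass = 0.0;
            for (int i = 0; i < n; i++) {
                if (outLinks[i].length == 0) {
                    danglingMass += rank[i];
                } else {
                    double share = rank[i] / outLinks[i].length;
                    for (int j : outLinks[i]) {
                        next[j] += share;
                    }
                }
            }
            for (int j = 0; j < n; j++) {
                next[j] = (1.0 - damping) / n
                        + damping * (next[j] + danglingMass / n);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical layout: pages 0-1 inaccessible, page 2 accessible
        // (e.g. a blog with comments), page 3 the target t, pages 4-5 own pages.
        int[][] before = { {1}, {2}, {0}, {}, {}, {} };
        // After spamming: page 2 gains a link to t, own pages point to t,
        // and t links back to its own boosting pages.
        int[][] after = { {1}, {2}, {0, 3}, {4, 5}, {3}, {3} };

        double[] rankBefore = pageRank(before, 0.85, 50);
        double[] rankAfter = pageRank(after, 0.85, 50);
        System.out.printf("target before: %.4f%n", rankBefore[3]);
        System.out.printf("target after:  %.4f%n", rankAfter[3]);
    }
}
```

In this toy graph the target's score rises once the links are added, which is the behavior the categorization is meant to explain; real web graphs are far larger and the size of the effect will differ.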
The algorithms that spammers target include HITS, PageRank, and TrustRank.
HITS ranks hub and authority pages [11]. For HITS, a spammer can easily obtain a high hub score by adding outlinks to popular websites. Some spammers even pay users of highly ranked .edu authority pages to point to their spammy pages. The spammer can then obtain a high authority score by having these unscrupulous hub pages point to a target page, which thereby becomes an authority page.
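The hub-boosting trick can be seen in a small HITS computation. The sketch below is an illustration under stated assumptions rather than the report's implementation: the class name `HitsSketch`, the seven-page graph, and the iteration count are hypothetical. Pages 0 to 2 play the role of popular sites, pages 3 and 4 are honest hubs, page 5 is a spam hub that links to every popular page, and page 6 is the spammer's target.

```java
import java.util.Arrays;

/** Minimal HITS sketch on a tiny hand-made graph (illustrative only). */
class HitsSketch {

    /** Scales v in place to unit Euclidean length (no-op for a zero vector). */
    static void normalise(double[] v) {
        double sumSquares = 0.0;
        for (double x : v) {
            sumSquares += x * x;
        }
        double norm = Math.sqrt(sumSquares);
        if (norm > 0.0) {
            for (int i = 0; i < v.length; i++) {
                v[i] /= norm;
            }
        }
    }

    /** Returns {hubScores, authorityScores} after the given number of rounds. */
    static double[][] hits(int[][] outLinks, int iterations) {
        int n = outLinks.length;
        double[] hub = new double[n];
        double[] auth = new double[n];
        Arrays.fill(hub, 1.0);
        Arrays.fill(auth, 1.0);
        for (int it = 0; it < iterations; it++) {
            // Authority score: sum of hub scores of pages linking in.
            double[] newAuth = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j : outLinks[i]) {
                    newAuth[j] += hub[i];
                }
            }
            normalise(newAuth);
            // Hub score: sum of authority scores of pages linked to.
            double[] newHub = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j : outLinks[i]) {
                    newHub[i] += newAuth[j];
                }
            }
            normalise(newHub);
            auth = newAuth;
            hub = newHub;
        }
        return new double[][] { hub, auth };
    }

    public static void main(String[] args) {
        int[][] outLinks = {
            {}, {}, {},      // pages 0-2: popular sites
            {0, 1}, {1, 2},  // pages 3-4: honest hubs
            {0, 1, 2, 6},    // page 5: spam hub, also points to the target
            {}               // page 6: spammer's target
        };
        double[][] scores = hits(outLinks, 30);
        System.out.println("hub scores:       " + Arrays.toString(scores[0]));
        System.out.println("authority scores: " + Arrays.toString(scores[1]));
    }
}
```

Because the spam hub links to every popular page, its hub score exceeds that of either honest hub, and it passes part of that score on to the target as authority.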
It is important for spammers to conceal their intent from a human visitor. Two techniques used here are:
a. Content hiding: This involves making the spam invisible on the page. This can be done by changing the background color to match the spam text or by using a 1x1-pixel image.
b. Cloaking: Spammers can serve one version of a page to humans and a different one to crawlers. This is done by keeping track of the IP addresses of crawlers and serving them different content.
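The IP-based content switch described above is simple enough to sketch, which also shows why comparing the two versions of a page is a natural way to spot it. The following is a hypothetical illustration: the class name `CloakingSketch`, the crawler addresses (taken from a reserved documentation range), and the comparison check are assumptions, not a description of any real crawler or search engine.

```java
import java.util.Set;

/** Minimal sketch of IP-based content switching (illustrative only). */
class CloakingSketch {

    // Hypothetical crawler addresses from the 192.0.2.0/24 documentation range.
    static final Set<String> KNOWN_CRAWLER_IPS = Set.of("192.0.2.10", "192.0.2.11");

    /** Picks the page version to serve based on the requesting address. */
    static String selectContent(String clientIp) {
        if (KNOWN_CRAWLER_IPS.contains(clientIp)) {
            return "version served to crawlers";
        }
        return "version served to human visitors";
    }

    /**
     * Assumed check: request the page from a crawler address and from an
     * ordinary address, and flag the page if the two versions differ.
     */
    static boolean looksCloaked(String crawlerIp, String visitorIp) {
        return !selectContent(crawlerIp).equals(selectContent(visitorIp));
    }

    public static void main(String[] args) {
        System.out.println(selectContent("192.0.2.10"));
        System.out.println(selectContent("198.51.100.7"));
        System.out.println("cloaked: " + looksCloaked("192.0.2.10", "198.51.100.7"));
    }
}
```

The sketch only works while the spammer's list of crawler addresses is accurate, so a request from an address missing from that list would receive the human-facing version.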