The overwhelming majority of search engines
on the WWW, such as Google, AltaVista, Lycos and
InfoSeek, are spider-based. Understanding how they
work can help you get the most out of them.
Though the term "search engine" is often
used to describe all kinds of retrieval tools, spider-based
search engines differ considerably from human-powered
directories. We discussed human-powered directories
in the last issue; this week we take a close look at
spider-based search engines.
Unlike directory-type search engines,
spider-based search engines (also called crawlers,
robots or worms) seek out web pages by 'crawling' through
the WWW and automatically index sites using their own
indexing rules or algorithms.
You simply tell the search engine what
your URL is, and its software robot goes there
automatically and indexes everything it needs. How much
it indexes, and in what depth, depends on its algorithm,
which in many cases is a closely guarded secret.
Parts of a Spider-Based Search Engine
Spider-based search engines have three
major elements:
- Spider
- Index
- Search
The spider or crawler, as its name implies,
crawls through the WWW, finds a web page, reads it,
and then follows links to other pages within the site.
It repeats this process at regular intervals to check
for new information or changes on the page.
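
To make the crawling step concrete, here is a
minimal sketch in Python. It is only an illustration,
not how any particular search engine works: the
FAKE_WEB dictionary, the example.com addresses and
the names LinkExtractor and crawl are all made up
for this example, and a real spider would fetch
pages over HTTP instead of reading them from memory.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical stand-in for the web: URL -> HTML source.
# A real spider would download each page over HTTP instead.
FAKE_WEB = {
    "http://example.com/": (
        '<a href="/about.html">About</a> '
        '<a href="http://other.example/">Elsewhere</a>'
    ),
    "http://example.com/about.html": (
        '<a href="/">Home</a> <a href="contact.html">Contact</a>'
    ),
    "http://example.com/contact.html": "<p>Write to us.</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch=FAKE_WEB.get):
    """Read a page, then follow its links to other pages on the same site."""
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be found
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            # Stay within the site and never queue a page twice.
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling crawl("http://example.com/") returns the
three example.com pages and skips the link that leads
off the site. Running the same function again later
would correspond to the spider's periodic re-visit.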
Information collected by the spider goes
into the second part of the search engine - the index.
The index is like a giant book containing a copy of
every web page that the spider finds. If a web page
changes, then this book is updated with new information.
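
The "giant book" idea can be sketched in a few
lines of Python. This is a simplified illustration
under stated assumptions: the Index class and its
update method are hypothetical names, and the
word-to-page lookup table is one common way to make
such a book searchable, not something any specific
engine is known to use in this form.

```python
import re
from collections import defaultdict


class Index:
    """Keeps a copy of each page plus a word -> pages lookup table."""

    def __init__(self):
        self.pages = {}                # url -> copy of the page text
        self.words = defaultdict(set)  # word -> set of urls containing it

    @staticmethod
    def tokenize(text):
        # Assumption: words are lower-cased runs of letters and digits.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def update(self, url, text):
        """Store a new or changed page; return True if the index changed."""
        old = self.pages.get(url)
        if old == text:
            return False  # page unchanged since the last visit
        if old is not None:
            # Remove entries that came from the outdated copy.
            for word in self.tokenize(old):
                self.words[word].discard(url)
        self.pages[url] = text
        for word in self.tokenize(text):
            self.words[word].add(url)
        return True
```

When the spider re-visits a page, it would call
update again with the fresh text. If nothing has
changed the index is left alone; otherwise the old
entries for that page are replaced by the new ones.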
The above two parts work in the background;
we only get to see the third part of a search engine
- the search software. This is a computer program
that sifts through the millions of pages recorded
in the index to find matches to a search and ranks
them in order of relevance. The order of relevance
is decided entirely by the engine's own algorithm.
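
Since real ranking algorithms are not published,
the sketch below uses a deliberately naive stand-in:
it scores each page by how many times the query words
appear in it. The function names, SAMPLE_INDEX and
the example.com addresses are invented for
illustration only.

```python
import re


def tokenize(text):
    # Assumption: words are lower-cased runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def search(index, query):
    """Return matching URLs, best match first.

    Relevance here is simply the number of times the query
    words occur in the page - a toy rule, not a real engine's.
    """
    terms = tokenize(query)
    scored = []
    for url, text in index.items():
        words = tokenize(text)
        score = sum(words.count(term) for term in terms)
        if score > 0:
            scored.append((score, url))
    # Highest score first; ties are broken alphabetically by URL.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]


# Hypothetical index contents: url -> page text.
SAMPLE_INDEX = {
    "http://example.com/spiders": "Spiders crawl the web. Spiders follow links.",
    "http://example.com/directories": "Human editors compile directories of the web.",
    "http://example.com/recipes": "A recipe for bread.",
}
```

With this toy rule, search(SAMPLE_INDEX, "web spiders")
lists the spiders page ahead of the directories page
and leaves out the recipes page, which matches neither
word. A different scoring rule would give a different
order, which is why two engines can rank the same
pages differently.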
Features of Spider-Based Search Engines and
Implications for Search Results
The ability of a spider to crawl through
millions of web pages and create an index without human
intervention makes it a very powerful search tool with
extremely broad coverage. Its second ability, re-visiting
indexed pages at regular intervals to check for changes
or new information and keep the index up to date,
again without human intervention, is just as impressive.
However, the greatest strength of a spider-based
search engine is also its greatest weakness. Broad
coverage and the absence of human editing mean that a
significant amount of junk or useless information ends
up in the search results. This is particularly so when
the search query is loosely worded.
The key to getting the best out of a spider-based
search engine is to understand some basics of searching.
In coming issues we shall discuss a few tips that can
get you significantly better search results.