The Web is a big place, right?
There are millions of websites, with billions of pages of information. And
the major search engines and directories are the best place to start
sifting through that mountain of information to find what you're looking
for, correct?
Not really. You see, the Web is a much bigger place than you or the search
engines ever imagined. There's something called the Invisible Web (also
referred to as the Deep Web or Hidden Web). It consists of all the
information that remains locked deep within dynamic, database-driven sites.
Because database-driven sites are much more efficient at handling huge
amounts of information, they tend to be the richest content sources on the
Web. And to most search engines, they're completely invisible.
No Fixed Address
The problem for the search engines comes with the way the information is
retrieved from the database. Older, static websites have a unique URL,
or address, for every page on the site. These pages are static, meaning
the address never changes. This gives a nice, consistent reference to
the search engine, which will follow a link to the page (or follow up on
a URL submission), spider it, extract its interpretation of the content,
and index the page and its URL for future searches.
With dynamic pages, there is no fixed URL. When you click on a link to
request more information on "notebook computer cases", rather than the
link being a direct route to a page on notebook computer cases (looking
something like this: http://www.notebooksgalore.com/accessories/notebookcases.html),
your request is passed along to the database in the form of a query. It
goes and asks the database to serve up all records where the information
in the category field matches the terms "notebook" and "computer" and
"cases". Hence, you would get a URL that would look something like this:
http://www.notebooksgalore.com/catalog/subclass.asp?logon=notebook+computer+cases
The information that is retrieved from the database is brought into a
template page and displayed for the user. There is no fixed address for
the page and there is no static content. What's more, most search
engines aren't even capable of spidering a URL in this format. It's not
written in a language they can understand. All the valuable information
from the database remains invisible to the search engines.
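To make the difference concrete, here is a minimal Python sketch that pulls apart the two example URLs above. The `describe` helper is hypothetical and only for illustration. It shows that the static address is nothing but a path, while the dynamic one is a script name plus a query that gets handed to the database.

```python
from urllib.parse import urlsplit, parse_qs

# The two example addresses from the text above.
STATIC_URL = "http://www.notebooksgalore.com/accessories/notebookcases.html"
DYNAMIC_URL = (
    "http://www.notebooksgalore.com/catalog/subclass.asp"
    "?logon=notebook+computer+cases"
)


def describe(url):
    """Return the path and the decoded query parameters of a URL."""
    parts = urlsplit(url)
    # parse_qs turns "a=b+c" into {"a": ["b c"]}; an empty query gives {}.
    return parts.path, parse_qs(parts.query)


print(describe(STATIC_URL))
print(describe(DYNAMIC_URL))
```

The static URL comes back with an empty set of parameters. The dynamic URL comes back with the script path and the search terms, which is the part a spider of that era would refuse to follow.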
Dimensions of the Invisible Web
A recent study by BrightPlanet has shown that the Invisible Web is much
larger than we ever expected. Here's a summary of their findings:
- Public information on the Invisible Web is currently 400 to 550 times
  larger than the commonly defined World Wide Web
- The Invisible Web contains 7,500 terabytes of information, compared to
  19 terabytes of information in the surface Web
- The Invisible Web contains nearly 550 billion individual documents
  compared to the 1 billion of the surface Web
- More than an estimated 100,000 Invisible Web sites presently exist
- 60 of the largest Invisible Web sites collectively contain about 750
  terabytes of information - sufficient by themselves to exceed the size
  of the surface Web by 40 times
- On average, Invisible Web sites receive about 50% greater monthly
  traffic than surface sites and are more highly linked to than surface
  sites; however, the typical (median) deep Web site is not well known to
  the Internet search public
- The Invisible Web is the largest growing category of new information on
  the Internet
- Invisible Web sites tend to be narrower with deeper content than
  conventional surface sites
- Total quality content of the Invisible Web is at least 1,000 to 2,000
  times greater than that of the surface Web
- Invisible Web content is highly relevant to every information need,
  market and domain
- More than half of the Invisible Web content resides in topic specific
  databases
- A full 95% of the Invisible Web is publicly accessible information -
  not subject to fees or subscriptions.
To put this information in perspective, the largest search engine index
(Google) claims to index about 1 billion pages. That would mean that once
you include the Invisible Web, even the largest search engine only gives
you access to roughly 1/500 of all the information available.
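As a quick sanity check, the ratios can be recomputed from the BrightPlanet figures quoted above. This is only back-of-the-envelope arithmetic in Python, and the variable names are my own.

```python
# Figures quoted from the BrightPlanet summary above.
SURFACE_DOCS = 1_000_000_000      # documents on the surface Web
DEEP_DOCS = 550_000_000_000       # documents on the Invisible Web
SURFACE_TB = 19                   # terabytes on the surface Web
DEEP_TB = 7_500                   # terabytes on the Invisible Web
TOP_60_TB = 750                   # terabytes held by the 60 largest sites

size_ratio = DEEP_TB / SURFACE_TB        # roughly 395 times larger
top_60_ratio = TOP_60_TB / SURFACE_TB    # roughly 39 times larger
# Share of all documents that a 1-billion-page index can reach.
coverage = SURFACE_DOCS / (SURFACE_DOCS + DEEP_DOCS)   # exactly 1/551

print(f"Invisible Web vs surface Web by size: {size_ratio:.0f}x")
print(f"Top 60 sites vs surface Web by size: {top_60_ratio:.1f}x")
print(f"Share reachable through a 1-billion-page index: 1/{1 / coverage:.0f}")
```

The results line up with the rounded numbers in the summary: about 400 times by size, about 40 times for the top 60 sites, and a reachable share in the neighborhood of 1/500.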
Dealing with Dynamics
So, if you use a dynamic, database-driven site, do you just have to give
up on the notion that a search engine will ever be able to index the
content of your site? Absolutely not. There are several workarounds we
can use to start providing search engines with a way to access the
content on your site.
Translation script
One recent solution is a small server-side translation script that
replaces some of the characters in a dynamic URL that stop search
engines cold. The biggest culprit in this regard is the question mark
character (?) that generally precedes a database query. Once they
encounter this character, search engines usually go no further. The
server-side script swaps out these characters and replaces them with
benign substitutes that still allow your dynamic site to function but
provide search engines with spiderable URLs.
A word of caution here. Before taking this step on your main site,
thoroughly test the script on a test server to make sure everything runs
as expected. I've learned from past experience to take nothing for
granted when dealing with databases and queries.
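The idea behind such a translation script can be sketched in a few lines of Python. Treat this as an illustration under assumptions, not as any particular product's method: the choice of a slash as the benign substitute and the function names `to_spiderable` and `to_dynamic` are mine. The first function produces the URL you would publish in links, and the second reverses it on the server so the original script still receives its query.

```python
from urllib.parse import urlsplit, urlunsplit


def to_spiderable(url):
    """Rewrite ?key=value&key2=value2 as /key/value/key2/value2.

    Assumption: "/" is the substitute character; other schemes exist.
    """
    parts = urlsplit(url)
    if not parts.query:
        return url  # nothing to translate
    segments = []
    for pair in parts.query.split("&"):
        key, _, value = pair.partition("=")
        segments += [key, value]
    path = parts.path.rstrip("/") + "/" + "/".join(segments)
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def to_dynamic(url, script_path):
    """Reverse the rewrite so the script at script_path gets its query back."""
    parts = urlsplit(url)
    tail = parts.path[len(script_path):].strip("/")
    segments = tail.split("/") if tail else []
    # Pair the segments back up as key, value, key, value, ...
    pairs = zip(segments[0::2], segments[1::2])
    query = "&".join(f"{key}={value}" for key, value in pairs)
    return urlunsplit((parts.scheme, parts.netloc, script_path, query, ""))


original = (
    "http://www.notebooksgalore.com/catalog/subclass.asp"
    "?logon=notebook+computer+cases"
)
friendly = to_spiderable(original)
print(friendly)
print(to_dynamic(friendly, "/catalog/subclass.asp"))
```

This sketch does not handle values that themselves contain a slash or an empty value in the middle of a query, which is exactly the kind of edge case worth catching in testing.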
Static Snapshots
Another way you can entice search engine spiders to visit your site is
to create static pages from your dynamic content. When you do a dynamic
search, the site takes the information from your database and creates a
temporary HTML page so it can display in a browser. It's possible to
copy this page, save it as a static HTML page somewhere on your site,
and then submit the page to search engines. This would take your dynamic
content and give it a permanent home.
Now, if your site has hundreds or thousands of products, creating static
snapshots for each product could take a long, long time. We usually
start with category pages and the most popular products and work out
from there.
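If copying pages by hand sounds tedious, the same idea can be scripted. The Python sketch below is a minimal illustration only: the template, the sample records, and the `write_snapshots` helper are all hypothetical stand-ins for whatever your own database and page template look like.

```python
from html import escape
from pathlib import Path

# Hypothetical page template; a real site would reuse its own layout.
TEMPLATE = (
    "<html><head><title>{title}</title></head>"
    "<body><h1>{title}</h1><p>{body}</p></body></html>\n"
)

# Hypothetical records standing in for rows fetched from the database.
SAMPLE_RECORDS = [
    {
        "slug": "notebook-cases",
        "title": "Notebook Cases",
        "body": "Padded cases & sleeves for notebook computers.",
    },
    {
        "slug": "laptop-bags",
        "title": "Laptop Bags",
        "body": "Shoulder bags and backpacks.",
    },
]


def write_snapshots(records, out_dir):
    """Render each record into the template and save it as a static page."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for rec in records:
        # Escape the text so characters like "&" stay valid HTML.
        page = TEMPLATE.format(
            title=escape(rec["title"]), body=escape(rec["body"])
        )
        target = out / f"{rec['slug']}.html"
        target.write_text(page, encoding="utf-8")
        written.append(target)
    return written
```

Calling `write_snapshots(SAMPLE_RECORDS, "snapshots")` would leave one fixed-address HTML file per record in a `snapshots` folder, ready to be linked from the site and submitted to search engines. Starting with category pages and best sellers keeps the list of records short.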
HTML content pages
Another solution is to create a hybrid site with static index and
category pages, along with content-rich information pages. This approach
would give your site at least one level the search engines could spider
before they encounter the dynamic portion of your site.
All Flash, No Substance
Another challenge presented to search engines comes with sites created
in Macromedia's Flash. Although it's a wonderful creative tool, Flash is
not the
search engine's best friend. The problem is that content in a Flash site
is hidden in a proprietary file format, not in HTML. Search engines
can't access the content.
Macromedia has tried to provide a workaround for this by giving you the
option of taking any text from the Flash file and inserting it as a
comment tag on the HTML page containing the Flash file tags.
Unfortunately, many search engines ignore comment tags or give very
little weight to any text found in them.
The Directory Dilemma
At first glance, human-edited directories such as Yahoo and Looksmart
might seem to provide a solution to the Invisible Web problem. After
all, directories determine site relevance based on the description and
site title you provide when you submit your site. They don't have to
spider and index the content of your site.
Unfortunately, directories are even less likely to provide a true
snapshot of the information available on the Web. The average length of
a site title and description combined is usually no more than 15 or 20
words. That's all you have to describe what your site is about. Can you
do justice to a site that contains 100,000 pages of information in 15 or
20 words?
Seek Professional Help
If you have a site that uses databases or Flash, it's well worth the
money to retain the services of a reputable search engine optimization
consultant. Ask how much experience they have in dealing with dynamic
sites. Question them on the strategies they would use to get around the
problem of pages that can't be spidered. If their entire answer is
doorway pages (or one of the thousand other names that they go by now)
that are hosted on their server, keep searching. This isn't the answer
you're looking for.