Exploring The Invisible Web

The Web is a big place, right? There are millions of websites, with billions of pages of information. And the major search engines and directories are the best place to start sifting through that mountain of information to find what you're looking for, correct?

Not really. You see, the Web is a much bigger place than you or the search engines ever imagined. There's something called the Invisible Web (also referred to as the Deep Web or Hidden Web). It consists of all the information that remains locked deep within dynamic, database-driven sites. Because database-driven sites are much more efficient at handling huge amounts of information, they tend to be the richest content sources on the Web. And to most search engines, they're completely invisible.

No Fixed Address

The problem for the search engines comes with the way the information is retrieved from the database. Older, static websites have a unique URL, or address, for every page on the site. These pages are static, meaning the address never changes. This gives a nice, consistent reference to the search engine, which will follow a link to the page (or follow up on a URL submission), spider it, extract its interpretation of the content, and index the page and its URL for future searches.

With dynamic pages, there is no fixed URL. Suppose you click on a link to request more information on "notebook computer cases". Rather than the link being a direct route to a page on notebook computer cases (looking something like this: http://www.notebooksgalore.com/accessories/notebookcases.html), your request is passed along to the database in the form of a query. The server asks the database to serve up all records where the information in the category field matches the terms "notebook" and "computer" and "cases". Hence, you would get a URL that would look something like this: http://www.notebooksgalore.com/catalog/subclass.asp?logon=notebook+computer+cases

The information that is retrieved from the database is brought into a template page and displayed for the user. There is no fixed address for the page and there is no static content. What's more, most search engines aren't even capable of spidering a URL in this format, so all the valuable information from the database remains invisible to them.
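To make the difference between the two addresses concrete, here is a minimal sketch in Python that splits each of the example URLs above into its path and its query. The helper name `describe` is made up for this illustration, and treating "has a query string" as the test for a dynamic page is a simplifying assumption, not a description of how any particular search engine decides.

```python
from urllib.parse import urlsplit, parse_qs

# The two example addresses used in the article.
STATIC_URL = "http://www.notebooksgalore.com/accessories/notebookcases.html"
DYNAMIC_URL = ("http://www.notebooksgalore.com/catalog/subclass.asp"
               "?logon=notebook+computer+cases")


def describe(url):
    """Split a URL into its path and any query parameters (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "path": parts.path,
        # parse_qs decodes "+" as a space, so the three search terms reappear.
        "query": parse_qs(parts.query),
        # Simplifying assumption: a query string marks the page as dynamic.
        "is_dynamic": bool(parts.query),
    }


static_info = describe(STATIC_URL)
dynamic_info = describe(DYNAMIC_URL)
print(static_info)
print(dynamic_info)
```

The static address is nothing but a path to a file, while the dynamic one carries its real content in the part after the question mark.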

Dimensions of the Invisible Web

A recent study by BrightPlanet has shown that the Invisible Web is much larger than we ever expected. Here's a summary of their findings:

  • Public information on the Invisible Web is currently 400 to 550 times larger than the commonly defined World Wide Web

  • The Invisible Web contains 7,500 terabytes of information, compared to 19 terabytes of information in the surface Web

  • The Invisible Web contains nearly 550 billion individual documents compared to the 1 billion of the surface Web

  • More than an estimated 100,000 Invisible Web sites presently exist

  • 60 of the largest Invisible Web sites collectively contain about 750 terabytes of information - sufficient by themselves to exceed the size of the surface Web by 40 times

  • On average, Invisible Web sites receive about 50% greater monthly traffic than surface sites and are more highly linked to than surface sites; however, the typical (median) deep Web site is not well known to the Internet search public

  • The Invisible Web is the largest growing category of new information on the Internet

  • Invisible Web sites tend to be narrower with deeper content than conventional surface sites

  • Total quality content of the Invisible Web is at least 1,000 to 2,000 times greater than that of the surface Web

  • Invisible Web content is highly relevant to every information need, market and domain

  • More than half of the Invisible Web content resides in topic specific databases

  • A full 95% of the Invisible Web is publicly accessible information - not subject to fees or subscriptions

To put this information in perspective, the largest search engine index (Google) claims to index about 1 billion pages. That would mean that once you include the Invisible Web, even the largest search engine only gives you access to roughly 1/500 of all the information available.
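The ratios quoted above can be checked with a few lines of arithmetic. This sketch uses only the figures reported in the summary; the variable names are chosen for this illustration. Worked from those figures, the share of documents a 1 billion page index covers comes out nearer 1/550, in line with the round figure of 1/500.

```python
# Figures as quoted in the BrightPlanet summary above.
surface_docs = 1_000_000_000        # documents on the surface Web
invisible_docs = 550_000_000_000    # documents on the Invisible Web
surface_tb = 19                     # terabytes on the surface Web
invisible_tb = 7_500                # terabytes on the Invisible Web
top60_tb = 750                      # terabytes held by the 60 largest sites

doc_ratio = invisible_docs / surface_docs
size_ratio = invisible_tb / surface_tb
top60_ratio = top60_tb / surface_tb

# Share of all documents reachable through a 1 billion page index.
coverage = surface_docs / (surface_docs + invisible_docs)

print(f"Documents: {doc_ratio:.0f} times the surface Web")
print(f"Volume: {size_ratio:.0f} times the surface Web")
print(f"Top 60 sites: {top60_ratio:.1f} times the surface Web")
print(f"Index coverage: about 1/{1 / coverage:.0f} of all documents")
```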

Dealing with Dynamics

So, if you use a dynamic, database-driven site, do you just have to give up on the notion that a search engine will ever be able to index the content of your site? Absolutely not. There are several workarounds you can use to give search engines a way to access the content on your site.

Translation script

One recent solution is a small server-side translation script that replaces some of the characters in a dynamic URL that stop search engines cold. The biggest culprit in this regard is the question mark character (?) that generally precedes a database query. Once they encounter this character, search engines usually go no further. The server-side script swaps out these characters and replaces them with benign substitutes that still allow your dynamic site to function but provide search engines with spiderable URLs.
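As a rough illustration of the idea, the sketch below rewrites the query part of the earlier example URL into extra path segments and then translates it back. It is written in Python for readability; the function names `to_spiderable` and `to_dynamic` and the choice of "/" as the substitute character are assumptions made for this example, not the behavior of any specific translation product. A real script would run on the web server and would also need to handle values that themselves contain "/" or "=".

```python
from urllib.parse import urlsplit, urlunsplit


def to_spiderable(url):
    """Rewrite ?key=value&key2=value2 as /key/value/key2/value2 (hypothetical)."""
    parts = urlsplit(url)
    if not parts.query:
        return url  # already a plain path, nothing to translate
    segments = []
    for pair in parts.query.split("&"):
        key, _, value = pair.partition("=")
        segments.extend([key, value])
    new_path = parts.path.rstrip("/") + "/" + "/".join(segments)
    return urlunsplit((parts.scheme, parts.netloc, new_path, "", parts.fragment))


def to_dynamic(url, script_path):
    """Turn a rewritten URL back into the query form the script expects."""
    parts = urlsplit(url)
    if not parts.path.startswith(script_path):
        return url
    tail = parts.path[len(script_path):].strip("/")
    if not tail:
        return url
    pieces = tail.split("/")
    # Pair the segments up again as key/value.
    query = "&".join(f"{k}={v}" for k, v in zip(pieces[0::2], pieces[1::2]))
    return urlunsplit((parts.scheme, parts.netloc, script_path, query, parts.fragment))


original = ("http://www.notebooksgalore.com/catalog/subclass.asp"
            "?logon=notebook+computer+cases")
friendly = to_spiderable(original)
restored = to_dynamic(friendly, "/catalog/subclass.asp")
print(friendly)
print(restored)
```

The spider only ever sees the rewritten form, which contains no question mark, while the site's own script still receives the query it was written to expect.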

A word of caution here. Before taking this step on your main site, make sure you thoroughly test the script on a test server to confirm everything runs as expected. I've learned from past experience to take nothing for granted when dealing with databases and queries.

Static Snapshots

Another way you can entice search engine spiders to visit your site is to create static pages from your dynamic content. When a visitor runs a dynamic search, the site takes the information from your database and creates a temporary HTML page so it can display in a browser. It's possible to copy this page, save it as a static HTML page somewhere on your site, and then submit the page to search engines. This would take your dynamic content and give it a permanent home.

Now, if your site has hundreds or thousands of products, creating static snapshots for each product could take a long, long time. We usually start with category pages and the most popular products and work out from there.
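For a large catalog, the copying can be scripted. The following is a minimal sketch in Python, assuming the records have already been pulled from the database into a list; the sample records, the template, the `slug` field and the function name `write_snapshots` are all hypothetical and stand in for whatever your own site uses.

```python
import html
import tempfile
from pathlib import Path
from string import Template

# Hypothetical page template; a real site would reuse its own layout.
PAGE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1><p>$description</p></body></html>"
)

# Made-up sample records standing in for rows fetched from the database.
# Category pages and popular products would come first.
SAMPLE_RECORDS = [
    {"slug": "notebook-cases", "title": "Notebook Computer Cases",
     "description": "Category page listing notebook computer cases."},
    {"slug": "notebook-batteries", "title": "Notebook Batteries",
     "description": "Category page listing notebook batteries."},
]


def write_snapshots(records, out_dir):
    """Write one static HTML file per record and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for record in records:
        page = PAGE.substitute(
            title=html.escape(record["title"]),
            description=html.escape(record["description"]),
        )
        target = out_dir / f"{record['slug']}.html"
        target.write_text(page, encoding="utf-8")
        written.append(target)
    return written


# A temporary directory is used here; on a real site this would be a web folder.
snapshot_dir = tempfile.mkdtemp()
paths = write_snapshots(SAMPLE_RECORDS, snapshot_dir)
for path in paths:
    print(path.name)
```

Each file gets a fixed name, so it can be linked to and submitted like any other static page. The snapshots go stale when the database changes, so a script like this would need to be rerun on a schedule.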

HTML content pages

Another solution is to create a hybrid site with a static index and category pages, along with content-rich information pages. This approach would give your site at least one level the search engines could spider before they encounter the dynamic portion of your site.

All Flash, No Substance

Another challenge presented to search engines comes with sites created in Macromedia's Flash. Although it's a wonderful creative tool, Flash is not the search engine's best friend. The problem is that content in a Flash site is hidden in a proprietary file format, not in HTML. Search engines can't access the content.

Macromedia has tried to provide a workaround for this by giving you the option of taking any text from the Flash file and inserting it as a comment tag on the HTML page containing the Flash file tags. Unfortunately, many search engines ignore comment tags or give very little relevancy to any text found in them.
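To see why text placed in a comment is easy to lose, consider a toy text extractor that, like a simple indexer, collects only the text a visitor would see. This Python sketch uses the standard library's HTML parser; the class name, the sample page and the file name `catalog.swf` are invented for the illustration, and real search engines are far more elaborate than this.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect only the words that appear as visible page text (toy indexer)."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_data(self, data):
        # Called for text between tags; whitespace-only runs add nothing.
        self.words.extend(data.split())

    # handle_comment is deliberately not overridden, so comment text is dropped.


# Hypothetical page: the Flash text has been copied into an HTML comment.
SAMPLE_PAGE = """<html><body>
<!-- Notebook cases in leather and nylon -->
<object data="catalog.swf" type="application/x-shockwave-flash"></object>
<p>Welcome</p>
</body></html>"""

extractor = VisibleTextExtractor()
extractor.feed(SAMPLE_PAGE)
extractor.close()
indexed_words = extractor.words
print(indexed_words)
```

Only the visible paragraph text reaches the word list; the product description sitting in the comment never does.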

The Directory Dilemma

At first glance, human-edited directories such as Yahoo and Looksmart might seem to provide a solution to the Invisible Web problem. After all, directories determine site relevance based on the description and site title you provide when you submit your site. They don't have to spider and index the content of your site.

Unfortunately, directories are even less likely to provide a true snapshot of the information available on the web. The average length of a site title and description combined is usually no more than 15 or 20 words. That's all you have to describe what your site is about. Can you do justice to a site that contains 100,000 pages of information in 15 or 20 words?

Seek Professional Help

If you have a site that uses databases or Flash, it's well worth the money to retain the services of a reputable search engine optimization consultant. Ask how much experience they have in dealing with dynamic sites. Question them on the strategies they would use to get around the problem of pages that can't be spidered. If their entire answer is doorway pages (or one of the thousand other names that they go by now) that are hosted on their server, keep searching. This isn't the answer you're looking for.

Gord Hotchkiss
President and CEO
Enquiro Full Service Search Engine Marketing
Search Engine Positioning by Searchengineposition
-------------------------------------------------------------------------------
Copyright 2005 - Enquiro Search Solutions.
This article can be reproduced in its entirety, if the author credit is retained and there is a prominent source link to www.enquiro.com.
Visit our technical and news site www.searchengineposition.com.