by Richard Seltzer, seltzer@samizdat.com, www.samizdat.com
Reprinted with permission from Internet Search Advantage, ZD Journals. http://www.zdjournals.com
Did you know that you can check your own site, your partners' sites, and your competitors' sites using AltaVista Search? That you can use LiveTopics to view and compare your findings? And that you can hone your skills at recognizing and interpreting subtle differences among your search results? When you search the Web with AltaVista Search, you're not searching the target sites directly. Rather, you're looking through the AltaVista Search index. In many cases, the index will be complete and current enough to validate your conclusions. But you should make every effort to confirm that this is the case.
Before writing any competitive/comparative report based on AltaVista Search research, be sure to take a good look at the sites themselves. It's important to understand the limits of the measuring instrument you're using.
First, check for completeness. When you search for host:yoursite.com or host:competitor.com, AltaVista Search returns the number of matches from the index. If the number is relatively small (fewer than 1,000), the count should be pretty accurate. Perhaps one of your main initial conclusions is that your own site or your competitor's site has relatively few Web pages. If so, you should go to the site in question. Follow all the local links you find on the home page, all the links you find on those pages, and so on, taking notes as you go along. Notice if there are significantly more pages at the site than AltaVista Search says there are. If so, you'll need to look more closely to determine why.
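The manual walk described above (follow the local links, then the links on those pages) can be partly automated. The following is a minimal sketch, not a full crawler: it reads the HTML of one page and lists the links that stay on the same host. The names LocalLinkCollector, local_links, and extra_pages_found, and the example.com-style addresses, are hypothetical and used only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, urldefrag


class LocalLinkCollector(HTMLParser):
    """Collects links that stay on the same host as the page being read."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc.lower()
        self.local_links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(self.page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc.lower() == self.host:
            self.local_links.add(absolute)


def local_links(page_url, html_text):
    """Return the sorted same-host links found in one page's HTML."""
    collector = LocalLinkCollector(page_url)
    collector.feed(html_text)
    return sorted(collector.local_links)


def extra_pages_found(pages_found, index_count):
    """How many more distinct pages the walk found than the index reported."""
    return len(set(pages_found)) - index_count
```

Applying local_links to each newly found page in turn, and keeping a set of pages already visited, gives the page count to compare against the number of matches reported for a host: query.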
Second, make sure the information is current. Perhaps one of your preliminary conclusions is that information at a target site is stale or inaccurate. You checked the dates shown with each entry in the list of matches, and they were all several months old and, in some cases, a year old. Or, perhaps you used Advanced Search and limited your query not only to the target site but also by range of dates and found lots of old pages. If that's the case, choose several pages with the oldest dates and click to connect to them directly. Then, with your browser, check each page's Document Info. Are the dates of the actual pages significantly more recent than the dates listed in the AltaVista Search index?
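The date comparison above can be expressed as a small calculation. This sketch assumes two things not stated in the article: that the date in the results list has been noted as YYYY-MM-DD, and that the date shown in the browser's Document Info corresponds to the page's HTTP Last-Modified header. The function name days_index_is_behind is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def days_index_is_behind(index_date, last_modified_header):
    """Days by which the live page's date is newer than the index date.

    index_date: date noted from the results list, as YYYY-MM-DD (assumed format).
    last_modified_header: the HTTP Last-Modified value of the live page.
    A positive result means the index holds an older copy of the page.
    """
    indexed = datetime.strptime(index_date, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    modified = parsedate_to_datetime(last_modified_header)
    if modified.tzinfo is None:
        # Treat a header without a usable zone as UTC so the subtraction works.
        modified = modified.replace(tzinfo=timezone.utc)
    return (modified - indexed).days
```

A large positive value for several pages suggests that the site is fresher than the index shows, and that a conclusion about stale content should wait until the pages have been resubmitted.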
Keep in mind that while there are tens of millions of pages in the AltaVista Search index and while Scooter, the AltaVista Search spider, is continually gathering new and updated information with a thousand threads going simultaneously, the index is never absolutely complete and current. It's an excellent approximation, but it's not perfect. If the AltaVista Search index is your measuring instrument, avoid drawing conclusions that are more precise and subtle than is appropriate, given the instrument's accuracy.
So what can you do if the AltaVista Search index indicates that a competitor's or partner's site has only three pages, and you find that it actually has at least a dozen? Jot down the URLs of the missing pages. Go to the AltaVista Search page and, at the bottom of the page, click Add URL. Then enter and submit the URL of each missing page. By the next day, the new information should be in the index, and you can take another close look and revise your analysis. You can do the same thing to add new versions of pages to the AltaVista Search index, substituting today's content for information that may have been gathered six months ago.
You don't need special authority to add URLs to AltaVista Search's index. When you enter an address, the crawler immediately fetches the page in question, and the index is updated overnight based on the information found. By adding URLs, you're helping to improve the index for the benefit of all.
If the number of new or updated pages is more than about a dozen, the fix may not be quick and simple. Sooner or later, you'll get a message that too many pages from that site have been entered. Because of abuses by a few users, the developers of AltaVista Search have imposed limits. Some of these abuses have been simply malicious. Some individuals find spamming an index (automatically feeding useless information into an index) a technical challenge as well as a compulsion. As with creating a virus, they do it just for the sake of proving they can, with no concern for the effect of their actions. In other cases, businesses have tried a variety of tricks they mistakenly believed would give them an unfair advantage, resulting in their pages coming out higher on results lists than their competitors' pages.
If you need to add several dozen URLs to the index, you might want to enter only a dozen each day. If you need to add hundreds or thousands of URLs, the best you can do (under today's limits) is to enter the URLs for the home page and pages that are entry points for significant branches of the target Web site. When the crawler fetches a page, it sends the full text to the indexer and simultaneously captures the list of hyperlinked URLs for later exploration. Eventually, the crawler will fetch and add those pages as well.
You should be aware that some sites contain features that prevent AltaVista Search from indexing their contents. In those cases, the data you gather for comparative purposes may be woefully incomplete. If the barrier is at your own site, you may be able to convince your Webmaster to change the basic design of the pages or to create a duplicate set of plain, indexable pages. If the barrier is at a partner's site, you may want to encourage the Webmaster to allow AltaVista Search to index the site so that people will be aware that it exists and that it contains valuable business information. If it's at a competitor's site, you should simply give up on the idea of comparative analysis based on AltaVista Search and be content with the knowledge that your competitors are missing important traffic.
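The dozen-a-day approach to adding URLs, mentioned above, amounts to splitting a list into daily batches. Here is a minimal sketch; the function name daily_batches is hypothetical, and the default of 12 simply mirrors "about a dozen" rather than any documented limit.

```python
def daily_batches(urls, per_day=12):
    """Split a list of URLs into consecutive batches of at most per_day items."""
    if per_day < 1:
        raise ValueError("per_day must be at least 1")
    return [urls[i:i + per_day] for i in range(0, len(urls), per_day)]
```

Putting the home page and the main entry-point pages at the front of the list means the first day's batch covers the pages from which the crawler can find the rest on its own.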
The indexing barriers you'll encounter include registration, frames, databases, dynamic pages, and text in Acrobat or PostScript. Some sites require users to complete a form before they can advance to the real content. In some cases, the purpose is simply to capture information about users because access to the site is free. In other cases, the site owner is charging membership/subscription fees to get to the content, and passwords are necessary to get in.
Web crawlers, like Scooter, are dumb robots. Because they can't fill out forms or supply passwords, they can't access sites containing indexing barriers. Normally, such sites will allow crawlers to index their home pages but none of the remaining content. If you can reach the other pages at the site directly by entering their URLs, you can add those pages to the AltaVista Search index manually using Add URL. But if all the pages are password-protected, you can't index them. That's a tradeoff that the site owner is making (consciously or unconsciously). The owner gets user information and/or subscription fees but misses the traffic that would otherwise find the site through a search engine like AltaVista Search.
Some commercial sites use frames, where certain information (typically the company logo, site-navigation buttons, and/or banner ads) remains constant around the outside of the page, while the real content appears in a rectangular box. Unfortunately, Web crawlers see only the information on the outside of the frame, not the content inside the box. Since some browsers can't view frames, some sites provide both non-frames and frames versions of their pages. AltaVista Search can index the non-frames version.
On some sites, the information that appears inside the frame actually consists of plain HTML pages, each with a separate URL. You can add those pages to the index using Add URL. Other sites have much of their content stored in databases, rather than in HTML pages. Once again, crawlers can't fill out forms and hence are stopped in their tracks by this approach to presenting information. In some cases, the database is essential and, by entering queries, the user can generate unique reports. In other cases, the underlying information is actually plain text, such as resumes and job listings, which could just as well be presented as ordinary text. You can add a duplicate of that information, presented as plain text Web pages, to the index.
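For the framed sites described above, one way to find the plain HTML pages behind a frameset is to read the src attribute of each frame tag in the page source. The sketch below does only that, and also notes whether the page offers a noframes alternative. The class name FrameSourceCollector, the helper frame_pages, and the partner.example address are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSourceCollector(HTMLParser):
    """Records the pages loaded into frames and whether <noframes> is present."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.frame_urls = []
        self.has_noframes = False

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            src = dict(attrs).get("src")
            if src:
                # Frame sources are often relative, so resolve them.
                self.frame_urls.append(urljoin(self.page_url, src))
        elif tag == "noframes":
            self.has_noframes = True


def frame_pages(page_url, html_text):
    """Return (list of frame page URLs, True if a noframes section exists)."""
    collector = FrameSourceCollector(page_url)
    collector.feed(html_text)
    return collector.frame_urls, collector.has_noframes
```

Each URL returned is a candidate for Add URL, provided it opens as an ordinary page when entered directly in a browser.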
Some of the more advanced commercial sites today offer various kinds of "dynamic" pages, where users are presented with new material each time they visit a site. (Sometimes, the material is based on a user profile or on "cookies," code that alerts the site about what the person did on his or her last visit to the site.) Such a site potentially could serve up an infinite number of unique Web pages. (This explains why it's impossible to answer the question, how many pages are there on the Web?) It's basically useless to index such pages, since only rarely or randomly will you ever see the same page twice. If such a site has core content that you can present in static pages, it may be wise to do so and make sure AltaVista Search indexes those pages.
Also, while Acrobat and PostScript formats give the information provider precise control of the page layout, AltaVista Search can't decode and index information in those formats. (However, tools exist for indexing such material on an intranet, using a commercial version of AltaVista Search. For more details, check the AltaVista software site at http://www.altavista.software.digital.com.)
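Before submitting a list of URLs, it can help to set aside the Acrobat and PostScript files, since they won't be indexed. This is a rough sketch that goes by file extension alone (.pdf and .ps), which is only a hint about the real format; the function name split_by_format is hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Extensions conventionally used for Acrobat and PostScript files.
UNINDEXABLE_EXTENSIONS = {".pdf", ".ps"}


def split_by_format(urls):
    """Return (likely indexable URLs, URLs skipped as Acrobat or PostScript)."""
    indexable, skipped = [], []
    for url in urls:
        # Look only at the path, so query strings don't hide the extension.
        extension = posixpath.splitext(urlparse(url).path)[1].lower()
        if extension in UNINDEXABLE_EXTENSIONS:
            skipped.append(url)
        else:
            indexable.append(url)
    return indexable, skipped
```

The skipped list is also a useful reminder of which content at a site is invisible to the index when you compare page counts.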
If you use AltaVista Search to compare Web sites, make sure the sites' index information is complete, current, and comparable. And if you can improve the information in the index by using Add URL, or if you can persuade the Webmaster to alter pages in order to make them indexable, do so. Improve the accuracy of your search instrument before making your measurements and doing your analysis.
As you try these techniques, please let us know about your successes and failures. Send us your tips, the creative approaches you've tried, and your questions. You can reach the author directly at seltzer@samizdat.com.