Advice about Advanced (at AltaVista)

By Richard Seltzer, seltzer@samizdat.com, www.samizdat.com


Reprinted with permission from Internet Search Advantage, ZD Journals. http://www.zdjournals.com




Most people who use Advanced Search at AltaVista presume that it's really the same as Simple Search -- that while it uses different syntax for the queries (with Boolean operators), it should provide the same results for equivalent queries. This misconception can lead to confusion, and divert you from taking full advantage of the power of this alternative search mode.

Yes, queries in Simple Search (the default method of searching at AltaVista) have their "equivalents" in Advanced Search. For instance, the Simple query "new york" baseball translates to

"new york" OR baseball

and the Simple

+"new york" +baseball

would read in Advanced

"New York" AND baseball

And, yes, both Simple and Advanced use the same underlying full-text index. But "equivalent" queries normally produce very different results. You are likely to see 1) a different set of matches on your first screen and 2) different counts of matches.
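The translation between the two syntaxes can be sketched in a few lines of Python. This is only an illustration of the two cases shown above (all terms optional, or all terms required with +); the function name simple_to_advanced is hypothetical, and the sketch deliberately refuses mixed queries, which this article does not cover.

```python
import re

# A term is an optional leading "+", then either a quoted phrase or a bare word.
TERM = re.compile(r'(\+?)(?:"([^"]+)"|(\S+))')

def simple_to_advanced(simple_query):
    """Translate a Simple query into the Advanced form used in the examples above."""
    parts = TERM.findall(simple_query)
    if not parts:
        raise ValueError("empty query")
    required = [plus == "+" for plus, _, _ in parts]
    if all(required):
        operator = " AND "   # every term carries "+": all must match
    elif not any(required):
        operator = " OR "    # no term carries "+": any may match
    else:
        raise ValueError("mixed required/optional terms are outside this sketch")
    terms = []
    for _, phrase, word in parts:
        # Keep phrases quoted so they remain phrases in Advanced.
        terms.append(f'"{phrase}"' if phrase else word)
    return operator.join(terms)

print(simple_to_advanced('"new york" baseball'))    # "new york" OR baseball
print(simple_to_advanced('+"new york" +baseball'))  # "new york" AND baseball
```

The output is logically equivalent to the input, which is exactly why the differing results described next can come as a surprise.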

Ranking

In Simple, there is only one form to fill out -- the query box. In your results list, documents in which the query words and phrases appear in the HTML title rank higher than those in which they do not. Likewise, documents in which those words and phrases appear in the first couple of lines of text rank higher than those in which they are buried deeper. Logically, those are the documents where those terms are likely to be an important element of the content, rather than a random occurrence.

In Advanced, there is a "ranking" box as well as a query box. Most people ignore the ranking box, but it's a very powerful tool.

You should enter in the query box the words and phrases that you are looking for. And you should enter in the ranking box the terms you would expect to see on the pages that would be of most interest and value to you -- the pages that you want to appear near the top of the list of matches.

For instance, in Advanced Search, you can enter

recipe*

in the query box, and in the ranking box enter a list of everything you happen to have in your refrigerator. That query will give you a list of pages that include the word "recipe" or "recipes," with the pages that mention all or most of those items at the top of the list.
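To make the division of labor between the two boxes concrete, here is a toy model in Python. It is not AltaVista's ranking algorithm, which this article does not describe; it simply assumes that the query decides which pages match and that the number of ranking terms on a page decides the order. The function rank_matches and the sample pages are invented for illustration.

```python
def rank_matches(pages, query_prefix, ranking_terms):
    """Keep pages with a word starting with query_prefix (like recipe*),
    then order them by how many ranking terms each page mentions."""
    prefix = query_prefix.lower()
    wanted = {term.lower() for term in ranking_terms}
    scored = []
    for title, text in pages.items():
        words = set(text.lower().split())
        if not any(word.startswith(prefix) for word in words):
            continue  # the query box decides what matches at all
        score = len(wanted & words)  # the ranking box decides the order
        scored.append((score, title))
    # Highest score first; ties broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]

# Hypothetical pages standing in for the index.
pages = {
    "Omelet": "an easy recipe with eggs cheese and butter",
    "Pancakes": "family recipes using eggs milk flour",
    "Car repair": "how to change a tire",
    "Toast": "a recipe for toast with butter",
}
ranked = rank_matches(pages, "recipe", ["eggs", "cheese", "butter"])
print(ranked)  # ['Omelet', 'Pancakes', 'Toast']
```

Note that the ranking terms never exclude a page: "Pancakes" mentions only one ingredient and still appears, just lower down.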

Note how very different this is from using a cookbook or other reference work, or from searching for information at a library or using an Internet directory, like Yahoo. Here you don't need to know the categories of the field you are dealing with. In this case, you don't need to know what any of these dishes might be called. You just need to know the ingredients.

Similarly, you can enter

resume

in the query box, and in the ranking box enter the qualifications and experience that you are looking for in someone you'd like to hire.

That query will give you a list of pages that include the word "resume", with the pages that mention all or most of the ranking words at the top of the list.

For instance, I recently received a question from a reader who works as a health officer in a government hospital in India. His current job title is "multipurpose health supervisor/worker." He wants to find a similar job in Australia, Canada, or the U.S. He asked me for advice on constructing a query at AltaVista. I suggested that he use Advanced Search and enter in the query box

job AND health

and enter in the ranking box

Australia Canada America US USA supervisor multipurpose

There is no way to achieve these results in Simple Search. And there is no way to do "equivalent" searches in both Simple and Advanced -- even though the queries may be logically equivalent -- because the ranking will be very different. You can't control ranking in Simple Search (it's all automatic), and in Advanced Search, if you leave the ranking box blank, your results come back in random, unranked order. So in most cases, you will get different lists of matches for "equivalent" queries.

Counts

If the query is the same and the underlying index is the same, even if the ranking mechanism is different, you might expect that the number of matches for "equivalent" queries would be the same. But no, that is not the case.

At AltaVista, the count is only meant as an estimate -- to tell the user the order of magnitude of the matches. The system wasn't designed to provide exact counts, but rather to provide search results and to serve tens of millions of people searching through tens of millions of documents, and give answers in a few seconds. Since the ranking algorithm generates estimates of the count, that information is provided as a convenience; but in allocating system resources, the count is not given priority. And since the count is a by-product of the ranking, the different ranking approaches in Simple and Advanced usually produce different estimates.

Similarly, if you do the same search in Simple on different days, or the same search in Advanced on different days, you may see very different count estimates (probably within a factor of two of one another), even if the contents of the index did not change significantly in the interim.

In more technical terms, if (as is common) the number of matches is very large, AltaVista estimates the total number of matches based on partial results.

And if the system is busy, it may even cut the estimate short, in order to use its resources to answer new queries. So don't be surprised if the same query results in 2.5 million matches one time, 2.0 million the next, and 3.0 million the next. Statistically, those estimates are all in the same ballpark; and in any case, you only see the top matches on your screen.
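The general idea of estimating from partial results can be illustrated with a small Python sketch. AltaVista's actual method is not described here, so treat this as an assumption-laden toy: it scans only the first part of a made-up index and scales the hit count up to the full size. The name estimate_total_matches and the sample data are hypothetical.

```python
def estimate_total_matches(documents, matches, sample_size):
    """Scan only the first sample_size documents and extrapolate to the whole index."""
    sample = documents[:sample_size]
    if not sample:
        raise ValueError("nothing to sample")
    hits = sum(1 for doc in sample if matches(doc))
    # Scale the partial hit count up to the full index size.
    return round(hits * len(documents) / len(sample))

# A made-up index of 1000 documents in which exactly 250 match.
documents = list(range(1000))
is_match = lambda doc: doc % 4 == 0

print(estimate_total_matches(documents, is_match, 10))    # 300 (scan cut short)
print(estimate_total_matches(documents, is_match, 100))   # 250
print(estimate_total_matches(documents, is_match, 1000))  # 250 (full scan)
```

The shorter the scan, the rougher the estimate, which is consistent with counts that drift when the system is busy.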

In Advanced Search, you can choose "As a Count Only." Because you aren't asking to see the results list, the system can afford to devote more resources to provide you with a more accurate count. But at busy times, for load balancing, the system might still cut the count short. For the most accurate possible estimate, connect at a relatively slow time (when people in California are sleeping).

Getting all the results

In Simple Search, while you may be told that there are over a million matches to your query, you only see a list of 10 of them.

You can click at the bottom of the screen to see the next 10, and the next 10. But after you have seen 20 pages of results (200 items), you don't get any more. That meets the needs of well over 99% of the users. But there are some instances when you want all results and don't care about the ranking order.

For instance, you might want to know about all the Web pages that have hyperlinks to your pages. In that case, every such page is of value to you, and ranking is irrelevant.

To see all the results, go to Advanced Search, enter your query in the query box (e.g., link:digital.com AND NOT host:digital.com which asks for all pages with hyperlinks to Digital that are not at Digital's site) and leave the ranking box blank. Then the results appear in random order (with no ranking), but you can keep clicking to get more and more of the results. I've done that looking for links to my own site (http://www.samizdat.com), and have seen hundreds of results.
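If you check inbound links for several sites, the query above is easy to generate. The following Python helper is a minimal sketch; the function name inbound_links_query is made up, and it only builds the query text using the link: and host: keywords shown in the example.

```python
def inbound_links_query(domain):
    """Build the Advanced query for pages that link to a domain from outside it."""
    domain = domain.strip().lower()
    if not domain or " " in domain:
        raise ValueError("expected a bare host name such as digital.com")
    return f"link:{domain} AND NOT host:{domain}"

print(inbound_links_query("digital.com"))  # link:digital.com AND NOT host:digital.com
```

Paste the result into the query box and leave the ranking box blank, as described above.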

It is possible that you may want to see thousands of results. In that case, you are likely to run into a limit at some point. (Exactly where that limit is may depend on available system resources at the time you make your query.) After all, this is not what the system was designed for. Maybe once in over a billion times will someone need to see thousands of matches. But still you can get what you want by using a workaround.

Advanced Search lets you constrain your search by date -- by the date when pages were last modified. So depending on the magnitude of what you need, limit the search first to one year and then another, or first to one month and then another, or even search day by day -- seeing and saving all the matches for each time-frame increment.
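Working through the increments by hand gets tedious, so it can help to list the date ranges first. The Python sketch below only generates month-by-month windows; the function name monthly_windows is hypothetical, and typing each pair of dates into the Advanced Search date fields is still up to you.

```python
from datetime import date, timedelta

def monthly_windows(start, end):
    """Yield (first_day, last_day) pairs covering start..end one calendar month at a time."""
    current = start
    while current <= end:
        # Find the first day of the following month.
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        # The window ends on the last day of this month, or at `end` if that comes sooner.
        last_day = min(next_month - timedelta(days=1), end)
        yield current, last_day
        current = next_month

windows = list(monthly_windows(date(1998, 11, 15), date(1999, 1, 10)))
for first, last in windows:
    print(first.isoformat(), "to", last.isoformat())
# 1998-11-15 to 1998-11-30
# 1998-12-01 to 1998-12-31
# 1999-01-01 to 1999-01-10
```

If one month still returns more matches than you can page through, the same approach works with shorter windows, down to a single day.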

Note

As you try these techniques, please let me know about your experience. Send me your tips -- the creative approaches you have tried -- and also your questions. Let's share and learn from one another. You can reach me directly at seltzer@samizdat.com.



Richard Seltzer is an independent Internet writer/speaker/consultant. Send email to seltzer@samizdat.com

