Taking "snapshots" with LiveTopics: Watch out for mirages

by Richard Seltzer, seltzer@samizdat.com, www.samizdat.com

Reprinted with permission from Internet Search Advantage, ZD Journals. http://www.zdjournals.com




ALERT: LiveTopics was a powerful feature once offered through AltaVista. It was later renamed "Refine" and is no longer available. The project, code-named "Cow9," was a collaborative effort between researchers at Digital Equipment Corporation and François Bourdoncle of Ecole des Mines de Paris, www.ensmp.fr. The underlying technology has enormous potential. This article will give you a sense of how it can be used. If this topic interests you, you should also check the slides beginning at www.samizdat.com/script/lt1.htm and the related articles.

To simplify finding anything on the Internet, LiveTopics (beta) generates categories on the fly and provides a graphical view of those categories. As a result, you get more than just pointers to nuggets of information--you get information about information, an overview of what material is available, and some clues as to how pieces of data relate to one another. The results appear in a flash and, in that moment of enlightenment, it's tempting to jump to sweeping conclusions. But watch out! The quality of your results depends entirely on the quality of your query, and if you aren't careful, you could wind up eating a very large slice of humble pie.

The LiveTopics terms are based on statistics, not on some profound understanding of the meaning of words. Fallacies can arise from the particular words you use, the typography of your query, or even the language you're using. But, if you're careful, the results can be truly amazing.

The entire Web in one glance

If you need to explain the Internet to someone who is unfamiliar with it or if you simply want to wow a friend with your vast knowledge, go to AltaVista Search and enter the query +*. You'll receive more than 30 million matches--every page in the AltaVista Search index, which is the entire public Web. Figure A shows the result of this query. The asterisk (*) is a wildcard, standing for any unknown letter or letters. And the plus sign (+), in this particular instance, forces AltaVista Search to make an exception to its general rule that three characters must precede such a wildcard, which can stand for up to five characters. (The developers simply couldn't resist the temptation to show what this system can do.)
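The general wildcard rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as this article states it (at least three characters before the asterisk, which then stands for zero to five further characters). The function name wildcard_to_regex, the use of \w as the matching alphabet, and the case-insensitive matching are assumptions made for the sketch, not AltaVista's implementation, and the special +* exception is not modeled.

```python
import re

MIN_PREFIX = 3  # characters required before the asterisk (per the article)
MAX_EXTRA = 5   # most characters the asterisk can stand for (per the article)


def wildcard_to_regex(term):
    """Translate a term such as 'god*' into a compiled regular expression.

    Hypothetical helper: it models only the trailing-asterisk rule and
    ignores case for simplicity.
    """
    if not term.endswith("*"):
        return re.compile(re.escape(term), re.IGNORECASE)
    prefix = term[:-1]
    if len(prefix) < MIN_PREFIX:
        raise ValueError("a wildcard needs at least 3 leading characters")
    # The asterisk may stand for nothing at all, or for up to MAX_EXTRA characters.
    return re.compile(re.escape(prefix) + r"\w{0,%d}" % MAX_EXTRA, re.IGNORECASE)


pattern = wildcard_to_regex("god*")
for word in ["god", "gods", "goddard", "godfathers"]:
    # fullmatch requires the whole word to fit the pattern
    print(word, bool(pattern.fullmatch(word)))
```

Under these assumptions, "godfathers" falls outside the pattern because seven characters follow the prefix, while "goddard" fits, which is the kind of unintended match that a broad wildcard lets in.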

Figure A: Using the search query +*, you'll get 30 million hits.

Now, click LiveTopics' Way-cool Topics Map! link. You'll see 20 words in rectangular boxes that are connected with lines. Those are the top 20 categories of information on the Web--you're looking at a picture of the entire Web.

Keep in mind that these categories are created statistically, live, on-the-fly, and based on the information in the AltaVista Search index. There's no human bias. No one is making assumptions about the content. This isn't a list of categories, like a library's Dewey decimal system, based on human judgment and subject to obsolescence. This is a view of today's Web content based on the information currently in the AltaVista Search index. These are the significant words (not articles like the or auxiliary words) that happen to appear most frequently on the Web. A line connecting terms means the documents containing the one word often contain the connected word as well.

Now, click the Topic Words tab. You'll see a list of the 20 terms--the words in bold--and, next to each term, words that appear most frequently on pages that also contain the main term.
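To make the idea of statistically generated categories concrete, here is a toy sketch in Python. The article does not describe the actual LiveTopics algorithm, so this assumes the simplest possible approach: count how many pages each significant word appears on, keep the most widespread words, and count how often pairs of those words share a page. The stop-word list, the sample pages, and the function names are all invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Assumed, deliberately tiny stop-word list (articles, auxiliaries, and so on).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in", "for"}


def significant_words(page):
    """Return the distinct non-stop-words on one page."""
    return {w for w in page.lower().split() if w not in STOP_WORDS}


def topic_map(pages, top_n=3):
    """Pick the top_n most widespread words and count their co-occurrences."""
    doc_freq = Counter()   # word -> number of pages containing it
    pair_freq = Counter()  # (word_a, word_b) -> number of pages containing both
    for page in pages:
        words = significant_words(page)
        doc_freq.update(words)
        pair_freq.update(combinations(sorted(words), 2))
    # Sort by descending page count, breaking ties alphabetically.
    top = sorted(doc_freq, key=lambda w: (-doc_freq[w], w))[:top_n]
    # An "edge" is a pair of top words that appear together on some page.
    edges = {pair: n for pair, n in pair_freq.items()
             if pair[0] in top and pair[1] in top}
    return top, edges


pages = [
    "students learn in the school",
    "faculty and students of the school",
    "the web server is online",
    "students learn online",
]
top, edges = topic_map(pages)
print(top)    # the most widespread words
print(edges)  # how often each pair of top words shares a page
```

In this sketch the top words play the role of the boxed terms, and each entry in edges plays the role of a connecting line. Nothing in it understands meaning; it only counts, which is why the results depend so heavily on the query that selects the pages.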

So what's the Internet about (at this broadest level of abstraction)? Lots of the main terms relate to education (school, students, learn, faculty) and research--one of the original intentions for the Internet. You also see terms relating to the underlying technology and how it's deployed (Internet, web, server, online). Business is there as well, with the words opportunities, programs, businesses, career, employment, assistance, and job, which seem to indicate lots of use for hiring and finding jobs. And professionals appears as well, with words relating to membership in professional associations.

Under the topic viewed, you see netscape and navigator, and also microsoft and explorer, perhaps indicating some rough parity between the two major Web browsers and the companies that provide them. What you don't see is also telling. For instance, the word sex doesn't appear, nor does anything related to shopping, online sales, or electronic commerce.

Note also that all the terms and words are in English. That isn't because of a limitation of LiveTopics: the AltaVista Search index contains all text, regardless of language, and LiveTopics uses that entire index. This is simply testimony to the dominance of the English language on the Web today.

You might want to perform this test every month or two to get a sense of how the Web is evolving and how quickly it's doing so. This technique is a good reality check. Why believe the opinions of reporters when you can look at the entire phenomenon yourself, with current data, any time you want and in a matter of seconds?

An example

Let's try an example just for fun. In a Simple Search field, enter the query God. You might be tempted to try god* to capture gods as well as god. But if you take that approach, you get about 100,000 irrelevant results--such as goddard--where god is just the first syllable. You might also be tempted to enter the query with all letters in lowercase to be sure to capture all instances of either god or God. But that, too, yields about 100,000 extra results, many of which are instances where the word is used in a non-religious sense.
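The capitalization behavior described above can be modeled with a short Python sketch. It assumes the rule the paragraph implies: an all-lowercase query term matches any capitalization, while a term containing a capital letter must match exactly. The function name term_matches and the sample word list are hypothetical.

```python
def term_matches(query_term, word):
    """Model the case rule implied by the text.

    An all-lowercase query term matches any capitalization of the word;
    a query term containing capitals must match the word exactly.
    """
    if query_term.islower():
        return query_term == word.lower()
    return query_term == word


words = ["God", "god", "GOD", "goddard"]
print([w for w in words if term_matches("God", w)])  # ['God']
print([w for w in words if term_matches("god", w)])  # ['God', 'god', 'GOD']
```

The capitalized query is the narrower of the two, which is why it leaves out pages that use the word only in lowercase.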

Since we're trying for a broad approximation, God is most likely to give us the best results. This query yields more than 800,000 matches. Click the Way-cool Topics Map! link and let's see our search results.

The terms all interconnect and seem to relate to Judeo-Christian religion, with a heavy emphasis on Christian. It's interesting to see that words most frequently associated with God include truth, love, glory, and blessed. Click the Topic Words tab. At the next level, as you can see in Figure B, under the topic Revelation, there are numerous terms related to the Moslem religion (islam, muslims, muslim). So our first impression, that these topics all relate to Judeo-Christian religion, was inaccurate.

Figure B: Our Topic Words screen reveals a few words associated with the Moslem religion.

Once again, all the words are in English. On the one hand, that would be expected because of the strong English-language bias of the Internet today. But remember, you stacked the deck by using an English word as your query word. If you wanted to capture content in other languages, you'd have to add Dieu, Gott, and so forth to your query. Adding Yahweh, Allah, and Buddha would also broaden the coverage of non-Christian religions.

But even using this crude one-word query approximation, we'll probably see major changes in our findings over time, as the audience on the Internet expands and becomes much more diverse and international.

Portraits of countries

Using the command domain: in LiveTopics lets you create portraits of Web activity in particular countries. For instance, domain:au will match pages on Web servers in Australia; domain:uk matches the United Kingdom; domain:fr matches France; domain:co matches Colombia, and so forth. These domain-name suffixes are part of the basic naming structure of the Internet.

Suppose your company is interested in doing business in Italy. A search for domain:it will yield about 800,000 matches. The Topic Words screen shows that 19 of the 20 categories are in the Italian language. (Local language is beginning to be very important.) The other category--Italian--appears to be in English. And in the Topic Graph view, the pages fall into two large clusters. If you understand Italian, and have the time and inclination, you might be able to derive some useful conclusions about the underlying cause of that divide.

These kinds of results make it very tempting to read more into the statistics than is warranted. For instance, you might want to check the number of matches for a range of countries, record the results, and do the same queries at fixed intervals to track the relative growth of the use of the Web in these various countries.

But several factors will make such an approach highly questionable. First, the Web site of a company in a given country doesn't need to use that country's domain suffix in its Web address. Some companies go out of their way to obtain domain names that are .com (commercial) without any indication of the countries in which they operate. And some companies in countries that have poor Internet access have their pages hosted on Web servers in other countries. For example, Internet lines to Colombia are few and slow, and Colombia's infrastructure is U.S.-centric. When you're sending E-mail from one part of Bogota (the capital) to another, the message has to pass through systems in California. So, most Colombian companies doing business on the Internet today do so by way of Web servers in the U.S. That means that a search for domain:co would likely yield misleading and incomplete results.

If you do a search for domain:zw (Zimbabwe) and see only six pages in the results list, it would be unwise to jump to the conclusion that companies there don't use the Internet. Rather, it's likely that many have non-country specific domain names or are hosted elsewhere. That kind of result should be the beginning of a more detailed and probing series of searches, rather than the conclusion.
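The limitation is easy to see in a small Python sketch that classifies addresses by their final domain label, which is all a domain: query can go by. The host names below are made up for illustration, the suffix table covers only the countries mentioned in this article, and the function name country_from_url is hypothetical.

```python
from urllib.parse import urlparse

# Country suffixes mentioned in this article (not a complete table).
COUNTRY_SUFFIXES = {
    "au": "Australia",
    "uk": "United Kingdom",
    "fr": "France",
    "co": "Colombia",
    "it": "Italy",
    "zw": "Zimbabwe",
    "to": "Tonga",
}


def country_from_url(url):
    """Return the country implied by the host's last label, or None.

    None means the suffix (for example .com) says nothing about the country.
    """
    host = urlparse(url).hostname or ""
    suffix = host.rsplit(".", 1)[-1].lower()
    return COUNTRY_SUFFIXES.get(suffix)


# Hypothetical addresses: the second could belong to a Colombian company
# hosted in the U.S., and a domain:co search would never see it.
for url in ["http://www.example.co/", "http://www.example.com/bogota/"]:
    print(url, "->", country_from_url(url))
```

Any page that comes back as None here is invisible to a country portrait, no matter where the company behind it actually operates.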

Also, recent reports indicate that, in response to the growing difficulty in reserving domain names that match your company or product name, the country of Tonga (the Friendly Islands in the South Pacific), with the suffix .to, is going into the business of selling domain names to any and all comers. Tonga hopes to become for domain names what Delaware is for incorporation and what Panama is for ship registration. If Tonga is successful, it will further blur the focus of country portraits in LiveTopics. But, for the moment, that's not a factor.

Beware of the counter

When AltaVista Search first came out, some ambitious, creative, and enthusiastic reporters went wild over the statistics. Publications like The New York Times and The Wall Street Journal ran articles based on the numbers of matches that AltaVista Search found to various queries, rather than the information it led them to. For instance, they compared the number of matches for Jesus Christ versus John Lennon as if it were a competition and drew broad and amusing conclusions from the results. Someone even wrote a program to query AltaVista Search each day for Windows 95 and automatically generated a graph of the counts, day by day, to document the spread of that operating system.

But the counts that AltaVista Search provides are only rough approximations. The main intent of the service is to help you find information on the Internet, not to provide statistics about the Internet and its content. The ranking algorithm that determines which items appear high on the results list requires a rough count, so such a count is generated. But it's not meant to be precise. The vast majority of users look only at the first screen or two of results, and striving to achieve scientific accuracy in the count would simply slow down the system, making it far less useful for everyone.

Now, go to the Advanced Search page. Click the down arrow next to Standard Form. Then click As A Count Only. If you enter your query now, you'll get a count and only a count--no list of results--and it will be more accurate than what you see on the Simple Search page. Because the system doesn't have to provide a list of results, it can use the extra cycles to provide a better count. But still, the number is only an estimate. If you submit the same query several times in a row or for several days in a row, you'll likely get different counts. The different numbers don't mean that the underlying index has changed or that something is broken. They simply reflect the fact that the counting function was designed only to provide approximations. The smaller the numbers, the more accurate they're likely to be. But if you get counts of more than 10,000, the approximations can vary considerably. Even so, rough counts can be very helpful for broad comparisons, such as quickly checking which variant spelling of a new term that doesn't yet appear in dictionaries is more common. But if the numbers are close, don't lean too hard on them. For example, don't advertise that your Web site has a dozen more hyperlinks to it than your competitors' sites. Don't try to read too much into numbers that were never meant to be precise.
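One way to act on this advice is to compare counts only in relative terms and to ignore small gaps. The Python sketch below does that. The 20 percent threshold is an arbitrary assumption chosen for illustration, not a figure published for AltaVista Search, and the function name clearly_different is hypothetical.

```python
def clearly_different(count_a, count_b, tolerance=0.20):
    """Return True if two approximate counts differ by more than `tolerance`.

    The difference is measured relative to the larger count. The 20 percent
    default is an assumed margin, not a documented error bound.
    """
    larger = max(count_a, count_b)
    if larger == 0:
        return False  # both counts are zero, so there is nothing to compare
    return abs(count_a - count_b) / larger > tolerance


# A gap like 800,000 versus 100,000 is worth noticing ...
print(clearly_different(800_000, 100_000))  # True
# ... but a dozen extra matches out of a thousand is not.
print(clearly_different(1_012, 1_000))      # False
```

The same test applies when you repeat a query over weeks or months: a change that stays inside the margin may be nothing more than the counter's own variation.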

Conclusion

Basically, the more you know about AltaVista Search and about the subject you're researching, the better you'll be able to avoid jumping to false conclusions. This isn't the Holy Grail of information--but LiveTopics can be a very effective tool if you take the time to learn how to use it. As you try these techniques, please let us know about your successes and your frustrations. Send your tips, the creative approaches you've tried, and your questions to seltzer@samizdat.com.


Can we help you build an Internet business? Richard Seltzer is an independent Internet writer/speaker/consultant. For details, send email to seltzer@samizdat.com.
 

