Reprinted with permission from Internet Search Advantage, ZD Journals. http://www.zdjournals.com
To simplify finding anything on the Internet, LiveTopics (beta) generates categories on the fly and provides a graphical view of those categories. As a result, you get more than just pointers to nuggets of information--you get information about information, an overview of what material is available, and some clues as to how pieces of data relate to one another. The results appear in a flash and, in that moment of enlightenment, it's tempting to jump to sweeping conclusions. But watch out! The quality of your results depends entirely on the quality of your query, and if you aren't careful, you could wind up eating a very large slice of humble pie.
The LiveTopics terms are based on statistics, not on some profound understanding of the meaning of words. Fallacies can arise from the particular words you use, the typography of your query, or even the language you're using. But, if you're careful, the results can be truly amazing.
Figure A: Using the search query +*, you'll get 30 million hits.
Now, click LiveTopics' Way-cool Topics Map! link. You'll see 20 words in rectangular boxes that are connected with lines. Those are the top 20 categories of information on the Web--you're looking at a picture of the entire Web.
Keep in mind that these categories are created statistically, live, on-the-fly, and based on the information in the AltaVista Search index. There's no human bias. No one is making assumptions about the content. This isn't a list of categories, like a library's Dewey decimal system, based on human judgment and subject to obsolescence. This is a view of today's Web content based on the information currently in the AltaVista Search index. These are the significant words (not articles like the or auxiliary words) that happen to appear most frequently on the Web. A line connecting terms means the documents containing the one word often contain the connected word as well.
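To make the idea of a statistically generated link concrete, here is a minimal Python sketch of co-occurrence counting over a toy set of documents. It is not LiveTopics' actual method, which this article doesn't describe. The stopword list, the sample documents, and the function names are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stopword list: articles and auxiliary words to ignore.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "at"}

def significant_words(text):
    """Return the set of lowercase non-stopword words in one document."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def top_terms(docs, n):
    """Rank words by how many documents they appear in; ties sort alphabetically."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(significant_words(doc))
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

def cooccurrence(docs, terms):
    """Count, for each pair of terms, the documents that contain both."""
    pairs = Counter()
    for doc in docs:
        present = sorted(significant_words(doc) & set(terms))
        pairs.update(combinations(present, 2))
    return pairs

# Toy documents standing in for indexed Web pages.
docs = [
    "students learn at the school",
    "the school faculty and students",
    "web server online",
    "internet web server",
]
terms = top_terms(docs, 4)
links = cooccurrence(docs, terms)
print(terms)        # ['school', 'server', 'students', 'web']
print(dict(links))  # {('school', 'students'): 2, ('server', 'web'): 2}
```

In this toy data, a map would draw a line between school and students and another between server and web, but none between school and web, because no document contains both.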
Now, click the Topic Words tab. You'll see a list of the 20 terms--the words in bold--and, next to each term, words that appear most frequently on pages that also contain the main term.
So what's the Internet about (at this broadest level of abstraction)? Lots of the main terms relate to education (school, students, learn, faculty) and research--one of the original intentions for the Internet. You also see terms relating to the underlying technology and how it's deployed (Internet, web, server, online). Business is there as well, with the words opportunities, programs, businesses, career, employment, assistance, and job, which seem to indicate lots of use for hiring and finding jobs. And professionals appears as well, with words relating to membership in professional associations.
Under the topic viewed, you see netscape and navigator, and also microsoft and explorer, perhaps indicating some rough parity between the two major Web browsers and the companies that provide them. What you don't see is also telling. For instance, the word sex doesn't appear, nor does anything related to shopping, online sales, or electronic commerce.
Note also that all the terms and words are in English. That isn't because of a limitation of LiveTopics: the AltaVista Search index contains all text, regardless of language, and LiveTopics uses that entire index. It is simply testimony to the dominance of the English language on the Web today.
You might want to perform this test every month or two to get a sense of how the Web is evolving and how quickly it's doing so. This technique is a good reality check. Why believe the opinions of reporters when you can look at the entire phenomenon--with current data--any time you want, in only a few seconds?
Since we're trying for a broad approximation, the one-word query God is most likely to give us the best results. This query yields more than 800,000 matches. Click the Way-cool Topics Map! link to see the search results.
The terms all interconnect and seem to relate to Judeo-Christian religion, with a heavy emphasis on Christian. It's interesting to see that words most frequently associated with God include truth, love, glory, and blessed. Click the Topic Words tab. At the next level, as you can see in Figure B, under the topic Revelation, there are numerous terms related to the Muslim religion (islam, muslims, muslim). So our first impression, that these topics all related to Judeo-Christian religion, was inaccurate.
Figure B: Our Topic Words screen reveals a few words associated with the Muslim religion.
Once again, all the words are in English. On the one hand, that would be expected because of the strong English-language bias of the Internet today. But remember, you stacked the deck by using an English word as your query word. If you wanted to capture content in other languages, you'd have to add Dieu, Gott, and so forth to your query. Adding Yahweh, Allah, and Buddha would also broaden the coverage of non-Christian religions.
But even using this crude one-word query approximation, we'll probably see major changes in our findings over time, as the audience on the Internet expands and becomes much more diverse and international.
These kinds of results make it very tempting to read more into the statistics than is warranted. For instance, you might want to check the number of matches for a range of countries, record the results, and do the same queries at fixed intervals to track the relative growth of the use of the Web in these various countries.
But several factors will make such an approach highly questionable. First, the Web site of a company in a given country doesn't need to use that country's domain suffix in its Web address. Some companies go out of their way to obtain domain names that are .com (commercial) without any indication of the countries in which they operate. And some companies in countries that have poor Internet access have their pages hosted on Web sites in other countries. For example, Internet lines to Colombia are few and slow, and Colombia's infrastructure is U.S.-centric. When you're sending E-mail from one part of Bogota (the capital) to another, the message has to pass through systems in California. So, most Colombian companies doing business on the Internet today do so by way of Web servers in the U.S. That means that a search for domain:co would likely yield misleading and incomplete results.
If you do a search for domain:zw (Zimbabwe) and see only six pages in the results list, it would be unwise to jump to the conclusion that companies there don't use the Internet. Rather, it's likely that many have non-country-specific domain names or are hosted elsewhere. That kind of result should be the beginning of a more detailed and probing series of searches, rather than the conclusion.
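For readers who still want to record country counts at intervals, the bookkeeping might look like the Python sketch below. The counts are invented for illustration and are not real AltaVista figures, the function names are hypothetical, and all the caveats above about misleading domain suffixes still apply to whatever numbers you plug in.

```python
def domain_query(suffix):
    """Build a query in the domain:xx form used in this article."""
    return "domain:" + suffix

def percent_change(earlier, later):
    """Relative growth between two recorded counts, as a percentage."""
    return (later - earlier) / earlier * 100.0

# Hypothetical hand-recorded counts: (earlier count, later count).
recorded = {
    "co": (1200, 1500),
    "zw": (6, 9),
}

for suffix, (earlier, later) in recorded.items():
    print(domain_query(suffix), f"{percent_change(earlier, later):.0f}%")
# domain:co 25%
# domain:zw 50%
```

Note that the 50 percent jump for domain:zw in this made-up data amounts to only three extra pages, which is why a figure like that is better treated as a prompt for further searching than as a finding.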
Also, recent reports indicate that, in response to the growing difficulty in reserving domain names that match your company or product name, the country of Tonga (the Friendly Islands in the South Pacific), with the suffix .to, is going into the business of selling domain names to any and all comers. Tonga hopes to become for domain names what Delaware is for incorporation and what Panama is for ship registration. If Tonga is successful, it will further blur the focus of country portraits in LiveTopics. But, for the moment, that's not a factor.
Now, go to the Advanced Search page. Click the down arrow next to Standard Form. Then click As A Count Only. If you enter your query now, you'll get a count and only a count--no list of results--and it will be more accurate than what you see on the Simple Search page. Because the system doesn't have to provide a list of results, it can use the extra cycles to provide a better count. Still, the number is only an estimate. If you submit the same query several times in a row or for several days in a row, you'll likely get different counts. The different numbers don't mean that the underlying index has changed or that something is broken. They simply reflect the fact that the counting function was designed only to provide approximations. The smaller the numbers, the more accurate they're likely to be; once a count passes 10,000, the approximations can vary considerably. Even rough counts can be very helpful for quickly checking variant spellings of a new term that doesn't yet appear in dictionaries. But if the numbers are close, don't lean too hard on them. For example, don't advertise that your Web site has a dozen more hyperlinks to it than your competitors' sites. Don't try to read too much into numbers that were never meant to be precise.
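The warning about close numbers can be put as a simple rule of thumb. The Python sketch below assumes you have written down several counts for each of two queries by repeating them on different days. The sample figures and the function name are hypothetical, and the range-overlap test is just one way to express the caution, not a feature AltaVista provides.

```python
def clearly_different(counts_a, counts_b):
    """True only if every count for one query exceeds every count for the other."""
    return min(counts_a) > max(counts_b) or min(counts_b) > max(counts_a)

# Hypothetical counts from repeating each query several times.
spelling_one = [41000, 38500, 43200]
spelling_two = [2100, 1950, 2300]
site_a_links = [10480, 10390, 10510]
site_b_links = [10470, 10500, 10420]

print(clearly_different(spelling_one, spelling_two))  # True
print(clearly_different(site_a_links, site_b_links))  # False
```

With these made-up numbers, the two spellings differ by so much that the approximation doesn't matter, while the two link counts overlap and support no claim either way.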
Can we help you build an Internet business? Richard Seltzer is an independent Internet writer/speaker/consultant. For details, send email to seltzer@samizdat.com