This tutorial and the companion piece, below,
"How to use AltaVista Search to Improve Your Web Site," are
based on my research for the book AltaVista Search Revolution
(published by Osborne/McGraw-Hill), my experience with my own
personal Web site, and what I learned doing consulting for
AltaVista. This tutorial also served as the basis for one that I
wrote for the AltaVista site and which they included in their
Advanced Search area.
The tutorial is written in the present tense,
but it dates back to the year 2000.
My goal is not that you remember all the different commands, but rather to give you a sense of the wide range of things that you can do with AltaVista, so you can go to the help files and figure it all out when you need to. If you don't know that you can do it, you never ask the question.
That's the reason for going through this. I'm going to show you some things that you probably never realized were there.
Basically, there are three different ways of searching at AltaVista: Directory, Simple Search, and Advanced Search. There's one directory (an enhanced version of the Open Directory, combined with LookSmart) and a single full-text index, with two ways of moving through it. Which one you choose is mostly a matter of your own personality and style, your own way of looking for things.
Directories, like Yahoo www.yahoo.com, the Open Directory dmoz.org, and LookSmart www.looksmart.com, provide categorized lists of Web sites with a brief description of each site. You can navigate by going from menu to menu, making one selection after another until you finally get to the level where sites of the kind you are interested in are listed. You can also search through the database that contains the descriptions of all of them. The categories and descriptions are based on submissions by Web site owners, scrutinized and edited by professional or volunteer editors.
If you are looking for a summer camp in Maine or a recycling facility in Alabama, a directory is probably a good place to look. You are looking for a general category of things and that's how directories are organized.
While a directory categorizes Web sites and contains very little information about them (just the description), a search engine indexes all the information on all the Web pages it finds.
Directories are crafted by human beings, based on their judgement, like files in a file cabinet. If you happen to think like the people who built a particular directory, you may find it very easy to use; if your mind is organized differently, you may find their approach awkward and difficult to follow.
Search engines don't wait for someone to submit information about a site. Rather, they send out robot programs (called "crawlers") which surf the Internet and bring back the full text of the pages they find. Search engine indexes are generated automatically, based on the words and phrases that are found on Web pages. The largest search engines, like my favorite, AltaVista www.altavista.com, cover over 250 million Web pages. There is no human judgement filtering or rearranging the information before it gets to you. If you know how to use the query language effectively, you can go straight to what you want when you want it.
From the AltaVista home page, when you click on one of the categories, like Business and Finance, you are using an enhanced version of the Open Directory, combined with LookSmart Directory. From there, you can click on an individual topic and go from menu to menu, until you arrive at what you want. If you enter a term in the search box at the top of the page after having clicked on a category, your search is limited to the directory. You can choose to search through just that particular category or all categories, but you are only looking through the descriptions of pages that match those categories. If you opt to search the Web, that takes you to the main AltaVista search index.
Today, the Open Directory boasts that it includes 2,057,399 sites, reviewed by 29,079 volunteer editors, in 312,006 categories.
A directory takes you to the home page of a Web site, from which point you can explore to eventually get to what you want.
A search engine takes you to the very page on which the words and phrases you are looking for appear.
Use a directory when you only have a vague idea of what you want, and when you would appreciate prompts to guide you along.
Use a search engine when your aim is to get to a particular piece of information quickly.
When you want to find a great music site or a site devoted to your favorite kind of movie, use a directory.
When you want to know what song or movie a particular phrase is from, use a search engine.
Use a directory to get a list of major newspaper sites.
Use a search engine to find a quote from a newspaper column, even when you don't know the name of the paper or the columnist.
Use a directory for the kinds of things you'd expect to find in the Yellow Pages -- for businesses of certain kinds when you may not know the names of the businesses.
Use a search engine when you are looking for information about a particular product and know the product name and model number, but may not know the manufacturer.
Use a directory when you are in the mood to surf -- to go on a fun trip around the Web with no particular destination in mind, just following your impulses.
Use a search engine when you are serious and when you have limited time to find what you want.
Use a directory when you are looking for a site devoted to a celebrity.
Use a search engine when you are searching for an ordinary person by name.
Use a directory to find cooking-related sites.
Use a search engine to find a particular recipe, looking for it by name or by its ingredients.
Use a directory to get a list of four-year colleges in Massachusetts.
Use a search engine to find a particular paper written by a professor of anthropology at the University of Massachusetts.
Use a directory to see a list of sites devoted to alternative medicine or to cancer.
Use a search engine to learn more about a medicine your doctor just prescribed for you.
Use a directory to get a list of job-related Web sites.
Use a search engine to find the resume of a job candidate with the credentials and experience you want.
Use a directory to find Web sites dedicated to discussion of great literature.
Use a search engine to find a particular passage in a particular classic work, and perhaps the complete text of that book.
Use a directory to find sites devoted to buying and selling cars.
Use a search engine to find a page that talks about how to deal with the problems you've been experiencing with your 1969 Mustang.
Use a directory to find sites that deal with Windows-based software.
Use a search engine to find out the meaning of a particular error message you've been getting.
Use a directory to find travel guides.
Use a search engine to find the schedule for special trips on steam-engine-powered trains in South Africa.
Use a directory to find Web sites devoted to legal questions related to protection of intellectual property rights.
Use a search engine to find instances of plagiarism of your writing on the Web.
Use a directory to find Web sites devoted to trademark information.
Use a search engine to find out if a particular name, with unique capitalization, which you'd like to make a trademark, is already in use on the Web.
Over time, the information in a directory becomes stale -- sites grow and change and disappear. Gradually, its content becomes like last year's Yellow Pages. The people editing directories typically don't have time to go back and revisit sites after they have been added. Rather, the site owner has to submit changes, which are processed by hand, just like original site submissions. Or a user may alert the editors that a URL is now dead. With its enormous number of volunteer editors, the Open Directory can parcel out responsibility for certain topic areas to certain individuals and try to keep pace, but it's hard for people to keep up with the pace of change and growth on the Web.
Search engines try to include all Web pages that contain real, un-duplicated information, without making value judgements about that information. AltaVista Search keeps its information current by completely rebuilding its index with a "big crawl" every few months, with updating crawls that bring in millions of pages a day, and by immediately responding to specific requests to add particular pages.
To add information to a directory, the site owners have to submit descriptions and have to prove that they are who they are.
When you add your URL (Web page address) at AltaVista, you are simply asking the crawler to fetch a particular page. The crawler brings back whatever text it finds at that address, and it's usually added to the index by the next day. All it knows is what it found on that page, not what anyone told it. Anyone -- not just a site owner -- can submit a URL. You might want to submit a page that has useful information and that you'd like others to be able to find.
If you submit a URL for a page that doesn't exist, the crawler will get Error 404, which means there is no such page (not that the crawler couldn't get there due to some transient problem, but that a page with that address does not exist on that server). Then, if that page was in the index, AltaVista will remove it from the index, typically by the next day. That means that anyone who finds a dead link in a list of matches can help the community of users by submitting that URL to AltaVista, and hence having that page removed from the index.
Louis Monier, the lead developer of AltaVista when it first got started, loved Simple search. He used it all the time.
Advanced search might seem more natural for an engineer. That's where you get to use Boolean logic, with AND, OR, AND NOT so you can very precisely specify what you want.
Simple search is intended so you can just type a lot of words, without knowing anything about what is happening in the background, and still get useful results.
They wanted to let people use a precise way of searching so they kept the Advanced search -- which is really how the whole mechanism works in the background, using the ANDs, ORs, etc. But they also wanted a way that the ordinary person just coming to this search engine for the first time could get useful results, even in the case, which happens so often, that they've never looked at the help files.
One of the interesting things about Simple search is that I could enter a long string of words -- say 30 of them -- with no punctuation. I just type in any word that has anything to do with whatever I'm trying to find. If I then click, I might get a result back that says there were 5 million matches. But I'd have nothing to worry about, because those pages that have all 30 of those words will be the ones that are on the top of the list. And way down at the 5 million mark would be the ones that only had one of them. So the great likelihood would be that the things that I'm looking for would be right up front. And the more words I type in, the better the results I get.
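The ranking behavior described above can be sketched in a few lines of Python. This is only a rough illustration, not AltaVista's actual ranking algorithm: the function name `rank_by_matches` and the sample pages are hypothetical, and the only signal used is how many distinct query words a page contains.

```python
def rank_by_matches(query, pages):
    """Order pages so those containing the most distinct query words come first.

    `pages` maps a page name to its text. Pages matching no query word are dropped.
    """
    words = set(query.lower().split())
    scored = []
    for name, text in pages.items():
        page_words = set(text.lower().split())
        score = len(words & page_words)
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties are broken alphabetically by page name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Hypothetical sample pages, invented for illustration.
pages = {
    "a.html": "steam train schedule south africa",
    "b.html": "train schedule",
    "c.html": "cooking with steam",
    "d.html": "nothing relevant here",
}
print(rank_by_matches("steam train schedule africa", pages))
```

The page with all four words lands first and the page with only one lands last, which is why adding more words to a Simple search tends to help rather than hurt.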
Note that on this page, the Web is the default. They keep adding alternatives and changing the layout of the home page, but today's other choices include:
You can also click on the down arrow beside "any language" and choose to limit your search to content in one particular language. For instance, I could search for only documents that are in French or Serbo-Croatian or Korean.
The underlying Web index at AltaVista understands nothing about any language. At the basic index level it is dumb, and that dumbness is a tremendous strength. Search engines that are built around the syntax of any given language lock themselves out of the rest of the world. AltaVista just captures all the text that it finds. Within a couple weeks of when they went live, they were surprised to get email from some people in Korea who had gone in and using their Korean keyboards had typed in queries and had gotten good results to Korean pages. AltaVista was simply capturing the underlying code.
You run into some problems with Russian and some Asian languages, where there are a variety of encodings of the same language. For help with searches in those languages, click on Customize, then on Language Options and then choose the type of encoding you need for the language you want to use. Basically, in most cases, this gives you what you need. That's an amazing power.
There are many ways in which this basic notion of a seemingly disorderly index is a source of great power. I'll mention other instances of that as we go along.
Now say in this string of words you've chosen, some of those words are more important than others. There might be two or three terms that you really absolutely want to have in the result. In that case, put a plus sign in front of the terms that you really want. There also might be words that you know might be confused with what you want. For instance, if I were doing a search for "Digital," I might want to exclude "watches". So I would put a minus in front of "watch."
It's also very helpful, when you've done a quick search and you've seen that the first few items in the results list were things that you felt really shouldn't be there -- they weren't the kind of thing you were looking for -- see what's in common about them. Then add one or more terms to your query with a minus sign in front of each, to eliminate those when you submit your query again.
So with pluses and minuses you can focus your search very quickly.
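As a rough sketch of how pluses and minuses narrow a result set, the following Python treats a plus term as required and a minus term as forbidden. The function name `filter_plus_minus` and the sample pages are hypothetical, words are compared exactly (so "watch" does not cover "watches"), and unmarked terms are ignored here because they would only affect ranking.

```python
def filter_plus_minus(query, pages):
    """Keep pages that contain every +term and none of the -terms."""
    required, excluded = set(), set()
    for term in query.lower().split():
        if term.startswith("+") and len(term) > 1:
            required.add(term[1:])
        elif term.startswith("-") and len(term) > 1:
            excluded.add(term[1:])
        # Unmarked terms are optional; this sketch does not use them.
    kept = []
    for name, text in pages.items():
        page_words = set(text.lower().split())
        if required <= page_words and not (excluded & page_words):
            kept.append(name)
    return sorted(kept)


# Hypothetical sample pages, invented for illustration.
pages = {
    "x.html": "digital equipment computer",
    "y.html": "digital watch sale",
    "z.html": "computer store",
}
print(filter_plus_minus("+digital -watch computer", pages))
```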
For instance, if I search for my own name, if I just type the words Richard Seltzer (separated by a space), I would get every instance of either Richard or Seltzer. That would be a huge number, and it wouldn't be particularly useful to me.
If I put quotation marks around a set of words, like "Richard Seltzer", that tells AltaVista to look for that phrase -- those particular words in that exact order. That's a very powerful capability, made possible by the fact that AltaVista indexes every single word it finds.
Before AltaVista came along, some search engines relied on knowledge of syntax of a particular language to try to cut down on how much work they were asking their machine to do, and to cut down on the size of their index. They might throw away the little words, like: a, and, the, or, but... AltaVista doesn't throw away any words. It keeps all of them. And because it keeps all of them and remembers not just what the words are, but also their exact order on the page, I could go in and type in a paragraph or longer out of a book, with quotation marks around it, and do a search to see if somebody has plagiarized my work.
There are a lot of things you can do to refine your search based on this capability. In this case, instead of getting back a couple hundred thousand matches for Richard space Seltzer, with a search for "Richard Seltzer" in quotation marks, I get back about 500; and most of the ones at the top of the list are, in fact, about me. (Later, I'll talk about how to make that happen.)
Consider the quotation "to be or not to be". If somebody had made the syntactic decision that little words didn't mean anything, none of those words would be in the index. But, as it is, they are all there, and you can search for that quote, or for any other quote.
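To see why keeping every word and its position makes phrase searching possible, here is a minimal sketch of a positional index in Python. It is an illustration under simplifying assumptions, not a description of AltaVista's internals: the names `build_index` and `phrase_search` and the sample pages are hypothetical, and words are simply split on spaces.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to a list of (page, position) pairs, keeping every word."""
    index = defaultdict(list)
    for name, text in pages.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((name, pos))
    return index


def phrase_search(index, phrase):
    """Return pages where the words of `phrase` appear consecutively, in order."""
    words = phrase.lower().split()
    if not words:
        return []
    # Candidate starting points: every place the first word occurs.
    candidates = set(index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        positions = set(index.get(word, []))
        candidates = {
            (name, pos) for name, pos in candidates
            if (name, pos + offset) in positions
        }
    return sorted({name for name, _ in candidates})


# Hypothetical sample pages, invented for illustration.
pages = {
    "hamlet.html": "to be or not to be that is the question",
    "other.html": "not to be confused or to be ignored",
}
index = build_index(pages)
print(phrase_search(index, "to be or not to be"))
```

Both pages contain all of the little words, but only the first has them in the right order. An index that discarded "to", "be", "or" and "not" could not tell the two apart.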
When you get up to the size that the index is now, this becomes very interesting. AltaVista recently grew its index. It now has over 250 million pages. There's a lot of information there. And if you haven't been there in the last month or two, you ought to take another look. It's much richer than it was before.
NB -- In Simple Search, AltaVista now recognizes when a series of words often appears as a phrase, and automatically treats those words as a phrase, regardless of whether you put quotation marks around them. For instance, New York City and John Smith would be treated as phrases. But purple dog and Francine McIntyre would not. To know how your query is being treated, look below the list of results pages for the "word count". If you get separate counts for separate words, they are being treated separately. If they are grouped, those are the phrases that have been automatically generated. To force the search engine to consider words separately, place a + sign in front of each of them (e.g., +New +York +City).
If you type a word all in lower case, AltaVista will search for both lower case and upper case. But if any letter is in upper case, it looks for that and only that.
Marketing people like to put capital letters in strange places to make the spelling of their brand names unique. For instance, Digital had a product called eXcursion, spelled with a capital X. If I search for eXcursion, I get exactly what I want, because only mentions of that product will match that unique spelling.
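The capitalization rule can be written down as a tiny Python function. This is a sketch of the rule as stated above, with the hypothetical name `case_matches`; it compares one query word against one word found on a page.

```python
def case_matches(query_word, page_word):
    """Apply the rule: an all-lower-case query matches any capitalization;
    a query containing any capital letter matches only that exact spelling."""
    if query_word.islower():
        return query_word == page_word.lower()
    return query_word == page_word


print(case_matches("excursion", "eXcursion"))
print(case_matches("eXcursion", "excursion"))
print(case_matches("eXcursion", "eXcursion"))
```

The lower-case query finds the product name along with the ordinary word, while the query with the capital X finds only the product.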
If you talk to a reference librarian, if you have a question that has many possible ways to answer it, the librarian will be happy -- pull a book off the shelf, give you an answer, and you walk away as a happy customer. But if you ask for something that is rare and difficult to find, that librarian is going to start tearing her hair out, and it's going to take a long time and be very painful to get that answer.
With AltaVista, the more rare, the more unique, the more hard to find something is, the easier it is to find. Because there are over a hundred million pages in the index, chances are excellent that what you want is mentioned. And because it's rare, you are likely to get only a few matches and they are likely to be exactly what you want.
So if the doctor gives you a prescription and you'd really like to know what it is before you take it, enter that ridiculously long word and you'll find it here.
If you get a strange error message on your PC or your workstation -- it's a crazy combination of letters and numbers -- type that in at AltaVista. Chances are you'll probably find somebody talking about that problem.
The more rare it is, the easier your search is because you don't have to worry about how to refine your search. What you want is just going to be there at the top of your list.
And using capitalization, like using quotation marks, is a way of making your request rare.
AltaVista was developed by researchers at Digital Equipment's laboratories in Palo Alto, CA. They were trying to climb Mount Everest. They wanted to take on a challenge and that was the biggest challenge that they could find. The challenge as they perceived it was to make it possible for millions of people to get to tens of millions of pages, and to get really quick results. They weren't out to arrive at academic purity. They were trying to get practical results. So when they were faced with an instance like this -- yes, they would like people to use an asterisk, because wildcards are handy. But if they let you put it at the beginning of a word, it would mean that they would have to search through 100 million pages to answer your request. So they arrived at an interesting compromise, and they might change the details at some point as circumstances change. If you type in three characters first, then you can use an asterisk to stand for up to five characters. So if I want to search for the English spelling as well as the American spelling of a word like color (colour), I can throw in an asterisk where the "u" would be (colo*r). Or very often, I use the asterisk to stand for the plural, so I can search for both singular and plural at the same time (dog*).
You can also use an asterisk for an element in a phrase. If I have a phrase in quotation marks, like "one if by land and two if by sea", and maybe I can't remember one of those words, then I can throw in an asterisk to stand for a missing word ("one if by * and two if by sea").
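The single-word wildcard compromise (at least three characters first, then a wildcard standing for up to five characters) can be approximated with a regular expression. The sketch below is an assumption about how the rule might be expressed, not AltaVista's code: the names `wildcard_to_regex` and `wildcard_match` are hypothetical, the wildcard is taken to cover zero to five letters or digits, and the missing-word form inside a quoted phrase is not covered.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a pattern such as 'colo*r' into a compiled regular expression.

    Requires at least three characters before the first wildcard, and lets
    each wildcard stand for zero to five word characters.
    """
    star = pattern.find("*")
    if star != -1 and star < 3:
        raise ValueError("need at least three characters before the wildcard")
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(r"\w{0,5}".join(parts))


def wildcard_match(pattern, word):
    """True if the whole word fits the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None


print(wildcard_match("colo*r", "color"), wildcard_match("colo*r", "colour"))
print(wildcard_match("dog*", "dog"), wildcard_match("dog*", "dogs"))
```

Note that dog* also matches words such as "dogma", so a wildcard used for plurals can bring in more than the plural.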
Here's a large query box (labelled "Boolean") and a small ranking box (labelled "Sort by").
What's the difference between query and ranking? The best example that I can think of has to do with cooking. I'm a lousy cook. I know nothing about cooking. I know so little about cooking that I can't use a cookbook, because I don't know what the dishes are called. I wouldn't know what a sautéed something or other was. But I can go to AltaVista in Advanced search, and put recipe in the query box, and in the ranking box I can put in a list of everything that happens to be in the refrigerator right now. I submit the query, and the things at the top of the list are those recipes that have most of those ingredients in them. I'll probably learn something about categories by doing that, but I didn't need to think of the world that way to get good results.
Another example -- say you need to hire somebody. Go to Advanced search and type "resume" in the query box; and in the ranking box, just list the qualifications that you are looking for. There are over a million resumes out on the Web today -- just plain HTML pages that AltaVista has indexed. So when I do this, the people who have most of those qualifications will be right on the top of the list, right off the bat.
It's quick and simple. And it's a different way of thinking, because you don't have to think of categories.
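The split between the query box and the ranking box can be pictured as two separate steps: one decides which pages are included at all, the other decides only their order. The Python below is a simplified sketch of that idea with a single query word; the name `advanced_search` and the sample pages are hypothetical, and real ranking would weigh more than a simple count.

```python
def advanced_search(query_word, ranking_words, pages):
    """Include only pages containing `query_word`, then order them by how many
    of the `ranking_words` they contain."""
    query_word = query_word.lower()
    ranking = {word.lower() for word in ranking_words}
    results = []
    for name, text in pages.items():
        page_words = set(text.lower().split())
        if query_word not in page_words:
            continue  # the query box decides what is included at all
        score = len(ranking & page_words)  # the ranking box only affects order
        results.append((score, name))
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in results]


# Hypothetical sample pages, invented for illustration.
pages = {
    "stew.html": "recipe beef carrots onions potatoes",
    "salad.html": "recipe lettuce carrots",
    "ad.html": "carrots onions potatoes on sale",
    "cake.html": "recipe flour sugar eggs",
}
print(advanced_search("recipe", ["carrots", "onions", "potatoes"], pages))
```

The grocery ad has all three ingredients but is left out because it is not a recipe, and the cake recipe is kept but sinks to the bottom.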
We're so used to thinking the way libraries are organized, like with the Dewey Decimal System. Does it belong in this niche, or this niche, or this niche? But you don't have to think about niches. That's part of the power of it. You don't need to know how the information is organized to find the information you want.
For instance, I can specify the date -- the date that a Web page or a newsgroup item was posted. When I was doing research for the book, one of the very first things I did was search newsgroups. I searched day by day by day for every instance of "AltaVista", because I wanted to know not just what the developers thought about it, but what the users were using it for, which isn't always the same thing. (Unfortunately, they have now taken away the newsgroup search capability at AltaVista; so it would be a lot harder to write a book of that kind today. But the search-by-date capability is still there for doing Web searches in Advanced.)
Also, the default in Advanced Search is to search without site compression. In other words, your results list includes all pages that match, even if there are many from the same Web site. If you like, you can turn that feature on and off with a click. (In Simple Search, you have no choice. Site compression is on all the time.)
When AltaVista first came out, some of the most interesting articles written about it were totally bogus. Reporters got the wrong idea and got enthusiastic in the wrong direction. You found articles in the Wall St. Journal and the New York Times writing things like, "We did this search at AltaVista and it showed that John Lennon is more popular than Jesus Christ." They were just looking at the number of matches when they did searches. Well, when the developers were designing AltaVista, as part of the ranking mechanism, they needed to have a rough feel for how many of these things there were, before they put them in any kind of order. And they only needed a rough feel. An order of magnitude was pretty good. And if they were within a factor of two, that was great. But reporters weren't taking it that way. They were taking those numbers literally.
There was also a student up in Canada who did a very clever program when AltaVista first came out, that again was totally bogus, unfortunately. He wanted to track instances of Windows 95 being mentioned on the Web. So he wrote a little program that went in every day and did that search at AltaVista and brought the number back and plugged it into a graph so he could see over time, day by day, how many mentions there were. But it was totally random. He could have had a hundred thousand now and a minute later done the same search and come out at fifty thousand. And as far as the developers were concerned -- hey, that's in the right ballpark.
So you have to be careful before jumping to conclusions.
Over the years, the AltaVista developers have improved the accuracy of the count, but I still wouldn't bet my life on it, because if the machine happens to be busy at the time you go to it, they will truncate the count. So it is still imprecise, but it is better. And if you do it at an odd time of day, it will be more reliably better.
Most of the Boolean Algebra operators have their equivalents in Simple search. NEAR is something you can only do in Advanced.
Why would I want to use NEAR?
I showed you the example of searching for myself, with my name as a phrase, in quotation marks -- "Richard Seltzer". The problem of doing a search like that is, it wouldn't capture any of the odd instances like Seltzer, Richard or Richard W. Seltzer. Richard NEAR Seltzer gives every instance of the words Richard and Seltzer within ten words of one another and in any order.
Actually, I had a case just a week ago where I was trying to find an old friend, let's call her "Elaine Wilcox." Her phone number was unlisted. I knew she was a professor somewhere in Illinois and hence was probably listed on the Web. But a search of her first name and last name in quotes didn't get her. Then when I used Advanced search and entered Elaine NEAR Wilcox, all of a sudden I found her right away in a list of the faculty, but her name appeared there with an initial in the middle, like Elaine I. Wilcox, so the other kind of searching would have never found this instance.
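A proximity test like NEAR is easy to sketch once you have the positions of the words on a page. The Python below is an illustration only: the function name `near` is hypothetical, punctuation is stripped in a crude way, and exactly how the ten-word distance is counted is an assumption.

```python
def near(text, first, second, max_distance=10):
    """True if `first` and `second` occur within `max_distance` words of each
    other, in either order."""
    words = [w.strip(".,;:()\"'").lower() for w in text.split()]
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


print(near("Faculty: Elaine I. Wilcox, Professor", "Elaine", "Wilcox"))
print(near("Seltzer, Richard", "Richard", "Seltzer"))
```

Both forms would be missed by an exact phrase search, since the middle initial or the reversed order breaks the phrase.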
Hence, when you are using AltaVista, don't just presume that your objective is to get a small number of matches. You ought to make sure that your search has captured the full range of what you are in fact looking for and that you haven't, just by the way you constructed it, cut out the very thing that you are looking for.
I mention this command here to help you understand why you might sometimes want to see more than 200 matches.
link: followed by a domain name or a complete URL gives you every Web page that has a hypertext link to a particular site or to a particular page. That is very useful information. It's useful if you are running your own Web site and you want to know who has links to your site -- those are folks that you want to get in touch with, you want to know what they are doing, you want to know why they are linking to you; you might want to link to them. Also, if you change the address of a page of yours, if you change your directory structure, you'd like to know what pages have hyperlinks to the old addresses, so you can go to the Webmasters of those pages and tell them so that they change to the new addresses, so you don't lose that link.
Normally, when I'm going through AltaVista, whether in Simple search or Advanced search, it may say that there were half a million or ten million matches, but I'm only going to see 200 of them. I'll see 10 per page. I can click at the bottom of the page -- next, next, next... But after I've gotten 200, clicking for next won't give me anything more.
Once again, the developers are trying to serve the needs of the greatest number of people, as efficiently as possible. Well, that 200 limit is going to serve the needs of 99.999...% of people. Very rarely do you want to go beyond that. But there are some times when you need to go beyond that. And link: is one of those times.
For instance, if I search for link:samizdat.com, I see there are over 700 Web pages with hyperlinks to pages at my personal Web site. That's all good information. I don't want to throw away any of them, and the ranking doesn't make any difference whatsoever to me. Every one of those is equally important to me.
So they have made it so if in Advanced search I enter a query and leave the ranking box blank, they will give me all the results. Once again, it's a compromise. You have said, "I don't need to have the ranking." So that is saving them cycles. So in return, they are willing to give you all the results. You go to the bottom; you hit next; and you will get more than 200.
You can also use parentheses to group complex queries.
Let me give you an example of that. I'm not sure how far in you can nest one set of parentheses inside another, but it can go pretty far.
In this example -- Digital has always been a little bit schizophrenic, was it Digital? or DEC? So if you wanted to use the link: command and wanted to get complete results, then you could have
(link:digital.com OR link:dec.com) AND NOT (host:digital.com OR host:dec.com)
host: means what pages are at that site. So with this query, I'm saying, "Give me all the pages that link to Digital that aren't at Digital's own sites." Because why should I want to know about our own internal links. I just want to know those folks who aren't at Digital who are linking to Digital pages.
That query gives you a sense of how you can organize your thoughts. With parentheses, you can say what operation you want performed before what other operation.
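One way to read that query is as set arithmetic: OR is a union, and AND NOT is a set difference, with the parentheses fixing the order. The Python sketch below models this with a handful of invented pages. The helper names `host_is` and `link_to` and all the example addresses are hypothetical, and matching a domain by the end of the host name is a simplification.

```python
# Hypothetical pages: each records its own host and the hosts it links to.
pages = {
    "http://www.digital.com/index.html": {"host": "www.digital.com", "links": {"www.dec.com"}},
    "http://fan.example.org/dec.html": {"host": "fan.example.org", "links": {"www.digital.com"}},
    "http://news.example.com/story.html": {"host": "news.example.com", "links": {"www.dec.com", "www.mit.edu"}},
    "http://other.example.net/": {"host": "other.example.net", "links": {"www.mit.edu"}},
}


def host_is(pages, domain):
    """Pages that live at the given domain (the idea behind host:)."""
    return {url for url, page in pages.items() if page["host"].endswith(domain)}


def link_to(pages, domain):
    """Pages that link to the given domain (the idea behind link:)."""
    return {
        url for url, page in pages.items()
        if any(target.endswith(domain) for target in page["links"])
    }


# (link:digital.com OR link:dec.com) AND NOT (host:digital.com OR host:dec.com)
result = (link_to(pages, "digital.com") | link_to(pages, "dec.com")) - (
    host_is(pages, "digital.com") | host_is(pages, "dec.com")
)
print(sorted(result))
```

Digital's own page links to dec.com, so it matches the first half of the query, but the AND NOT half removes it, leaving only the outside pages.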
You don't need to capitalize the operators -- AND, OR, AND NOT, NEAR. I just do that for convenience and not to get confused. Lower case would work just as well. It's a matter of personal taste.
Sometimes you know or suspect that the information you want is at a particular site. For instance, if you wanted to limit your search to pages at MIT, in Simple search, you could begin your query with +host:mit.edu
By the way, if all your pages are indexed at AltaVista, you can use host: to provide an index of your own site. For instance, +host:samizdat.com will list every Web page at my site. And +host:samizdat.com followed by query terms will launch a search that is limited just to my site. You can use the same technique I used here for the examples to put a hyperlink at your site which automatically launches a search of your site at AltaVista. Just tell your visitors to enter their query after the term +host:domainname which will automatically appear in the query box.
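A link that launches a prepared search is just a URL with the query encoded into it. The sketch below shows the encoding step in Python. Both the address in `SEARCH_BASE` and the parameter name `q` are placeholders assumed for illustration, not the real AltaVista address; copy the actual form from the address bar after running a search by hand.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter name, used only for illustration.
SEARCH_BASE = "https://search.example/query"


def site_search_link(domain, terms=""):
    """Build a URL whose query is limited to one site with +host:domain."""
    query = f"+host:{domain} {terms}".strip()
    return SEARCH_BASE + "?" + urlencode({"q": query})


print(site_search_link("samizdat.com", "tutorial"))
```

The plus sign and the colon have to be percent-encoded, because a bare plus sign in a URL query string is read as a space.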
I very rarely use text: or title: But with those commands, you can limit your search to that part of the Web page.
anchor: is for the words that are highlighted on a Web page that you click on to go to another page. You might remember that once you clicked on a certain set of words on a particular page and you want to go back there. Well, you could search for anchor: followed by those words.
With url: I can limit a search to a particular directory. Maybe someone has an internal site on our intranet or has a hosted site at an ISP and hasn't registered for his own domain name. In that case url: followed by the main directory for that person's pages will give me a list of all the indexed pages that are there. Or I can check to see if a particular page is in the index at AltaVista by typing url: followed by the complete address of that page.
domain: here means .edu, .com, .net, and also all the identifiers for different countries. So I can limit a search to pages in France, for instance, by beginning my query with +domain:fr.
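The difference between host:, domain: and url: comes down to which part of a page's address is examined. The Python sketch below makes that concrete under some assumptions: the function name `matches_field` and the example addresses are hypothetical, host: is taken to match a host name or anything under it, domain: the last label of the host name, and url: any text within the address.

```python
from urllib.parse import urlparse


def matches_field(field, value, page_url):
    """Check one field restriction against a page address."""
    host = (urlparse(page_url).hostname or "").lower()
    value = value.lower()
    if field == "host":
        return host == value or host.endswith("." + value)
    if field == "domain":
        return host.rsplit(".", 1)[-1] == value
    if field == "url":
        return value in page_url.lower()
    raise ValueError(f"unsupported field: {field}")


print(matches_field("host", "mit.edu", "http://web.mit.edu/index.html"))
print(matches_field("domain", "fr", "http://www.example.fr/page.html"))
print(matches_field("url", "example.com/~smith", "http://www.example.com/~smith/home.html"))
```

Note that domain:fr looks only at the end of the host name, so a page at a .com address with "fr" somewhere in its path would not match.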
Keep in mind that the more you know about the Internet, the more effective and useful your searches are likely to be. For instance, today, companies in third-world countries tend to put their Web sites in the US or Europe. They don't put them in their own country because the lines are too slow. So doing a search for domain:co, does not give me a picture of Web sites of companies in Colombia. Rather it shows a very small and random subset.
image: is the name of a graphic image. Today, AltaVista only indexes text. But it does index the names of images. That comes in handy. For instance, you may wish to pull in a picture of Jupiter to add to your site. Just do image:jupiter or image:jupiter.gif or image:jupiter.jpg, and you'll get a list of all the images that have that for a name, some of which are likely to be public domain pictures from NASA.
applet: is a handy command. I can do applet:* and it will give me a list of every Java applet on the Web. And, of course, I can limit it. I could do +domain:fr +applet:* and ask for every Java applet on sites in France. In other words, you can combine these commands in interesting and powerful ways.
object: is very much like applet: only it's for Active-X objects. (There are nearly six times as many applets as Active-X objects on the Web today.)
The applet: command is also useful if you have written applets yourself and put them up on the Web. Maybe you did it for the good of humanity and you were hoping other people would pick them up and use them. Well, you could search by the name of the applet, because people would probably have kept the name. Or maybe you put it on the Web and you didn't want anybody to steal it. Well, you could search for the name; people would probably still have kept the name, and you'll be able to find the copies that way.
By the way, searches intended to show many pages from the same site (host: and url:) automatically turn off the site compression feature in Simple Search. For instance, host:mit.edu lists all the Web pages in the domain of the Massachusetts Institute of Technology, not just one of them.
And if I wanted to, if I had my own Web pages, I could set up my page with anchors that say basically, "if you are interested in X, click here". And I would have carefully constructed all these great queries that are going to get people exactly the kinds of things they want out of newsgroups. And they don't have to learn the commands at AltaVista. I've just done this for them.
Keep in mind that back in the early days of the Web, many pages consisted of nothing more than lists of hyperlinks to other pages. People would post their carefully constructed lists of pages on particular topics. The problem was that those lists soon got out of date -- because so many new pages were added to the Web every day and so many URLs went out of date as Webmasters moved files around. So now, rather than scramble to try to keep such a list current, you can try to construct a query at AltaVista which produces similar results, and put a link from your page to that particular query, so visitors at your site who click on that link will get the latest results.
I mentioned the "count". One practical use of that is as a spelling checker. For instance, in fast-changing fields -- like the Internet -- we're constantly coming up with new terms. What's the right way to spell those words? What's the proper usage? I can use Advanced search, click on "count only", and put in the different variant spellings that I can imagine and see how each of them scores. If there's an order of magnitude difference, there's my answer. I don't have to wait ten years for dictionaries to catch up. This is what current usage is.
This tutorial and the companion piece How to use AltaVista Search to improve your Web site, (which you should read first), are based on my research for the book The AltaVista Search Revolution (published by Osborne/McGraw-Hill), my experience with my own personal Web site, and what I have learned recently doing consulting/writing for AltaVista. This tutorial also served as the basis for the one that I wrote for the AltaVista site and which they have posted in their Advanced Search area.
The combination of personal Web pages and full-text search, with AltaVista, makes for a very interesting kind of environment. There would be no point in having the personal Web pages if nobody's ever going to find you. But when you do have this capability, then all of a sudden, being found becomes very important to you, and the Internet becomes much less like a library, and much more like a social event -- where plain text pages become an invitation for dialogue.
I'm going to be talking a lot from my personal experience. I have my own personal Web pages that I get free from an ISP (Acunet.net in Marlboro, MA). I run a little site with about 15 Megabytes of text. I have over 1000 documents, some of which are entire books, and I make sure that all my pages are well indexed at AltaVista. With no advertising, I get an average of over a thousand visitors a day and over 80,000 page views per month. Keep in mind that if you don't put graphics up on the Web, you can fit an awful lot of stuff into about 15 Megabytes. In fact, you can put 30 copies of Huckleberry Finn in that space, if you don't use graphics. And text -- lots of text -- attracts visitors, if your pages are well indexed at search engines. You have a choice: putting up 30 books worth of information on the Web for free and attracting an audience for free, vs. maybe putting up a dozen pictures of your family and your pets and getting only the friends and relatives that you badger into visiting. So if you do get your own personal Web pages, keep in mind what the power of that could be if you don't limit yourself.
How is your information going to get into AltaVista, and then how are people going to find you? And how can you use AltaVista as a tool to improve your pages, to get more traffic to your site, to get your pages higher on the ranking lists when people are searching for things that are related to your content and your business?
AltaVista has an index that is built by sending out a crawler (a robot program) that captures text and brings it back.
The main crawler is called "Scooter." (It now has a few cousins, too, which have specialized jobs to do to help keep the index current, such as checking for "dead" links -- pages that have been moved or gone away and should be removed from the index.) Scooter sends out thousands of threads simultaneously. 24 hours a day, 7 days a week, Scooter and its cousins access thousands of pages at a time, like thousands of blind users grabbing text, pulling it back, throwing it into the indexing machines so the next day that text can be in the index. And at the same time, they pull off, from all those pages, every hyperlink that they find, to put in a list of where to go to next. Because, of course, there is no one central registry for Web pages. When you create a Web page, you don't have to tell a soul. You just put it up there. There is no central place for the crawler to go to and say, "Tell me about all the pages out there." No, it has to discover the pages by going from link to link to link. Because of that, you can't predict with assurance when AltaVista might find a new page of yours.
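The link-to-link discovery just described can be pictured as a waiting line of addresses still to be visited. What follows is a minimal, single-threaded sketch of that idea, not Scooter's actual code: the crawl function and the toy "web" dictionary (standing in for real pages and their hyperlinks) are hypothetical names made up for illustration.

```python
from collections import deque

def crawl(web, seeds):
    """Visit pages breadth-first, discovering new ones only through links.

    `web` is a toy stand-in for the real Web: it maps each URL to the
    list of URLs that page links to. Returns the URLs in visit order.
    """
    to_visit = deque(seeds)      # the "where to go next" list
    seen = set(seeds)            # URLs already queued, to avoid repeats
    visited = []
    while to_visit:
        url = to_visit.popleft()
        visited.append(url)      # a real crawler would fetch and index the text here
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)   # newly found pages go to the end of the line
    return visited

# Hypothetical four-page Web: nothing links to "d", so it is never found.
web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["a"],
}
print(crawl(web, ["a"]))
```

In this toy example page "d" links out to other pages, but no page links to it, so a crawl that starts anywhere else never discovers it. That is the situation of a brand new page that nobody has linked to yet.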
Imagine you are playing a game with about 500 million pages out there and you have a few thousand threads going all the time. A thousand is a big number, but it's nowhere near as big a number as 500 million. So what do you think the odds are that it's going to find your page in the next week? in the next month? or even in six months?
Yes, in a typical day Scooter and its cousins visit over 10 million pages. If there are a lot of hyperlinks from other pages to yours, that increases your chances of being found. But if this is your own personal site, or if this is a brand new Web page, that's not too likely.
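As a rough back-of-the-envelope check, here is the arithmetic using only the round figures quoted above (about 500 million pages, over 10 million visits a day). It assumes, unrealistically, that no page is ever visited twice, so treat it as a best case rather than a measurement.

```python
# Round figures from the text; the "no page visited twice" assumption
# is only for illustration and makes this a best-case estimate.
pages_on_web = 500_000_000
pages_visited_per_day = 10_000_000

days_for_one_pass = pages_on_web / pages_visited_per_day
print(days_for_one_pass)  # 50.0
```

Even in that idealized case a single pass takes the better part of two months, and any revisits to pages the crawler already knows about would stretch it further.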
So what can you do? We'll go over this in greater detail later, but, basically, you can go to AltaVista and at the bottom of the page, click on Submit A Page (see below for details), and simply type in the URL of your new page. The crawler will immediately fetch that page and hand it off to the index machines to be added to the index, probably by the next day. So instead of waiting for this random, many-month process, you can take control of when your page is indexed.
My personal view is -- ignore the directions at AltaVista where it says only type in the URL for your home page. If you do that, once again, the pages that are linked to directly from your home page will be put at the end of this huge line of what to go to next. And after they are fetched, then the URLs in those pages will go to the end of the list; and so on, until your whole site is found and indexed. If you want control over the process, if you want all your pages in the index tomorrow, then you should add each and every page individually.
Full-text index is a very important concept.
Large companies often misunderstand how AltaVista works.
There is such a thing as a "metatag" for "keywords", which really confuses Webmasters. Webmasters often think that AltaVista and other search engines only search for things that appear in metatags -- special commands embedded in the header of a Web page. That is not the case. Every word on the page counts.
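To make "every word on the page counts" concrete, here is a minimal sketch of a full-text (inverted) index. It illustrates the general idea only and is not AltaVista's implementation; the function names, the sample pages, and the simple rule for splitting text into lowercase words are all assumptions.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map every word found in every page's text to the URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain all of the words in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical pages: no keywords or categories were assigned to either one.
pages = {
    "http://example.com/halloween.html": "An old article about Halloween customs",
    "http://example.com/wine.html": "Notes on premium wine and old vineyards",
}
index = build_index(pages)
print(sorted(search(index, "old")))
print(sorted(search(index, "halloween customs")))
```

Nobody chose keywords for these pages. A word as ordinary as "old" finds both of them simply because it appears in the text.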
The implications of full-text search only became clear to me when I was talking to Brian Reid, director of one of Digital's research laboratories in Palo Alto, California. He mentioned that he had saved every email message that he sent or received for over 15 years. He just threw them on a disk. He's in a research environment; he has as much disk space as he wants. He didn't bother to put them in any directory structure. He didn't bother to name them. He didn't have to. He just threw them on disk. He knew that with the direction of technology sooner or later that would be valuable to him. Along comes AltaVista, and he can find anything he wants. He can search by date, by a person's name, by a phrase, by anything he wants and get what he wants immediately.
By the way, now, there is a free personal version of AltaVista Search that you can download from the AltaVista site. It will index all of your mail, and your Word files, and your text files, and your HTML files. It will index a number of file types that can't be indexed at the public site.
What I got from talking to Brian Reid was a sense of the value of disorganized information.
We have been trained to think that order is good and disorder is bad, but there are times when that isn't the case.
There are interesting possible applications for full-text indexing as a complement to databases.
Many people have been trained to think, "If I have a lot of information and need to retrieve it later, the only way to handle that is with a database." That means you have to define fields, categorize information, etc. There's a lot of work that goes into setting that up, and a lot of maintenance work.
In the AltaVista style, there are no categories and there is no maintenance. It's all there in flat disorganized files, and that disorder is valuable -- because basically any time that you use your human intelligence to make judgements on information to split it up into categories, you are making that information less accessible and less flexible in the long run.
Think of the old-fashioned work environment where there were rows and rows of file cabinets. And a clerk would go through, very carefully filing things, day after day. Eventually, he gets a gold watch and goes away, and suddenly hundreds or even thousands of files will never be found -- no matter how well that person followed the rules.
Likewise, with a database, you are putting information into pigeonholes. What happens when the categories of the world change?
Five years ago, there was no Web. Many of the categories that we normally use in our day-to-day business lives today didn't exist five years ago. Any set of information that we categorized back then would be much less useful to us today. And five years from now, the world will have changed again.
There are many ways in which you can have better access to information if you don't have to categorize it.
Another thing to keep in mind -- which is more a function of how search engines work -- is that public Internet search engines don't index information in databases.
When I talked about Advanced search [in the companion speech, "I want to find"], I used the example of recruiting employees with special qualifications. I showed how you could enter the word "resume" in the query box and then list all the qualifications you are looking for in the ranking box, and those resumes with most of those qualifications would appear at the top of the list of matches.
Today there are thousands of Web sites devoted to jobs. Every one of those sites uses a database. And what are they putting into that database? Just text. They are just putting in resumes and job descriptions. That's information that would work beautifully in an AltaVista environment. But because they use databases, that information is not indexed at AltaVista.
So if I went to the trouble of entering my resume at 200 different job-related Web sites, a headhunter going to AltaVista and looking to hire someone with credentials just like mine will never find me, won't even know that I exist. But if I simply put my resume up as a plain text page on the Web and ADD URL at AltaVista, then I could be found very easily. And, in fact, the word "resume" appears on over a million Web pages in the AltaVista index.
It's an interesting phenomenon that in this case a database gives you less.
Full-text indexing is very interesting from another point of view as well. Think of all the vast sets of information in the world that have never been put into electronic form, because it would have been too costly to create a database for them, to categorize them. Think of the property records for the city of Boston. You know how much money lawyers get to do title searches whenever real estate changes hands. Well, with this technology, you could simply scan all that information -- page after page of it -- post it all as plain text Web pages, and ADD URL at AltaVista. Then I could search for "99 State St." and get a list of links to every page where that address was mentioned. Nobody would have had to organize that information at all, but I could get back everything that I wanted to know.
There are many huge sets of information like that. If you think of them in this different way, you can do things with them that you never imagined before.
Remember about eight years ago the Federal government suddenly decided that they needed to know what country everybody was born in. They needed to know, among all the people who are living and working in the U.S., who wasn't born here? So suddenly, employers were required to include this information in their Personnel databases so they could report on it. Nobody had it there. They all had to go back and rewrite their programs and contact all their employees and fill out the forms that way, because their databases were static. Now, if in addition to their databases, they had had, indexed in an AltaVista style, free-form interviews that a Personnel person could have had with people when they came on board -- "Tell me about yourself..." -- they could have captured much of that right off the bat. So unstructured information has value in unexpected ways.
If you let your imagination go on that concept, there are probably ways you could use that technology that had never occurred to you before.
This became clear to me from my own little site, back around the time when AltaVista first came out. In the fall of 1995, there were some other search engines out there that were already heading in this same general direction. On Halloween day, I went to a tele-seminar about the Internet. It was given by a couple friends of mine who are professors at the Harvard Business School. Being professors, they had to explain everything in terms of the whole history of mankind. They were trying to talk about how to make a successful business on the Web. And they took, as an example, Virtual Vineyards, which sells premium wines on the Web.
Their analysis of Virtual Vineyards was that, first, they don't own the infrastructure -- that's the Internet and that's available to anybody. Second, they don't own the product, because you could go down to the corner store and buy the same bottle of wine. So what is it that differentiates them? What makes this a successful business?
Their conclusion was -- the context of the user's experience. Then they showed a videotape with the people who had designed the Web pages and ran the business. They talked very proudly about why they put this button here and that one there. And the way they described it felt to me like a branching adventure story, where you are going down a long hallway and are you going to open the door here? or the door there? And are you going to fight the dragon or run for the hills? And it's a very long hallway, and after maybe 20 choices, then you are going to decide which wine you want to buy.
Now, just around that same time, I had a little article on my Web site about Halloween. This was something I had written about 20 years before, that had never been published. (I put everything on my Web site.) Suddenly, around Halloween time, I was getting three times as many hits on that article as I was on my home page. So it was beginning to dawn on me that something was happening. People weren't going through the front door. When you have full-text search engines, people can come in anywhere -- any page at your site is a potential entry point. There is nothing special about a home page. All pages are created equal, as far as search engines are concerned. And when most traffic is driven by search engines, don't waste all that time on a home page. Pay attention to each and every individual page.
So when I heard them talking about how they had designed Virtual Vineyards, I thought, "They are in trouble." Because all of a sudden, they didn't have control of that context anymore. People could come in through the windows or the back door and go anywhere.
There are still some instances today, but there were many back then, of Web sites that were designed like the old Burma Shave ads (if you're old enough to remember those). You'd drive along the highway and there would be a series of about a dozen signs, and there would be just a couple words on each, and they were about a hundred yards apart, and there was a really good punchline when you got to the end. Well, doing a search at AltaVista, I would suddenly be reading the seventh sign, and I wouldn't have a clue what the context was, or where to go to next, or what was going on. It was totally bizarre.
So if you have a Web site, what do you do in this environment?
Naturally, you need to provide navigation buttons on every page.
My basic design principle is "maximum content for minimum clicks." I don't want people to have to click 20 times to come to a decision. It shouldn't take more than three clicks -- two is even better -- to go from anywhere at a Web site to any other page at that same site. Keep it easy. Remember, it's easy for your visitors to click on a bookmark to go back to a search engine and go anywhere else they want on the Web. At any moment, there are thousands of competing pages out there. I want to give them the information they need fast, rather than wait for them to get frustrated.
Another interesting little trick -- you can do a search for host: followed by a domain name (like host:samizdat.com) to see every page from a particular Web site that is in the index. If I have been rigorous, and every time I add a new Web page or make a significant change to a Web page I go to AltaVista and ADD URL, then I can use AltaVista as a free index of my site.
This dawned on me when people were coming to me and saying, "You have a lot of good stuff at your site, but it's just too complicated. There are so many things there. Can't you put some search software on your site?" Well, I didn't want to spend any money on that. I didn't want to have to think of key words. I didn't want to have to go through all the maintenance problems when you are running your own separate search engine at your site. But AltaVista indexes it all automatically for me now.
First, I did it the simplest way. I made a hyperlink at my site. "Click here to launch a search at AltaVista and to search for only pages at this site. Just add your query after what you see there in the box." They'd click, get to AltaVista, and the query +host:samizdat.com would already be in the box. (The URL I linked to was a unique URL generated by doing a search at AltaVista for +host:samizdat.com. I simply did a cut and paste to make a hyperlink to that at my site.)
Later, somebody who had seen my site sent me a neat piece of code that let me do this using forms. So today it looks a bit more sophisticated. But it's a very simple notion -- if you have a small site, you don't need to buy software to make it easy for visitors to search your site, just piggyback off AltaVista.
Somebody could start a business in the next ten minutes, simply using AltaVista and going to Web sites and seeing the problems they have. Then send the Webmaster an email message with a quick diagnosis. "I've seen these kinds of problems at your site. I could help you fix them very quickly. Just send me a check for $200." It's just a marketing problem; the technology is sitting there at AltaVista.
For instance, I can do a search host: followed by the domain name, and then I can search for security terms -- company confidential, proprietary, top secret, etc. At a large site, you'd be amazed at how many instances there are of somebody accidentally putting up something that shouldn't have been put up.
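A checklist like that is easy to generate. The sketch below only builds the query strings to paste into the search box; it does not contact AltaVista. The helper name audit_queries is hypothetical, and wrapping each multi-word term in double quotes so that it is searched as a phrase is an assumption about the query syntax.

```python
def audit_queries(domain, terms):
    """Build one host-limited query per term, quoting each term as a phrase."""
    return [f'+host:{domain} +"{term}"' for term in terms]

# The terms come from the text; the domain is just an example.
terms = ["company confidential", "proprietary", "top secret"]
for query in audit_queries("samizdat.com", terms):
    print(query)
```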
Personal information -- many times people will include their Social Security Number, their home telephone number, their home street address -- information they probably don't want to be on the public Web. It was in a printed document before, and somebody just moved it onto the Web and didn't give it that extra second thought, "Do you want that on the Web?"
Any time a list of matches comes up at AltaVista, each item shows the date that the page was posted on the Web. So in a quick scan I can determine, "60% of pages at this site are more than six months old. And 25% are more than two years old."
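That quick scan is just arithmetic on dates. In the sketch below the dates are made up, as if typed in by hand from a list of matches; the function name and the cutoffs of 183 and 730 days for "six months" and "two years" are assumptions.

```python
from datetime import date

def share_older_than(page_dates, today, days):
    """Return the percentage of pages dated more than `days` days before `today`."""
    if not page_dates:
        return 0.0
    old = sum(1 for d in page_dates if (today - d).days > days)
    return 100.0 * old / len(page_dates)

# Made-up dates for four pages, as they might be copied from a results list.
page_dates = [date(1997, 3, 1), date(1999, 1, 15), date(1999, 11, 2), date(2000, 5, 20)]
today = date(2000, 6, 1)

print(share_older_than(page_dates, today, 183))   # more than six months old
print(share_older_than(page_dates, today, 730))   # more than two years old
```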
Before AltaVista came along, the HTML title was a throw-away. Nobody paid attention to it.
This isn't the name of the file. This isn't the name in big bold letters across the top. This is the HTML title -- part of the header for the document.
That used to be a relatively insignificant piece of information. And many folks putting together Web sites and doing cut and paste to use one page as a template for another can easily make mistakes. I've ended up with 3 or 4 or 5 pages all with the same title, because I forgot I was cutting and pasting. And it wasn't until the next time I took a look at my site with +host:samizdat.com at AltaVista that I realized that I had made that mistake.
You also can very easily put up a page without an HTML title. And even folks who are very knowledgeable about the Internet do that.
The other day I checked a very professional Internet publishing site and saw that of the two dozen pages from that site that were in the AltaVista index, more than half showed up as "No Title".
What's the importance of the HTML title? Two things.
First, when I get a ranking list -- that is what appears as the name of that page on the list. "No Title" doesn't really attract people to click to go to your page.
Second, in the ranking rules (which I discuss in more detail later), the HTML title is the number one, most important thing on your page. When people search for your kind of information, what are the words they are most likely to use? Those words belong in your HTML title and also in the first couple lines of text. When you leave that blank, or put unimportant words there, or you blunder and put the same title on many different pages, you've just thrown away the best way to get free traffic to your site.
Those are the kinds of things that you can tell about anybody's site very quickly.
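You can run the same check on your own pages before a search engine ever sees them. Here is a minimal sketch, assuming you already have the HTML of each page in hand; the class and function names and the three sample pages are hypothetical, and it uses only Python's standard html.parser module.

```python
from collections import Counter
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def get_title(html):
    """Return the page's HTML title, or an empty string if it has none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()

def audit_titles(pages):
    """Return (URLs with no title, titles used on more than one page)."""
    titles = {url: get_title(html) for url, html in pages.items()}
    missing = sorted(url for url, title in titles.items() if not title)
    counts = Counter(title for title in titles.values() if title)
    repeated = sorted(title for title, n in counts.items() if n > 1)
    return missing, repeated

# Hypothetical pages, including a cut-and-paste mistake and a missing title.
pages = {
    "a.html": "<html><head><title>Halloween Essay</title></head><body>...</body></html>",
    "b.html": "<html><head><title>Halloween Essay</title></head><body>...</body></html>",
    "c.html": "<html><head></head><body>No title here</body></html>",
}
print(audit_titles(pages))
```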
Let's take a closer look. Connect to AltaVista. Click on Submit a Site, then on Basic Submit. Then enter the code that you see -- the odd-looking numbers and letters. That's a mechanism to check to make sure that you are a human being, rather than a robot. They had lots of trouble handling enormous numbers of submissions that were automatically generated. This is a way around that. Then enter the URL of the page you want to submit. With one submission code you can enter five different URLs. Then, if you have more pages, you can enter a new code.
If you have a small site, ignore their Express submission service -- it is ridiculously expensive. Unfortunately, they are currently pushing that service (which sends a crawler to your site weekly to check the pages that you are paying for). And because they are pushing that, they have deliberately slowed down the free submission service. Whereas before, with the free service, you could be reasonably sure that your new pages would be in the index in 2-3 days, now they are running somewhere in the range of 4-6 weeks. That means that over time, the index will become skewed toward commercial sites that pay for the Express service, and that other content in the index will tend to be much older than before -- a trend that's likely to lose them users. Let's hope that they wake up and change before the damage is irreversible.
Keep in mind that you don't have to have any special authority to "add a page." This is not a directory, like Yahoo!, where the information provider has to submit information and has to prove they are who they say they are. No, all you are doing is saying, "Here's a URL. You wonderful dumb, blind crawler, please go and check this out." AltaVista doesn't believe a word you say except that there's probably a URL out there. It will go and check and bring back whatever text it finds at that address. All it knows is what it found from that page; not what you told it.
If you give it a URL for a page that doesn't exist, it will come back with Error 404, which means there is no such page (not that the crawler couldn't get there due to some transient problem, but that a page with that address does not exist on that server). Then if that page was in the index, it will remove that page from the index the next day.
This is very important from several perspectives. Say you have changed the directory structure at your Web site. First, you should go to AltaVista and Add a Page for all the old addresses to remove the old information from the index. Then you should Add a Page for all the new addresses. Also, if you made an embarrassing typo or posted a document that you shouldn't have, and removed that page from the Web, you can Add URL for that page at AltaVista to make sure the embarrassing information is not perpetuated in the index.
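The distinction between a page that is definitely gone and one that merely couldn't be reached can be sketched as a simple decision rule. This is an illustration of the behavior described here, not AltaVista's code; the function name, the returned labels, and the treatment of every status other than 200 and 404 as transient are assumptions.

```python
def index_action(status_code):
    """Decide what to do with a submitted URL, given the HTTP status of the fetch.

    `status_code` is None when the server could not be reached at all.
    """
    if status_code == 404:
        return "remove"       # the server says no such page exists
    if status_code == 200:
        return "index"        # page fetched: index whatever text was found
    return "retry later"      # timeouts, server errors, etc.: assumed transient

for status in (200, 404, 503, None):
    print(status, "->", index_action(status))
```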
Then, as I mentioned in the search portion of this tutorial, you could use the command link: followed by a Web address to find out what Web pages have hyperlinks to a particular page or a particular URL. So you can use AltaVista to search for link: for each and every page that you have moved or removed. Then you can send email to the Webmaster of sites that have links to those pages that you moved and ask them to update their links.
Also, very often, when you do a search at AltaVista and it comes back with hundreds of thousands of matches, maybe two or three out of the first ten don't exist anymore. And you get upset, "Why don't they keep that thing up to date?" Well, it's up to you to keep it up to date. We talked about thousands of threads bouncing around among hundreds of millions of pages. Don't expect that in your lifetime this thing will ever be perfect. This is pragmatic. Their job is not to produce perfection. If it were, the company would go out of business. It would simply be far too costly to keep the index anywhere near perfection by automatic means. But whenever you find an instance of a page that doesn't exist anymore, if you simply click on Add a URL and type that URL in, then the next day, that page will be removed from the index. If we can get millions of people doing that on their own, then the index will be kept up-to-date pretty well. That's a lot better than writing some fancy new code and putting 50 more machines out there.
That's my seat-of-the-pants way of trying to deal with that situation.
There is another interesting use of that capability.
In the early days of the Web, quite often you would find Web sites that consisted of nothing but lists of hyperlinks to other pages. People would post their carefully constructed lists of pages on particular topics. The problem was that those lists soon got out of date -- because so many new pages were added to the Web every day and so many URLs went out of date as Webmasters moved files around. So now, rather than scramble to try to keep such a list current, you can try to construct a query at AltaVista which produces similar results, and put a link from your page to that particular query, so visitors at your site who click on that link will get the latest results.
I could put a whole set of AltaVista search links on my page, saying, "if you are interested in X, click here". I could have carefully constructed all these great queries that are going to get people exactly the kinds of things they want out of newsgroups and Web pages. I just cut and paste the unique URLs that those searches generate at AltaVista and make hyperlinks from my pages. Then visitors to my site don't have to learn all the commands to take advantage of the power of AltaVista. I've done the work for them.
Depending on your interests, you might have a dozen or two dozen links at your pages that are launching particular searches at AltaVista that would give valuable information to the kinds of people you are trying to attract to your site. So you are providing a useful service to your audience, and you haven't spent a dime. You've spent a little time maybe doing the very kinds of searches that you'd want to do anyway.
You can also do a search for host: followed by a domain name (like host:samizdat.com) to see every page from a particular Web site that is in the index. And you can bookmark the unique URL for that search or make a link from your pages that will automatically launch that search.
That means that if you make sure that all your pages are indexed at AltaVista -- religiously doing an Add URL whenever you add a new page or make a significant change to an existing page, then AltaVista can serve as the search engine for your Web site -- at no cost.
The simplest way to do that is to add a hyperlink at your site, with a simple explanation: "Click here to launch a search at AltaVista and to search for only pages at this site. Just add your query after what you see there in the box."
If your domain name was samizdat.com, the Simple Search query would be +host:samizdat.com
And when people clicked on the link associated with that query, they would connect to AltaVista and see those words already entered in the query box. They just need to add the rest of the query to perform a search that is limited to your site.
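The approach described here is simply to copy the URL that AltaVista generates when you run the search yourself. If you would rather see how a link like that is put together, here is a sketch that uses a placeholder address. The address http://search.example/query and the parameter name q are hypothetical, not AltaVista's actual URL format; for the real thing, copy the URL from your browser after running the search.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint and parameter name: NOT AltaVista's real URL format.
SEARCH_URL = "http://search.example/query"

def site_search_link(domain):
    """Build a link that opens the search page with +host:<domain> pre-filled."""
    return SEARCH_URL + "?" + urlencode({"q": f"+host:{domain}"})

link = site_search_link("samizdat.com")
print(link)

# Round-trip check: decoding the link recovers the original query text.
print(parse_qs(urlsplit(link).query)["q"][0])
```

The encoding step matters because a plain "+" inside a URL's query string is normally read as a space, so the plus sign in +host: has to be escaped to survive the trip.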
By paying attention to how crawlers and search engines work, you could get more traffic at far less cost.
First, sites that require any kind of registration or password lock out search engines. Keep in mind that a web crawler cannot fill out a form of any kind. If you need to fill out a form to get to the next page, the crawler halts right there. If you would like to gather information about your users/members but would also like your pages to be indexed, make the registration optional.
Similarly, a crawler cannot get content from a database, because it cannot fill out a form.
If the content of your database is largely text, you might consider creating plain text static HTML pages with that same content, so it can be indexed and found. (There's a software development tool kit that you can get with the intranet version of AltaVista Search software. It could help you automate the process of turning database content into static, indexable pages.)
Dynamic pages also block web crawlers. While it's great to give visitors to your site unique experiences, tailored to their needs, the techniques you use to do that could stop search engines from indexing your content and hence could greatly reduce your potential traffic. Dynamically generated pages are created on the fly from a variety of elements held in databases. Typically such pages have a question mark (?) in the URL. When a search engine crawler arrives at such a page, it captures the content but halts immediately, and will not follow the links, because it sees ahead of it an infinite number of pages -- a black hole that would bring it to a crash.
Active Server Pages (.asp) with question marks in their URLs (indicating that the page is a script for the construction of a page, rather than just static content) fall into this category.
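One way to audit your own site for this problem is to list the links whose URLs carry a query string. The following is a minimal Python sketch, not a tool that AltaVista provides; the function name and the example.com addresses are made up for illustration.

```python
from urllib.parse import urlsplit

def looks_dynamic(url: str) -> bool:
    """Return True if the URL has a non-empty query string (the part after '?')."""
    return urlsplit(url).query != ""

# Hypothetical links gathered from your own pages.
sample_urls = [
    "http://www.example.com/catalog.asp?item=42",
    "http://www.example.com/catalog/item42.html",
]

dynamic_urls = [url for url in sample_urls if looks_dynamic(url)]
for url in dynamic_urls:
    print("A crawler may stop following links here:", url)
```

Any address this flags is a candidate for a static copy that crawlers can reach.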
By the way, this is one reason why nobody can say how many pages there are on the Web, total. Every dynamic site has an infinite number of pages. And how many millions of dynamic sites are there?
If you have information inside frames, that will probably prove to be a hindrance, but is not an absolute barrier.
AltaVista indexes the outside of the frame as a distinct page. It will also index each pane of the frame window as a separate page. That means that if the content matching a query is in a pane, visitors clicking on those links will see the pane and only the pane -- not the full page as it was designed. So if you want visitors from search engines to experience your pages the way they were intended to be seen, you should have non-frames as well as frames versions of those pages, and submit the non-frames versions with Add URL.
AltaVista also can't index text that is embedded in graphics. Have you ever been to a site that has a huge picture that takes minutes to paint across your screen at dial-up modem speeds and all the words are embedded in that picture? Search engines simply cannot "see" the text unless the Webmaster put ALT text behind the picture, describing it and listing those important words. But pictures, as pictures, can be indexed for Image search at AltaVista.
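If you want to check your own pages for pictures that have no ALT text, a short script can do the looking for you. This is a minimal sketch using Python's standard html.parser module; the class name, the function name, and the sample page are invented for the example.

```python
from html.parser import HTMLParser

class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> tag that lacks non-empty ALT text."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt") or ""
        if not alt.strip():
            self.missing.append(attributes.get("src") or "(no src)")

def images_missing_alt(html_text: str) -> list:
    finder = MissingAltFinder()
    finder.feed(html_text)
    finder.close()
    return finder.missing

# Hypothetical page: one picture without ALT text, one with it.
page = """
<html><body>
<img src="banner.gif">
<img src="logo.gif" alt="Company logo">
</body></html>
"""
print(images_missing_alt(page))
```

Each file name it reports is a picture whose words a full-text crawler has no way to read.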
Text that appears in multi-media files (audio and video) cannot be indexed. But those same files can be indexed at AltaVista for Multimedia search.
Information that is generated by Java applets or in XML coding cannot be indexed.
Today, Acrobat files cannot be indexed either. But technology exists that will enable AltaVista to convert those files to indexable form. That's used in the intranet version of AltaVista Search software and may be implemented sometime soon at the public search site. But if you need to be found now, you should provide plain HTML versions of those pages and point the crawler to those (with Add URL).
Exceptionally large pages also present a problem at AltaVista. As a pragmatic compromise, intended to help optimize the performance of AltaVista, they fully index the first 64 Kbytes of text on any single page. They will harvest the hyperlinks from the whole document for following up later, but they will only index the first 64 Kbytes. So if you want to post an entire book, it's best to break it up into chapters, and then all the text can be indexed.
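Breaking up a long text can be automated. The sketch below splits a text at paragraph breaks so that each part stays within the limit. It assumes that the limit is measured in bytes of text and that 1 Kbyte means 1024 bytes; the function name and the output file names are hypothetical.

```python
MAX_INDEXED_BYTES = 64 * 1024  # the 64 Kbyte limit described above (assumed to be 1024-byte Kbytes)

def split_for_indexing(text: str, limit: int = MAX_INDEXED_BYTES) -> list:
    """Split text at blank lines into parts of at most `limit` bytes each.

    A single paragraph that is itself larger than the limit is kept whole
    as its own part rather than being cut in the middle.
    """
    parts, current, current_size = [], [], 0
    for paragraph in text.split("\n\n"):
        size = len(paragraph.encode("utf-8")) + 2  # +2 for the blank-line separator
        if current and current_size + size > limit:
            parts.append("\n\n".join(current))
            current, current_size = [], 0
        current.append(paragraph)
        current_size += size
    if current:
        parts.append("\n\n".join(current))
    return parts

# Stand-in for the text of a whole book.
book = "\n\n".join(["A paragraph of the book. " * 40] * 200)
for number, part in enumerate(split_for_indexing(book), start=1):
    print(f"chapter{number}.html would hold {len(part.encode('utf-8'))} bytes of text")
```

Splitting at real chapter boundaries reads better for visitors; the size check is only a safety net.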
Comments, such as <!--change this every Friday-->, aren't indexed at all. Those are intended as private communications, not viewable by Web site visitors, except by using View/Page Source.
Also, consider technical factors. If a site has a slow connection, it might time out for the crawler. Very complex pages, too, may time out before the crawler can harvest the text.
If you have a hierarchy of directories at your site, put the most important information high, not deep. Search engines will presume that the higher you placed the information, the more important it is. And crawlers may not venture deeper than three or four or five directory levels.
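To find pages that sit deep in your hierarchy, you can count the directories in each address. This is a rough Python sketch; the function name, the example.com addresses, and the cutoff of three levels are assumptions chosen for illustration. It also assumes that a path not ending in a slash names a file.

```python
from urllib.parse import urlsplit

def directory_depth(url: str) -> int:
    """Count the directories between the site root and the file named in the URL."""
    path = urlsplit(url).path
    segments = [segment for segment in path.split("/") if segment]
    # The last segment is taken to be the file itself unless the path ends with a slash.
    if segments and not path.endswith("/"):
        segments = segments[:-1]
    return len(segments)

# Hypothetical addresses from your own site.
site_urls = [
    "http://www.example.com/index.html",
    "http://www.example.com/products/widgets/blue/large/specs/sheet.html",
]
for url in site_urls:
    if directory_depth(url) > 3:  # hypothetical cutoff
        print("Possibly too deep for some crawlers:", url)
```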
In addition, it helps to have a good central page with good navigation to all the other pages at your site. Make it easy, not hard, for the crawler to find all your pages by following the links internal to your site.
Above all remember the obvious -- full-text search engines index text. You may well be tempted to use fancy and expensive design techniques that either block search engine crawlers or leave your pages with very little plain text that can be indexed. If search is important to you (and it should be -- as the best and least expensive way to attract targeted traffic to your pages), you should create a parallel set of plain text pages, with links from those to your full-blown flashy personalized Web experience, and submit the plain text pages for indexing.
Your rule of thumb should be to have at least one full set of your content available in a form that the blind can read. The blind are some of the best users of the Internet today. They use text-only browsers and text-to-voice converters, and they are able to navigate very well unless people put up barriers. The same kinds of barriers that stop the blind also stop web crawlers. If you need to have a picture, be sure to label everything clearly with ALT text in the background, to explain what a sighted person would see. And by designing your site to accommodate the needs of search engine crawlers you will also probably make sure that your site complies with the provisions and the spirit of the Americans with Disabilities Act.
Some barriers to being indexed derive from the misbehavior of a handful of webmasters who have tried a number of clever tricks to fool search engines into ranking their pages high on lists of matches, and into showing them as matches when the content of their pages is quite different from what the visitor is looking for. That kind of behavior is known as "spam." Left unchecked it would degrade the value of the index and be a nuisance for all.
The logic that leads people to try such tricks is rather bizarre. "I figure everybody searches for the word 'sex.' I don't have any sex at my site, but I want people to stumble across my site. So I'm going to put the word 'sex' three thousand times as comments. And any time that anybody searches for 'sex' that will come up first."
People have actually tried that. They have tried doing the same kind of thing in the wallpaper (the backgrounds) of their Web pages. They have also created page after page of text that is in the same color as the background color so visitors won't see the words, but search engine crawlers will. They have tried everything imaginable to fool search engines.
But AltaVista doesn't index comments at all, because, by definition, that is information that is not meant to be public and hence should not be indexed. Also, AltaVista does not reward Web pages for useless repetition. Actually, AltaVista only counts to two. It was designed that way from the beginning. They didn't want the Web to get totally corrupted with people repeating words uselessly. (To get a sense of how bad that could get, look at a phone book, cluttered with names that begin AAAA... from people trying to appear near the front of the book. And imagine how cluttered the Web would be if people thought that repetition actually got them to the head of the list.)
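The effect of "counting to two" can be pictured with a few lines of Python. This is only an illustration of the principle, not AltaVista's ranking code; the function name and the cap parameter are invented for the example.

```python
from collections import Counter

def capped_term_counts(text: str, cap: int = 2) -> dict:
    """Count each word on a page, but never credit a word more than `cap` times."""
    counts = Counter(text.lower().split())
    return {word: min(count, cap) for word, count in counts.items()}

# A page that repeats one word again and again gains nothing past the second use.
print(capped_term_counts("sex sex sex sex books"))
```

Under a scheme like this, the three-thousandth repetition of a word is worth exactly as much as the third: nothing.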
Some of the many symptoms of spam that AltaVista looks for are
This means that if being found by way of search engines is important to your business, you should be very careful about where you have your pages hosted. If the hosting service also hosts spammers and pornographers, you could wind up penalized or excluded simply because the underlying IP address for that service is the same for all the virtual domains it includes.
The simple rule of thumb is that content counts, and that content near the top of a page counts for more than content at the end. In particular, the HTML title and the first couple lines of text are the most important part of your pages. If the words and phrases that match a query happen to appear in the HTML title or first couple lines of text of one of your pages, chances are very good that that page will appear high in the list of search results.
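To see what a search engine would weigh most heavily on one of your own pages, you can pull out the HTML title and the opening words of the visible text. Here is a minimal Python sketch using the standard html.parser module. The class and function names are made up for this example, and the 25-word cutoff is only a stand-in for "the first couple lines"; it is not a figure published by AltaVista.

```python
from html.parser import HTMLParser

class TopOfPage(HTMLParser):
    """Capture the HTML title and the visible text of a page, in order."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_chunks = []
        self._in_title = False
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_chunks.append(" ".join(data.split()))

def top_of_page(html_text: str, max_words: int = 25):
    """Return (title, first max_words words of visible text)."""
    parser = TopOfPage()
    parser.feed(html_text)
    parser.close()
    words = " ".join(parser.text_chunks).split()
    return parser.title.strip(), " ".join(words[:max_words])

# Hypothetical page to inspect.
page = ("<html><head><title>Resume: technical writer</title></head>"
        "<body><h1>Resume</h1><p>Technical writer seeking editing work.</p></body></html>")
title, opening = top_of_page(page)
print("Title:", title)
print("Opening text:", opening)
```

If neither string says plainly what the page is about, that is the first thing to fix.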
Say you want to put your resume on the Web. Keep this rule in mind. You don't put your name first. You aren't trying to be found by people who already know you. You want to be found by people who have never heard of you. So don't waste any letter in the HTML title on your own name. The first word should be "resume". After that, list your main qualifications and the kinds of jobs that you are looking for. Put the same kinds of things in the first couple lines of text. That's what will come up by default as the description in the match list, and it's also an important position for ranking.
AltaVista bases its ranking on both static factors (a computation of the value of a page independent of any particular query) and query-dependent factors.
Query-dependent factors include:
I would strongly suggest that the people who are in charge of branding rules at large corporations take a good look at how AltaVista works and find a compromise that gives them the right look-and-feel but doesn't chase traffic away. Otherwise, you have to spend money trying to attract that same traffic.
For instance, you can add to each of your pages a "description" metatag that allows you to enter a couple lines that describe that particular page. If you have such a metatag, those words will appear as the description for your page in results lists at AltaVista, instead of the default, which is the first couple lines of text.
But what people don't realize is that for ranking purposes, the HTML title and first lines of text are still very important. Metatags do not take precedence. They are text also. So if your page is poorly designed, with random words associated with graphics, and with a meaningless HTML title, your description metatag is not going to help your ranking. You would be better off with a page that clearly stated what it was about in the title and the visible content.
You can also have "keyword" metatags. "Keyword" is a misnomer. Many webmasters misunderstand the purpose of such metatags, and presume that AltaVista acts like a database, and that these are the only words on a page that are important for search. On the contrary, AltaVista indexes every word on every page, and every word (and the order in which they appear) is important. The purpose of the "keyword" metatag is simply to allow you to add synonyms -- words that are appropriate for what's on your page, that describe what's there, but that do not actually appear on that page. One of the best uses for keyword metatags is for foreign translations of the main words on your page, so, for instance, somebody searching in French will find that page.
Many Webmasters think that by putting words in keyword metatags they are getting some advantage in the ranking or making up for the fact that their pages have very little text content -- just flashy effects. But, no, those words are worth little more than any other word in the main text of the page. There is nothing "key" about it. You have simply added a few more words to the page in a place that is not visible.
Why aren't metatags given precedence? Consider the opportunity for abuse/spamming. What matters most to users of AltaVista is the actual content that is visible on Web pages, not the marketing-oriented notes that have been added in metatags. And text that appears in the title and the first few lines is likely to be closely related to the main point of the page.
The basic rule of thumb for metatags is that plain English is always best. Metatags are a band-aid to help you deal with pages that don't clearly state what they are about in clear text, right up front. Do it right to begin with, and you don't need metatags at all, and you'll get far better results in terms of search engine traffic than you would if you depended on metatags.
Keep in mind that the Internet is a different place than what we are normally used to. One of the worst things you can possibly do is take existing brochures and other material and simply put it unedited up on the Web. Marketing brochures are not written to inform. They are written to tease. They are not supposed to answer questions. They are supposed to make people ask more questions and ask for the next brochure and then ask again and then talk to a salesperson. If you give an answer, they are afraid that the salesperson will never get in the loop.
The Web is different. People on the Web want answers. They don't want to have to click forever to get those answers. And if they don't get the answer from you and get it quickly, they are going to go to somebody else and get it there.
So you want to write material for the Web in plain clear English. This is both for purposes of getting properly indexed and ranked and for properly serving your audience.
The crawlers from AltaVista and other major search engines obey a Robot Exclusion Standard. If your Web site is on your own servers, you just create a simple little text file and name it "robots.txt". You can be very precise about what you want to exclude. You can exclude a particular crawler or all crawlers from your entire site or from particular directories or from particular files. If your site is hosted at an ISP, you'll need to ask the ISP's webmaster for help with this.
In any case, if you want to use robot exclusion, keep in mind that Web server software often comes with a directory indexing feature. If your software has that feature and it happens to be on, then any crawler that comes to your site could grab everything right out of the index, even if you had set up robot exclusion. So the first thing you have to do is shut off the directory indexing feature.
To exclude your site from all web crawlers create a file named robots.txt that states:
User-agent: * # directed to all robots
Disallow: / # exclude the entire site
To exclude just the AltaVista crawler (known as "Scooter") your file should read:
User-agent: scooter # AltaVista web page search
Disallow: / # exclude the entire site
To exclude your site's pictures and multimedia files from AltaVista's image, video and audio clip index:
User-agent: vscooter # AltaVista Image Search
Disallow: / # exclude the entire site
To limit the exclusion to a particular directory or file, put that address after Disallow: For instance,
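Suppose the pages you want to protect live in a directory named /answers/ (a hypothetical name, as are the example.com addresses). Before relying on a robots.txt file, you can check what it actually excludes. The sketch below does that with Python's standard urllib.robotparser module; it shows how a crawler that obeys the standard should read the file, not how Scooter itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps Scooter out of one directory only.
ROBOTS_TXT = """\
User-agent: scooter   # AltaVista web page search
Disallow: /answers/   # hypothetical directory to keep out of the index
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for url in ("http://www.example.com/answers/quiz1.html",
            "http://www.example.com/questions/quiz1.html"):
    verdict = "allowed" if parser.can_fetch("scooter", url) else "excluded"
    print(url, "->", verdict)
```

To exclude a single file instead, put its full path (for example, a hypothetical /answers/quiz1.html) after Disallow: in the same way.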
You can also use Metatags to exclude crawlers from particular pages.
For example, if you add the following line to the header of one of your Web pages, the crawler will not add this page to the index and will not follow the links it finds there.
<META name="robots" content="noindex, nofollow">
AltaVista's crawler currently recognizes these metatag exclusion options:
NOINDEX prevents anything on the page from being indexed.
NOFOLLOW prevents the crawler from following the links on the page and indexing the linked pages.
NOIMAGEINDEX prevents the images on the page from being indexed but the text on the page can still be indexed.
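To confirm which of these options a given page actually carries, you can read them back out of the HTML. This is a minimal sketch using Python's standard html.parser module; the class and function names are invented for the example, and it only reports what the tag says, not what any particular crawler will do.

```python
from html.parser import HTMLParser

class RobotsMetaReader(HTMLParser):
    """Collect the options listed in a page's <META name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.options = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.options.update(
                option.strip().upper() for option in content.split(",") if option.strip()
            )

def robots_options(html_text: str) -> set:
    reader = RobotsMetaReader()
    reader.feed(html_text)
    reader.close()
    return reader.options

page = '<html><head><META name="robots" content="noindex, nofollow"></head></html>'
options = robots_options(page)
print("May be indexed:", "NOINDEX" not in options)
print("Links may be followed:", "NOFOLLOW" not in options)
```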
Excluding search crawlers from particular files can give you a way to assert some control over the visitor's experience at your site. For instance, if you wanted to hold a trivia contest, you could put robot exclusion on the pages with the answers; so people wouldn't be able to find those pages randomly -- they'd only find the pages with the questions. If you want to do a one-two punch like that, robot exclusion lets you.
This is a principle called "flypaper," which I discovered at my little Web site. It comes from the combination of search engines and lots of personal Web pages.
I started getting two or three email messages a week from old friends I hadn't heard from for twenty to thirty years. At first, I thought, "Gosh. Amazing. They're looking for me." Then I started thinking, "Why would they be looking for me?"
I hardly knew these guys. I wouldn't look for them.
I pushed back. I found out they weren't looking for me. They were looking for themselves.
They went to AltaVista and searched for their own names; and there's so much stuff at my site that anybody who had any contact with me over the last thirty years or so is mentioned there. They found themselves, and they wrote to me.
The more I talked to people, the more it became clear to me that this is a matter of human nature mixing in strange ways with technology.
Human nature is that people first look for themselves or for their friends or for things that are near and dear to them. And only after they have done that do they do research.
We designed this site for research, but it's used for these other purposes, probably far more.
But you'd never learn that by checking the logs at AltaVista. Because the logs will give you the top terms that people have searched for. Well, Joe Blow from Minneapolis looking for his own name won't appear very high on that list. But that's the first thing he'd look for. And lots of people are doing that.
So it occurred to me, there's a business model there. If people act that way, I can build a better mousetrap.
Instead of going out with flyswatters to try to find things, I'll create flypaper to get them to come to me.
Say, for instance, you have a business proposal. You've been trying to get through to someone who you know uses the Web. This guy doesn't answer your email, doesn't return your phone calls. You have a really good message for him.
Well, create a Web page. The first word in the HTML title and the first line of text -- the guy's name. Then add in those same places everything you can think of that is near and dear to him. Then you add the message that you really want him to see.
You don't need to have hyperlinks to this page from anywhere in the world. You just go to AltaVista and ADD URL.
The next time that guy searches for himself, he finds you.
Believe me, it really changes the whole dialogue. He's coming to you; you didn't go to him.
There are two kinds of flypaper. That's what I call "targeted flypaper." Another kind of flypaper is "generalized flypaper."
I put anything and everything at my site. One of the items I posted is a list of every book I've read for the last 39 years -- since I was in junior high school. I'm obsessive. I've kept such a list. It was in electronic form. It took me just a few minutes to post it on the Web. So why not? It's my site. Who cares? If somebody wants to come to it, fine.
It turns out that that is the most trafficked page at my site. And I get email from authors, their editors, and their agents. They are searching for themselves and then for the stuff that's near and dear to them.
I get very good dialogues going with them and with readers who particularly like those same books.
If you start thinking in that sort of vein, then the kinds of material you put on your Web site will be very different from what you are probably doing today.
I have a few examples here of bizarre things that have happened to me.
Back in the early 1970s, I self-published a book of mine called The Lizard of Oz. That was back when a lot of people were playing around with self-publishing and small press publishing. All of a sudden, thanks to photo-offset, it had become cheap to print; and we made the mistake of thinking that printing was the same as publishing. So we printed all these books and went around to small-press bookfairs and met a lot of people and had a lot of fun and sold very few books. You couldn't get distribution; you couldn't get them into book stores.
So this book of mine had been gathering dust in the basement for over 20 years. And along comes the Web and all of a sudden distribution is free. So one of the first things I did when I got my own free little Web space was to put up the full text of just about everything I had ever written.
Shortly after I put up The Lizard of Oz, I got email from a little company in San Francisco that does interactive CD ROMs for kids. A brand new company. They couldn't afford the time or the money to have acquisition editors out looking for material. So they were using AltaVista to search the Web for stories that might be useful for their product line. They found my book. They loved it. And two weeks later, we had a contract. They still haven't come out with the CD ROM. I hope they do it soon. But it was a very good contact, and one that I would have never made. I could have never found them, because they were new and wouldn't have been listed anywhere. But they found me. And because they found me, it was a very quick and different kind of dialogue. I wasn't trying to sell the book to them. They were trying to convince me to let them publish it.
The next example is even more bizarre.
I wrote a movie script a long time ago and nothing has ever happened with it. I posted it at my site.
Now this didn't lead to a business deal. I'm not going to be famous tomorrow.
This was an eye-opener though.
I got email from a producer in Iceland. We went back and forth, and I sent him the script. He was interested, but we didn't arrive at a deal.
But would you ever have imagined that there were producers in Iceland? If you were trying to sell a script, would that have been on your list of places to try?
The next example is totally off the wall. This was due to my list of books.
I got email from someone who collects Garry Trudeau books (the cartoonist who does Doonesbury). He saw that I had read Trudeau's first book, which he had self-published back when he was an undergraduate at Yale. I happened to have been at Yale at the time. I picked it up at the Co-op. It had been gathering dust ever since.
I thought, "Okay. I'll sell the book. I don't have any real need for it. I haven't looked at it in nearly thirty years."
Well, the guy took a further look at my Web site. He saw that my daughter is into acting. She goes to Sarah Lawrence. This year she is doing her junior year in London doing drama. And at the time we were doing this correspondence, she was in Los Angeles for the summer, staying with my sister, trying to get summer jobs in acting. It turned out the guy who had done the query was Lee Aronsohn, who at various times has been executive producer and writer of a number of popular TV series, such as Cybill and Grace Under Fire. He, out of the blue, suggested that in exchange for the book, my daughter could get an audition for a show.
She didn't get a part, but she got an audition with someone who has won Emmys for casting. She got to meet the people. She got a sense of how the business worked. This was an invaluable kind of experience for her.
There is no way in the world that I would have ever come up with that as a business model.
The general message from that is -- don't limit yourself to your own imagination.
You have ideas sitting in your drawers. You have works in progress. You have things that are almost there. Don't you edit yourself out of the possibility.
If you have something that might work, that might fly, that somebody might be interested in, put it up and get it well indexed and see what kind of response you get.
You might be afraid that someone will "steal" your idea.
Remember that there is such a thing as copyright law. When I put a book up, like The Lizard of Oz, I include a copyright notice. And I say, "Permission is hereby granted to disseminate this in electronic form for non-profit purposes or for your own purposes. If you wish to do something commercial with it or to print it, then contact me. Here's my email address."
Now what protection does that give me? Well, if somebody is going to do a commercial edition of that thing, and they are going to make a lot of money out of it, I'm going to hear about it, and I'll sue them. And that would be fine.
If somebody wants to plagiarize... If I were paranoid, I could use AltaVista periodically to search the Web for selected paragraphs of my works. I don't do that all the time, but I could.
The more frequent question I get from an audience like this is the fear that somebody will take your idea. I'm talking about writing fiction for kids and copyrighting that. But if you have an idea for a new business, what's going to happen if somebody steals that idea?
Take my advice with a grain of salt. But I believe that if the world is ready for an idea, more than one person is likely to have it. And your best protection is to let the world know that you have this idea so that you then become part of the dialogue as it goes to its next stages. And you begin the dialogue and take it to the next step. You get your writings about this subject well indexed. You get in touch with other people with similar interests, and maybe you build a company out of the people you pull together that way.
But if you take that idea, as we always have in the past, and keep it in the bottom drawer, afraid that somebody is going to steal it, then two years later you are going to read in the Wall St. Journal about a $10 million company that's doing the same thing that you had the idea for.
If you are talking about software code and the kind of invention that typically involves patent lawyers, that's another matter and you should get legal advice. But if you have a business idea, I'd say, tell the world about it; don't hide it. Tell them as loudly as possible, so everybody knows that you told them about it. And then you can become part of the on-going dialogue.
The underlying principle of this advice is the realization that the Web is an awful place to put totally polished, finished text. We see one after another that on-line magazines are too expensive to sustain themselves, and they fold. But the Web is a great place to put ideas and works in progress.
When you put finished text on the Web, that's like saying, "I have just returned from the mountain. I have seen it. Don't bother to write me. This is the final answer."
What you want to say instead is, "I think this is headed in the right direction. I might be 80% there, or maybe 50%. But I feel good about this. I want to talk to people out there and take this to the next step. Please send me your reactions."
Then when people send their reactions, treat them with respect. And with their permission, if they've written something cogent, whether they agree with you or disagree with you, add it to your document as a letter to the editor and then go back and index again at AltaVista and let that document grow.
I mentioned my Halloween article. That's what I did with that. It's just a little article -- a couple pages long. Now it rambles on and on and on, because for three years I've had people sending me messages saying they agree or disagree strongly, and they often write very well about it. I've added those comments as letters to the editor, and some of it is very informative and some is well-argued opinion. But your one little static document suddenly becomes the start of a social experience, without needing any fancy software to do it.
Take those ideas that you have been editing out, that you've been chopping off at the legs, and give them a chance. Let the world find them. Let the world do something with them.
To me that's what's exciting -- when you put
the pieces together: your own creativity and the power of a
search engine like AltaVista.