| T.R | Title | User | Personal Name | Date | Lines |
|---|
| 4577.1 | "too many URLs..." | LGP30::FLEISCHER | without vision the people perish (DTN 381-0426 ZKO1-1) | Wed Mar 26 1997 12:58 | 70 |
| This note touched upon a pet peeve of mine regarding the
AltaVista public search site. I recently gave them the
following feedback about a problem in adding a URL:
Date: Fri, 21 Mar 1997 17:12:05 -0800
From: Bob Fleischer <fleischer@lgp30.zko.dec.com>
Organization: Digital Equipment Corp NSIS
To: suggestions.altavista@pa.dec.com
Subject: "too many URLs..."
My son is developing a web site, and he wanted to get it listed on a
variety of search engines and web directories. The URL is
http://www.tiac.net/users/rjf/fizbin.html .
He was able to get it listed by Lycos, Excite, InfoSeek, Yahoo, and
several other services suggested by SubmitIt! (SubmitIt was suggested by
AltaVista, in fact).
However, he was not able to get AltaVista to accept his URL -- the
response was "Too many URLs at that site have been submitted---sorry."
Such a rejection wouldn't have hurt half as bad if it came from a site
which had no claim to being the biggest and fastest site built with
technology having the highest capacity -- but it seems absurd that
AltaVista of all sites should be the only one to say "I'm sorry, I've
reached my capacity!"
I'm sure that there are a lot of submissions from a site such as
www.tiac.net -- they are, after all, a large ISP hosting thousands of
users and businesses. No other search engine had a problem with that,
however.
--
Bob Fleischer
Digital Equipment Corporation, Network and Systems Integration Services
110 Spit Brook Road (ZKO1-1/J33), Nashua, NH 03062 (603) 881-0426
fleischer@lgp30.enet.dec.com http://www.tiac.net/users/rjf/
and their response:
Date: Mon, 24 Mar 1997 09:09:07 -0800 (PST)
From: Alta Vista Support <avsup@av-ops.pa.dec.com>
To: Bob Fleischer <fleischer@lgp30.zko.dec.com>
Subject: Re: "too many URLs..."
From ttress@pa.dec.com Tue Mar 18 09:01:47 1997
Date: Tue, 18 Mar 1997 09:00:51 -0800
From: Ty Tressitte <ttress@pa.dec.com>
To: avsup@av-ops.pa.dec.com
In order to help ensure our users retrieve a fair and unbiased selection
of relevant pages, we have had to limit submission of pages from a
provider, site or individual user that fall into the following categories:
1. Submissions that are designed to limit our ability to rank pages
accurately.
2. Duplicate and near duplicate pages that both utilize excessive storage
and reduce the number of relevant search results.
3. Excessive amounts of pages submitted from a given provider, site or
individual in a day.
You can try submitting your page in the late afternoon when the
submission limits are reset.
Regards,
AltaVista Support
|
| 4577.2 | | BUSY::SLAB | Crazy Cooter comin' atcha!! | Wed Mar 26 1997 13:01 | 7 |
|
Yeah, I just tried submitting
http://www.icanect.net
and got the same "too many URL's" error.
|
| 4577.3 | | MKOTS3::HAHN | SBU Americas Technical Support Group | Wed Mar 26 1997 13:10 | 7 |
|
Thanks for the quick replies.
So I guess I need to send an e-mail to AltaVista Support and ask them to reset
the submission limits for www.icanect.net?
|
| 4577.4 | | HYDRA::VANORDEN | | Wed Mar 26 1997 17:35 | 32 |
|
>So I guess I need to send an e-mail to AltaVista Support and ask them
>to reset the submission limits for www.icanect.net?
Yes.
What is confusing is that your customer did not say they received the
"too many URLs at this site" error when they originally submitted their
URL. It could be that the URL was accepted, but for some reason could
not be fetched (network busy, server was down due to the renovation of
the site, etc.). Under those circumstances the URL is not added.
The AltaVista Search Site has had problems with spamming...submitting a
large number of pages to the index in the hopes that it will increase your
appearance and ranking. I suspect that a/v views this as an ethical
issue rather than a performance issue. As they put it,
"Left unchecked, this behavior would make web indexes worthless". It
seems they attempt to solve this problem by putting a quota on URLs,
and expect scooter to find the other pages. Unfortunately this method
does not take into consideration the many unique web pages
which reside in directories on the same HTTP server (such as
www.icanect.net and members.aol.com). It is also possible that so
many attempts have been made to add the
http://www.icanect.net/kabbalah/ site that it now looks like an attempt
to spam, and AltaVista is ignoring the site.
You need to send mail to AltaVista Support alerting them of the problem
so they can bypass the quota on your customer's behalf.
Donna
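To illustrate the quota problem described above, here is a minimal sketch, assuming the limit is counted per host name and cleared once a day. The class name, the limit of 2, and the alice/bob URLs are made up for illustration; this is not AltaVista's actual code.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class HostQuota:
    """Hypothetical per-host daily submission quota (illustration only)."""

    def __init__(self, limit_per_day):
        self.limit = limit_per_day
        self.counts = defaultdict(int)  # host name -> submissions accepted today

    def submit(self, url):
        # The quota key is the host name only, so every user directory
        # on a shared server draws from the same allowance.
        host = urlsplit(url).hostname
        if self.counts[host] >= self.limit:
            return "Too many URLs at that site have been submitted---sorry."
        self.counts[host] += 1
        return "accepted"

    def reset(self):
        """Assumed to run once a day, when the limits are reset."""
        self.counts.clear()


quota = HostQuota(limit_per_day=2)  # limit chosen arbitrarily for the example
results = [
    quota.submit("http://www.tiac.net/users/alice/"),  # hypothetical user
    quota.submit("http://www.tiac.net/users/bob/"),    # hypothetical user
    quota.submit("http://www.tiac.net/users/rjf/fizbin.html"),
]
```

With these assumptions the third submission is refused even though it comes from a different user, which is the side effect of keying the quota on the HTTP server rather than on the directory.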
|
| 4577.5 | | VAXCPU::michaud | Jeff Michaud - ObjectBroker | Wed Mar 26 1997 20:45 | 8 |
| > Yeah, I just tried submitting
> http://www.icanect.net
> and got the same "too many URL's" error.
What are you talking about, Shawn? 1pm (when you posted your
note) is not late afternoon, if that's what you are referring to.
And it's especially not late afternoon if it's Palo Alto time
(add 3 hours for ET).
|
| 4577.6 | | BUSY::SLAB | Don't drink the (toilet) water | Thu Mar 27 1997 01:25 | 6 |
|
If I remember right, I posted that reply before reading the
previous one with the time recommendation.
I guess I should have written it quicker.
|
| 4577.7 | Someday when AV Forum merges with Notes, maybe we'll hear from AV employees... | CIRCUS::GOETZE | Tibetan karma not Made in China | Thu Mar 27 1997 16:37 | 23 |
| The "problem" with AV not indexing all of a site has become a public
issue:
[ Forwarded message ]
>AltaVista also is not as deep a search engine as you might think. The
>email below is from zdnet's "talkback" area. The URL is
>
>http://www5.zdnet.com/anchordesk/talkback/talkback_11638.html
>
>The author reveals that AltaVista doesn't index all pages on a site --
>indeed, geocities.com, with 20,000+ pages, only has about 300 listed.
>You can check this by doing a search for host:geocities.com on
>AltaVista.
If this is true, I'm just as shocked as the writer above. I thought
that the original conversation by some DIGITAL researchers at
Left at Albuquerque about the idea for AltaVista was essentially
"index the entire Web". If the scope of the project has at some point
been scaled back, it seems to have been done without any public notice.
erik
|
| 4577.8 | | BUSY::SLAB | Form feed = <ctrl>v <ctrl>l | Thu Mar 27 1997 17:17 | 7 |
|
I'm not sure how often the index is updated, but whatever the time
span is, it's too long.
All too often I get "not found" errors because the page has
disappeared since the index was last created.
|
| 4577.9 | | QUARK::LIONEL | Free advice is worth every cent | Sun Mar 30 1997 20:34 | 20 |
| AltaVista has indexed only a handful of the ourworld.compuserve.com
sites - my page there keeps getting dropped from AV's list. The
problem, as Louis Monier once explained to me, is that AV tries to be a
"good citizen" and stops indexing a site if it is pulling down what it
thinks are too many pages.
One unfortunate side effect of people having difficulty getting URLs
indexed is that some of them decide that the way to do it is to "spam"
the Digital e-mail list with demands to get the site indexed. One jerk
set up a batch job to remail the same complaint to the entire list
every day.
(Another jerk decided that he DIDN'T want his site indexed - he refused
all suggestions for how to do this on his own and blasted the e-mail
list dozens of times a day with incoherent rantings.)
Unfortunately, it seems that the AV people don't have adequate staff to
respond to inquiries.
Steve
|
| 4577.10 | | CIRCUS::GOETZE | Tibetan karma not Made in China | Tue Apr 01 1997 14:46 | 12 |
|
re: not indexing the entire Web:
I hear it's simply a budgetary problem--if AV were to index
all known pages *and* keep the same responsiveness level it has today,
it would take three times as many turboLasers as they use today
(12, each with max CPU boards?).
That's too bad. I might start using a search engine which does attempt
to index all pages.
erik
|
| 4577.11 | well, they're Digital | LGP30::FLEISCHER | without vision the people perish (DTN 381-0426 ZKO1-1) | Wed Apr 02 1997 06:26 | 12 |
| re Note 4577.10 by CIRCUS::GOETZE:
> That's too bad. I might start using a search engine which does attempt
> to index all pages.
Yes -- indexing everything is AltaVista's claim to fame.
(Yes, I understand that it was never possible to really index
"everything", but one would expect that static, un-hidden
pages would certainly be indexed.)
Bob
|
| 4577.12 | Spiders can only crawl over links | STAR::COPE | | Wed Apr 02 1997 09:40 | 16 |
| Also, AV only indexes pages its crawler can find (via links from
other pages), correct? If I were an ISP, and allowed my users to have
homepages at
http://myisp.com/~joeuser/index.html
Joe's page would never make it into AltaVista unless and until there
was a link to get there from some page in AV's space (which, I
expect, starts at large sites like Yahoo and works its way down?)
Isolated groups of pages with no outside references just aren't
going to make it.
(I'm just guessing here; this may not be all that relevant to which
pages get passed over... but it seems like another thing to consider.
Spiders can only crawl; they can't jump.)
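A minimal sketch of that guess, assuming the crawler does a plain breadth-first walk from its seed pages over a link graph. The graph below is invented for illustration (only the joeuser URL comes from this note), and a real spider surely does more than this.

```python
from collections import deque


def crawl(seeds, links):
    """Return every page reachable from the seeds by following links."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical link graph: page -> pages it links to.
links = {
    "http://bigsite.example/": ["http://myisp.com/"],
    "http://myisp.com/": [],  # the ISP's front page does not link to Joe
    "http://myisp.com/~joeuser/index.html": [
        "http://myisp.com/~joeuser/pics.html",
    ],
}

found = crawl(["http://bigsite.example/"], links)
joe_found = "http://myisp.com/~joeuser/index.html" in found

# One inbound link from an already-crawled page is enough to change that.
links["http://myisp.com/"].append("http://myisp.com/~joeuser/index.html")
found_after_link = crawl(["http://bigsite.example/"], links)
```

In this sketch Joe's pages are missed on the first pass and picked up on the second, including the page that only Joe links to.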
|
| 4577.13 | AltaVista | CONSLT::OWEN | Stop Global Whining | Wed Apr 02 1997 09:44 | 21 |
| I think AltaVista might save some space if they put a size limit on
pages that it indexes. More and more I'm seeing log files and other
data dumps get indexed. Pages that are many many megs in size. And
since these contain SO MUCH information, they often pop up in search
results even though they have nothing to do with what you're
searching for.
I don't think it is unreasonable to ask that if a page is to be
indexed, its size be kept reasonable. I don't know what
that number is, but it's certainly less than 1 meg. Maybe 200K.
Lower? Higher?
How about dumping pages that haven't been updated in over a year?
Like others have said here, it's really a shame that AltaVista punishes
people on large ISPs with lots of pages. TIAC is a good example. If
some bozo on TIAC spammed the index, don't make everyone else pay for
it.
-Steve
|
| 4577.14 | | DECCXL::WIBECAN | That's the way it is, in Engineering! | Wed Apr 02 1997 09:52 | 8 |
| >> How about dumping pages that haven't been updated in over a year?
Bad idea. There are a great many sources of information that do not change
over the years. If someone were, for example, to supply the complete works of
Shakespeare over the web, there would be little reason to change the pages,
ever.
Brian
|
| 4577.15 | | BUSY::SLAB | Dancin' on Coals | Wed Apr 02 1997 10:22 | 6 |
|
RE: .12
That doesn't explain something like Geocities, though, which has
links to all of its pages on the main page.
|
| 4577.16 | 4MB limit imposed? | WOTVAX::16.194.64.183::watson | OK, whats todays long term strategy? | Wed Apr 02 1997 11:32 | 6 |
| re: .13
According to the book "The AltaVista Search Revolution" page 91, files over
4MB are truncated.
-- Rob
|
| 4577.17 | The complete works of Shakespeare... | STAR::PITCHER | Steve Pitcher/Pathworks for OpenVMS | Mon Apr 07 1997 12:57 | 11 |
| re: .14
What "If"!
>> If someone were, for example, to supply the complete works of
>> Shakespeare over the web,
See: http://the-tech.mit.edu/Shakespeare/works.html
- stp
|
| 4577.18 | Adding personal URL to AltaVista | VMSNET::RRICK | I'd rather be fishing! | Wed Apr 09 1997 20:14 | 10 |
| It is possible to add your personal URLs of the form
www.network.com/~myusername
At the bottom of the AltaVista search page is an option
Add URL. Just place your personal web page there and you're all set.
I did so and showed up in Alta Vista the next day.
Randy
|
| 4577.19 | | QUARK::LIONEL | Free advice is worth every cent | Thu Apr 10 1997 11:46 | 3 |
| But keep a watch out - it is likely to disappear after a month or two.
Steve
|
| 4577.20 | | PCBUOA::BAYJ | Jim, Portables | Thu Apr 10 1997 15:05 | 12 |
| On the other hand, getting a page *out* of some of the other search
engines is darn near impossible. I have a page that was indexed in
November. A couple months ago I placed noindex meta tags and a robot
file there, but it is STILL there, still showing a date of November.
Some pages may get updated frequently, but not all. Either that, or
once a page is cataloged, when they go back, they only check if it's
there, and don't look for the meta tags or robot file during the
refresh pass.
jeb
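For reference, a minimal sketch of the two checks a well-behaved indexer would be expected to repeat on every refresh: the robots file and the noindex meta tag. It uses Python's standard urllib.robotparser and html.parser; the example.com URLs, the "ExampleBot" agent name, and the /private/ rule are assumptions for illustration, and nothing here says how any particular search engine actually does its refresh pass.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots file contents, parsed from memory (no network access).
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])


class NoIndexFinder(HTMLParser):
    """Sets .noindex if the page has <meta name="robots" content="...noindex...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = (attributes.get("content") or "").lower()
        if tag == "meta" and name == "robots" and "noindex" in content:
            self.noindex = True


def may_index(url, html, agent="ExampleBot"):
    """True only if both the robots file and the page itself allow indexing."""
    if not robots.can_fetch(agent, url):
        return False
    finder = NoIndexFinder()
    finder.feed(html)
    return not finder.noindex
```

If a refresh pass only checks that the page still exists and skips a test like may_index, a page would stay in the index no matter what tags were added later, which would match the behavior described above.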
|
| 4577.21 | the index is *very* sticky | FIEVEL::FILGATE | Bruce Filgate SHR3-2/W4 237-6452 | Sun Apr 13 1997 18:28 | 9 |
|
There are lots of dead links in the Altavista index, almost as if
once an entity is indexed, it is never again checked. More likely
the memory algorithm is just overly sticky. Sorry to .-1, but the
dead links I checked had been dead much longer than a couple of
months, one appeared to have been taken down 10 months before
Altavista pointed me there.
Bruce
|
| 4577.22 | | JAMIN::OSMAN | Eric Osman, dtn 226-7122 | Tue Apr 29 1997 10:27 | 12 |
|
I've never wanted a page of mine altavista'd. But if I did, and
I just found the owner of some page that was already altavista'd and I
convinced that owner to link to my page from theirs, then wouldn't
mine automatically be altavista'd within a week or two? (How often does
altavista do its crawl? How long does it take to crawl?)
Would this method work better than the direct-email method (which seems
to dead-end in a too-many-requests error)?
/Eric
|