
[ADMAX-2826] Cloud Spider: Spider Fetches Far More URLs Than Requested If a Large Site Contains Many URLs with Issues

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: Cloud Spider 3.01
    • Fix Version/s: unspecified
    • Component/s: Cloud Spider
    • Labels: None

      Description

      From my email:


      The "Total URLs Spidered" is currently counted as all Successful pages (Response code 200)

      I think it makes sense to count all URLs that we encounter (ones that are all response codes, errors) towards spidered URLs count.

      Here is the reason for that, based on crawls we see in production:

      Consider a site with 5 million URLs, of which 4,995,000 have errors, and a user who wants to run a quick, small crawl of 5,000 URLs.

      Because only 200 responses count toward the limit, the spider in this case will end up crawling ALL 5 million URLs to find the 5,000 "good" ones: with only 0.1% of the site returning 200, reaching 5,000 successes means fetching roughly 5,000 / 0.001 = 5,000,000 pages. The other 4,995,000 URLs will be flagged as URLs with errors/issues, the whole crawl will run for days, and it may well fail with an "out of memory" error, since it will be attempted on a small node.

      This is clearly not the intent of the crawl request, which I think should stop once 5,000 URLs have been found, good or bad.

      If this change is approved, it is a very quick code change (2-3 lines). If not, we need some other safeguard so that the spider does not crawl far more than it was asked to; a sketch of the counting change follows below.
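
      To make the proposal concrete, here is a minimal sketch of the stop condition in question. The class and field names are hypothetical illustrations, not taken from the actual Cloud Spider code:

      public final class CrawlBudget {
          private final int requestedUrls;    // the user's crawl size, e.g. 5000
          private int successfulUrls = 0;     // responses with HTTP status 200
          private int totalUrlsSpidered = 0;  // every URL fetched, any status

          public CrawlBudget(int requestedUrls) {
              this.requestedUrls = requestedUrls;
          }

          /** Record one fetched URL; returns true if the crawl should continue. */
          public boolean recordFetch(int httpStatus) {
              totalUrlsSpidered++;
              if (httpStatus == 200) {
                  successfulUrls++;
              }
              // Current behavior: only 200s count toward the limit, so a site
              // that is 99.9% errors forces the spider to fetch everything:
              //     return successfulUrls < requestedUrls;
              // Proposed behavior: every fetched URL counts, so the crawl stops
              // after requestedUrls fetches regardless of response codes:
              return totalUrlsSpidered < requestedUrls;
          }
      }

      With the proposed check, the worst case for the example above is 5,000 fetches instead of 5,000,000.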



            People

            • Assignee: Antony Rajiv (Inactive)
            • Reporter: Antony Rajiv (Inactive)
            • Votes: 0
            • Watchers: 0
