
[ADMAX-2826] Cloud Spider: Spider Fetches Far More URLs Than Requested If a Large Site Contains Many URLs with Issues

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: Cloud Spider 3.01
    • Fix Version/s: unspecified
    • Component/s: Cloud Spider
    • Labels: None

      Description

      From my email:


      The "Total URLs Spidered" is currently counted as all Successful pages (Response code 200)

      I think it makes sense to count all URLs that we encounter (ones that are all response codes, errors) towards spidered URLs count.

      Here is the reason for that, based on crawls we see in production:

      Consider a site with 5 million URLs, of which 4,995,000 have errors, and a user who wants to run a quick, small crawl of 5,000 URLs.

      Because only 200 responses count toward the limit, the spider in this case will end up crawling ALL 5 million URLs to find the 5,000 "good" ones: with only 0.1% of the site returning 200, reaching 5,000 successes means fetching roughly 5,000 / 0.001 = 5,000,000 pages. The other 4,995,000 URLs will be flagged as URLs with errors/issues, the whole crawl will run for days, and it may well fail with an "out of memory" error, since it will be attempted on a small node.

      This is clearly not the intent of the crawl request, which I think should stop once 5,000 URLs have been found, good or bad.

      If this change is approved, it is a very quick code change (2-3 lines). If not, we need some other safeguard so that the spider does not crawl far more than it was asked to; a sketch of the counting change follows below.
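
      To make the proposal concrete, here is a minimal sketch of the stop condition in question. The class and field names are hypothetical illustrations, not taken from the actual Cloud Spider code:

      public final class CrawlBudget {
          private final int requestedUrls;    // the user's crawl size, e.g. 5000
          private int successfulUrls = 0;     // responses with HTTP status 200
          private int totalUrlsSpidered = 0;  // every URL fetched, any status

          public CrawlBudget(int requestedUrls) {
              this.requestedUrls = requestedUrls;
          }

          /** Record one fetched URL; returns true if the crawl should continue. */
          public boolean recordFetch(int httpStatus) {
              totalUrlsSpidered++;
              if (httpStatus == 200) {
                  successfulUrls++;
              }
              // Current behavior: only 200s count toward the limit, so a site
              // that is 99.9% errors forces the spider to fetch everything:
              //     return successfulUrls < requestedUrls;
              // Proposed behavior: every fetched URL counts, so the crawl stops
              // after requestedUrls fetches regardless of response codes:
              return totalUrlsSpidered < requestedUrls;
          }
      }

      With the proposed check, the worst case for the example above is 5,000 fetches instead of 5,000,000.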



            People

            • Assignee: Antony Rajiv (Inactive)
            • Reporter: Antony Rajiv (Inactive)
            • Votes: 0
            • Watchers: 0
