  1. AdMax
  2. ADMAX-2894

Cloud Spider: URLs found during a crawl that do not match the allowed-URL regex pattern cause the crawl to hang

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Cloud Spider 3.03
    • Component/s: Cloud Spider
    • Labels: None

      Description

      Regex patterns are built from the crawl input options to decide whether a URL is valid for the crawl. Some URLs found during the crawl do not match these patterns, and for them the URL pattern validator regex never finishes: it keeps running for a long time, consuming a lot of CPU, and the crawl process hangs midway.

      We need to either change the regex or use different logic to validate a URL for the crawl.
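
      As one possible shape for the "different logic" option, here is a minimal sketch that validates a URL by parsing it and comparing its parts with plain string checks, so the cost grows linearly with the URL length. It is not the fix that shipped in Cloud Spider 3.03, which this ticket does not describe. The function name `is_allowed_url` and the parameters `allowed_hosts` and `allowed_path_prefixes` are hypothetical stand-ins for the crawl input options.

```python
from urllib.parse import urlsplit


def is_allowed_url(url, allowed_hosts, allowed_path_prefixes=("/",)):
    """Return True if `url` belongs to the crawl.

    Hypothetical helper: `allowed_hosts` and `allowed_path_prefixes` stand in
    for whatever the crawl input options actually contain.
    Only parsing and prefix comparisons are used, so there is no regex
    backtracking and a non-matching URL is rejected quickly.
    """
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (for example, unbalanced IPv6 brackets) are rejected.
        return False

    # Only crawl web pages; skip mailto:, javascript:, and similar links.
    if parts.scheme not in ("http", "https"):
        return False

    # Host names are case-insensitive, so compare in lower case.
    host = (parts.hostname or "").lower()
    if host not in allowed_hosts:
        return False

    # An empty path is equivalent to the site root.
    path = parts.path or "/"
    return any(path.startswith(prefix) for prefix in allowed_path_prefixes)
```

      For example, `is_allowed_url("http://www.monster.de/jobs/", {"www.monster.de"})` accepts the URL, while a link to any other host is rejected after a single set lookup, however long the URL is.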

      Crawls affected so far:

      Crawl 1499 - http://www.monster.de/

      Crawl 1500 - http://www.monster.fr/

            People

            • Assignee: antony Antony Rajiv (Inactive)
            • Reporter: abhiram Abhiram Bhagwat