AdMax / ADMAX-3162

Cloud Spider - URLs with ";" are not crawled correctly.

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Cloud Spider
    • Labels: None

      Description

      This is caused by Nutch bug NUTCH-1115 (https://issues.apache.org/jira/browse/NUTCH-1115).

      When the URL of a crawled page contains ";", everything from the ";" onward is appended to every URL extracted from that page.

      For example, suppose we crawl http://www.schumacherhomes.com/news/schumacher-homes-updates-customizable-home-design-system;-clients-can-change-anything-they-want and the page contains the following links:

      http://www.schumacherhomes.com/galleries/models/
      http://www.schumacherhomes.com/about-us/

      Because of the bug, these URLs are read incorrectly as:

      http://www.schumacherhomes.com/galleries/models/;-clients-can-change-anything-they-want
      http://www.schumacherhomes.com/about-us/;-clients-can-change-anything-they-want
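
      The behavior described above can be illustrated with a minimal Java sketch. This is not Nutch's actual code: the class EmbeddedParamsDemo and the helpers resolveCopyingParams and resolvePlain are hypothetical names, and the sketch assumes the faulty step simply copies everything after the first ";" in the base URL onto each outlink that has no ";" of its own. It also ignores query strings for brevity.

```java
import java.net.URI;

class EmbeddedParamsDemo {

    // Hypothetical helper that mimics the reported behavior: after normal
    // resolution, the ";..." suffix of the base URL is copied onto the
    // outlink whenever the outlink has no ";" itself.
    static String resolveCopyingParams(String base, String target) {
        String resolved = URI.create(base).resolve(target).toString();
        int semi = base.indexOf(';');
        if (semi != -1 && resolved.indexOf(';') == -1) {
            resolved += base.substring(semi);
        }
        return resolved;
    }

    // Hypothetical helper showing the expected behavior: plain URI
    // resolution, with nothing carried over from the base URL.
    static String resolvePlain(String base, String target) {
        return URI.create(base).resolve(target).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.schumacherhomes.com/news/"
                + "schumacher-homes-updates-customizable-home-design-system"
                + ";-clients-can-change-anything-they-want";
        String[] links = {
            "http://www.schumacherhomes.com/galleries/models/",
            "http://www.schumacherhomes.com/about-us/"
        };
        for (String link : links) {
            System.out.println("copying params: " + resolveCopyingParams(base, link));
            System.out.println("plain resolve:  " + resolvePlain(base, link));
        }
    }
}
```

      With the base URL from this ticket, the first helper yields the incorrect outlinks listed above, while plain resolution returns the links unchanged.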

      This needs to be fixed by applying the patch from NUTCH-1115, referenced above.

      This was found in Crawl S-6519.

            People

            • Assignee: Abhiram Bhagwat
            • Reporter: Abhiram Bhagwat
            • Votes: 0
            • Watchers: 1
