Details
-
Type: Bug
-
Status: Open
-
Priority: Major
-
Resolution: Unresolved
-
Affects Version/s: None
-
Fix Version/s: None
-
Component/s: Cloud Spider
-
Labels: None
Description
This is caused by Nutch bug https://issues.apache.org/jira/browse/NUTCH-1115.
When the URL of a crawled page contains ";", everything from the ";" onward is treated as URL parameters and appended to every URL extracted from that page.
For example, suppose we crawl the URL http://www.schumacherhomes.com/news/schumacher-homes-updates-customizable-home-design-system;-clients-can-change-anything-they-want and the page contains the links below:
http://www.schumacherhomes.com/galleries/models/
http://www.schumacherhomes.com/about-us/
Because of the above bug, these URLs are read incorrectly as:
http://www.schumacherhomes.com/galleries/models/;-clients-can-change-anything-they-want
http://www.schumacherhomes.com/about-us/;-clients-can-change-anything-they-want
This needs to be fixed by applying the patch mentioned in the above Nutch bug (NUTCH-1115).
This was found in Crawl S-6519.
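The behaviour described above can be reproduced with a minimal sketch. It is not the Nutch source: the class `EmbeddedParamsDemo`, the method `resolve`, and the boolean `fixEmbeddedParams` are hypothetical names used only for illustration, on the assumption that the link extractor copies the ";..." part of the page URL onto any outlink that has no ";" of its own.

```java
import java.net.URI;

// Hypothetical demo class, not the actual Nutch code.
class EmbeddedParamsDemo {

    // Resolves target against base. When fixEmbeddedParams is true (an assumed
    // switch), the ";..." suffix of the base URL is copied onto a target that
    // has no ";" of its own, which reproduces the symptom in this ticket.
    static String resolve(String base, String target, boolean fixEmbeddedParams) {
        int start = base.indexOf(';');
        if (fixEmbeddedParams && start >= 0 && target.indexOf(';') < 0) {
            String params = base.substring(start);
            int query = target.indexOf('?');
            // Keep the parameters after the path but before any query string.
            target = (query >= 0)
                    ? target.substring(0, query) + params + target.substring(query)
                    : target + params;
        }
        return URI.create(base).resolve(target).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.schumacherhomes.com/news/"
                + "schumacher-homes-updates-customizable-home-design-system"
                + ";-clients-can-change-anything-they-want";
        String link = "http://www.schumacherhomes.com/about-us/";

        // With the parameter copying enabled, the outlink is corrupted.
        System.out.println(resolve(base, link, true));
        // With it disabled, the outlink is left as written on the page.
        System.out.println(resolve(base, link, false));
    }
}
```

With the switch on, the first line printed is the corrupted /about-us/ URL shown above; with it off, the link comes back unchanged, which is the outcome we want for this crawl.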