Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: None
- Fix Version/s: Sustaining
- Component/s: Cloud Spider
- Labels: None
- Sprint: Sprint 2
Description
When certain URLs respond with a 200 status code but then fail during parsing because of a parsing exception, they go missing from the index and are not reported.
Examples include URLs that link to an image, a JavaScript file, or a PDF document but lack the file extension we use to identify and filter image or PDF links.
e.g. <a href="http://www.loeb.com/analytics.axd?js=main">Javascript link</a>. In this example, because the URL doesn't end with .js, the crawler assumes it is a valid link and tries to parse it. Since it actually points to a JavaScript file, parsing fails and the URL is dropped from the index.
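To illustrate why an extension-based filter misses this URL, here is a minimal sketch. The extension list and the function name `is_filtered_by_extension` are assumptions made for illustration; the crawler's actual filter is not shown in this issue, and the second URL is a made-up contrast case.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical extension list; the real filter list is not given in this issue.
FILTERED_EXTENSIONS = {".js", ".pdf", ".png", ".jpg", ".jpeg", ".gif"}


def is_filtered_by_extension(url: str) -> bool:
    """Return True if the URL path ends with an extension the crawler skips."""
    path = urlparse(url).path  # the query string (?js=main) is not part of the path
    extension = posixpath.splitext(path)[1].lower()
    return extension in FILTERED_EXTENSIONS


# The URL from this issue: the path ends in .axd, so the filter lets it through.
print(is_filtered_by_extension("http://www.loeb.com/analytics.axd?js=main"))  # False

# A made-up URL with a recognizable extension is filtered as expected.
print(is_filtered_by_extension("http://example.com/static/main.js"))  # True
```

Because the JavaScript hint lives only in the query string, a check on the path extension cannot catch it, and the URL reaches the HTML parser.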
After discussion, client services (Mark Fillmore & team) suggested adding these URLs to the index with a distinct status, "Parsing_exception", and an HTTP response code of 0. This way they will be identifiable in the CSV file. This will apply only to URLs that return non-HTML content and whose content type is not identifiable from the URL pattern.
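The suggested behavior could be sketched roughly as follows. Only the "Parsing_exception" status and the response code of 0 come from this issue; the `IndexEntry` record, the stand-in `parse_html` function, the "OK" status for successful parses, and the CSV column names are all assumptions, not the Cloud Spider implementation.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class IndexEntry:
    # Hypothetical record shape; the real index schema is not shown in this issue.
    url: str
    status: str
    http_code: int


def parse_html(body: str) -> None:
    """Stand-in parser: raises ValueError when the body does not look like HTML."""
    if "<html" not in body.lower():
        raise ValueError("content is not HTML")


def index_url(url: str, http_code: int, body: str) -> IndexEntry:
    """Keep the URL in the index even when parsing fails."""
    try:
        parse_html(body)
    except ValueError:
        # Per the suggestion: distinct status and a response code of 0.
        return IndexEntry(url, "Parsing_exception", 0)
    return IndexEntry(url, "OK", http_code)  # "OK" is an assumed status name


def to_csv(entries: list[IndexEntry]) -> str:
    """Write entries to CSV text so failed URLs show up in the report."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url", "status", "http_code"])  # assumed column names
    for entry in entries:
        writer.writerow([entry.url, entry.status, entry.http_code])
    return buffer.getvalue()


entry = index_url("http://www.loeb.com/analytics.axd?js=main", 200, "var x = 1;")
print(to_csv([entry]))
```

With this shape, the URL is no longer silently dropped: it appears in the CSV with the "Parsing_exception" status, so it can be filtered or reviewed separately from pages that parsed normally.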