[ADMAX-3147] URLs with parsing exceptions are missing from ADR's CSV file (and from index) - Admax Local

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: Sustaining
Component/s: Cloud Spider
Labels:
None

Sprint:
Sprint 2

Description

When a URL responds with a 200 status code but its content fails to parse (a parsing exception is thrown), the URL goes missing from the index and is not reported in the CSV file.

Examples include URLs that link to an image, a JavaScript file, or a PDF document but lack the file extension we use to identify and filter image and PDF links.

e.g. <a href="http://www.loeb.com/analytics.axd?js=main">Javascript link</a>. In this example, since the URL doesn't end with .js, the crawl assumes it is a valid link and tries to parse it. Because it actually points to a JavaScript file, parsing fails and the URL is dropped from the index.
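The miss described above can be reproduced with a minimal sketch of an extension-based filter. This is not the actual Cloud Spider code: the function name, the extension list, and the second URL are hypothetical, chosen only to illustrate why a URL like the one above slips past a check that looks at the path's extension.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical list of extensions the crawl skips before parsing.
SKIPPED_EXTENSIONS = {".js", ".pdf", ".png", ".jpg", ".jpeg", ".gif"}


def is_filtered_by_extension(url: str) -> bool:
    """Return True if the URL path ends in an extension the crawl would skip."""
    path = urlparse(url).path  # the query string (?js=main) is not part of the path
    extension = posixpath.splitext(path)[1].lower()
    return extension in SKIPPED_EXTENSIONS


# The URL from this ticket: the path ends in .axd, so the filter lets it
# through and the crawl goes on to parse JavaScript as if it were HTML.
print(is_filtered_by_extension("http://www.loeb.com/analytics.axd?js=main"))

# A hypothetical URL with a recognizable extension is filtered as expected.
print(is_filtered_by_extension("http://www.loeb.com/scripts/main.js"))
```

The first call evaluates to False and the second to True, which is the gap this bug falls into: the content type cannot be inferred from the URL pattern alone.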

After talking to client services (Mark Fillmore & team), they suggested adding these URLs to the index with a distinct status of "Parsing_exception" and an HTTP response code of 0, so that they are identifiable in the CSV file. This will apply only to URLs that return non-HTML content and whose content type is not identifiable from the URL pattern.
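A minimal sketch of the suggested behavior follows, assuming a simplified index entry with three fields. Only the "Parsing_exception" status and the HTTP code of 0 come from this ticket. The class and function names, the "OK" status, the CSV column names, and the stand-in parser are hypothetical and do not reflect the actual fix.

```python
import csv
import io
from dataclasses import dataclass

PARSING_EXCEPTION = "Parsing_exception"  # status name taken from this ticket


@dataclass
class IndexEntry:
    url: str
    status: str
    http_code: int


class ParsingError(Exception):
    """Raised by the stand-in parser when content is not HTML."""


def parse_html(content_type: str, body: str) -> None:
    # Stand-in for the real parser: reject anything not served as HTML.
    if not content_type.lower().startswith("text/html"):
        raise ParsingError(f"cannot parse content type {content_type!r}")


def index_response(url: str, http_code: int, content_type: str, body: str) -> IndexEntry:
    """Keep the URL in the index even when parsing fails."""
    try:
        parse_html(content_type, body)
    except ParsingError:
        # Instead of dropping the URL, record it with the special status
        # and an HTTP response code of 0, as client services suggested.
        return IndexEntry(url, PARSING_EXCEPTION, 0)
    return IndexEntry(url, "OK", http_code)  # "OK" is an assumed status name


def to_csv(entries: list[IndexEntry]) -> str:
    """Write index entries to CSV text with assumed column names."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url", "status", "http_code"])
    for entry in entries:
        writer.writerow([entry.url, entry.status, entry.http_code])
    return buffer.getvalue()


entry = index_response(
    "http://www.loeb.com/analytics.axd?js=main",
    200,
    "application/javascript",
    "var main = 1;",
)
print(to_csv([entry]))
```

With this shape, a 200 response that fails to parse still produces a CSV row, and the status column is what distinguishes it from a successfully parsed page.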

People

Assignee:

Jeff Shih (Inactive)

Reporter:

Abhiram Bhagwat

Votes:

0

Watchers:

1

Dates

Created:

17/Dec/12 1:48 PM

Updated:

24/Jan/13 7:52 PM

Resolved:

24/Jan/13 7:52 PM