Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: Cloud Spider 3.03
- Fix Version/s: Sustaining
- Component/s: Cloud Spider
- Labels: None
Description
When a URL responds with headers only and no body content, Nutch throws the following exception and marks the URL as failed:
2012-01-11 10:52:55,377 ERROR httpclient.Http - java.io.IOException: unzipBestEffort returned null
2012-01-11 10:52:55,378 ERROR httpclient.Http - at org.apache.nutch.protocol.http.api.HttpBase.processGzipEncoded(HttpBase.java:518)
2012-01-11 10:52:55,378 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:189)
2012-01-11 10:52:55,378 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:155)
2012-01-11 10:52:55,379 ERROR httpclient.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:235)
2012-01-11 10:52:55,379 ERROR httpclient.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:736)
Expected behavior: the URL should be added to the index with its proper response code instead of being marked as failed with a response code of 0.
This is a bug in Nutch 1.0 and should be fixed as explained in https://issues.apache.org/jira/browse/NUTCH-862
This issue was found in production crawl S-1910 for the URL http://www.shl.com/atlantic
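
The failure mode described above (a gzip-encoded response with no body) can be illustrated with a minimal sketch. This is not the Nutch code or the NUTCH-862 patch. The class EmptyBodyGzip and its methods are hypothetical names used only for illustration, and the sketch assumes that an empty body should be treated as empty content rather than as a decoding error.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Nutch: decodes a gzip-encoded response body
// and tolerates responses that carry headers but no body.
public class EmptyBodyGzip {

    // Returns the decompressed body. A null or zero-length body is returned as
    // an empty array instead of being passed to the gzip decoder (assumption:
    // an empty body is valid content, not a failure).
    public static byte[] decodeGzipBody(byte[] compressed) throws IOException {
        if (compressed == null || compressed.length == 0) {
            return new byte[0];
        }
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    // Test helper: gzip-compresses a byte array.
    public static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Header-only response: no body bytes at all.
        System.out.println(decodeGzipBody(new byte[0]).length);
        // Normal response: round-trips through gzip.
        byte[] body = gzip("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(decodeGzipBody(body), StandardCharsets.UTF_8));
    }
}
```

With a guard of this kind, a header-only response yields empty content, so the caller can still record the real HTTP status code instead of failing the fetch.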