Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Cannot Reproduce
- Affects Version/s: Cloud Spider 3.02
- Fix Version/s: Cloud Spider 3.03
- Component/s: Cloud Spider
- Labels: None
Description
Large crawls (around 2 million URLs) that process large volumes of data (a few hundred GB) fail with an OutOfMemoryError. One such failure was observed in a production crawl (S-1050) with 2 million URLs, which died at the index creation step while processing 250+ GB of data.
One likely cause is the volume of intermediate data generated in the index creation step: Nutch dumps the entire crawl content during this step, even though we are not indexing the page content. Changing this should shrink the intermediate files and keep the MapReduce jobs within memory limits.
Tweaking Hadoop parameters such as io.sort.mb and mapred.child.java.opts could also help; an example of the kind of overrides meant is shown below.
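As a rough sketch of those overrides: the snippet below sets the two properties named above in Hadoop's mapred-site.xml (or the equivalent Nutch conf/ override). The values are illustrative assumptions only and would need tuning against the crawl size and the heap actually available.
----Example Hadoop overrides (illustrative values, not verified for this crawl)----
<property>
  <name>io.sort.mb</name>
  <!-- Map-side sort buffer in MB; Hadoop's default is 100. Example value only. -->
  <value>200</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <!-- Heap for map/reduce child JVMs; default is -Xmx200m. Example value only. -->
  <value>-Xmx1024m</value>
</property>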
----Log Output----
Wed 2011/09/14 14:45:23.196| |Thread-51|StreamGobbler|STDERR: 11/09/14 14:45:23 INFO mapred.Merger: Merging 10 intermediate segments out of a total of 54
Wed 2011/09/14 14:45:28.279| |Thread-51|StreamGobbler|STDERR: 11/09/14 14:45:28 INFO mapred.LocalJobRunner: reduce > sort
Wed 2011/09/14 14:45:49.289| |Thread-51|StreamGobbler|STDERR: 11/09/14 14:45:49 INFO mapred.LocalJobRunner: reduce > sort
Wed 2011/09/14 14:45:52.671| |Thread-51|StreamGobbler|STDERR: 11/09/14 14:45:52 INFO mapred.LocalJobRunner: reduce > sort
Wed 2011/09/14 14:45:55.671| |Thread-51|StreamGobbler|STDERR: 11/09/14 14:45:55 INFO mapred.LocalJobRunner: reduce > sort
Wed 2011/09/14 14:46:05.138| |Thread-51|StreamGobbler|STDERR: 11/09/14 14:46:05 INFO mapred.LocalJobRunner: reduce > sort
Wed 2011/09/14 14:47:09.486| |Thread-51|StreamGobbler|STDERR: 11/09/14 14:47:09 WARN mapred.LocalJobRunner: job_local_0048
Wed 2011/09/14 14:47:09.486| |Thread-51|StreamGobbler|STDERR: java.lang.OutOfMemoryError: Java heap space
Wed 2011/09/14 14:47:09.487| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:315)
Wed 2011/09/14 14:47:09.487| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:377)
Wed 2011/09/14 14:47:09.487| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:174)
Wed 2011/09/14 14:47:09.488| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:277)
Wed 2011/09/14 14:47:09.488| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:297)
Wed 2011/09/14 14:47:09.488| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:110)
Wed 2011/09/14 14:47:09.488| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:433)
Wed 2011/09/14 14:47:09.489| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:326)
Wed 2011/09/14 14:47:09.489| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.Merger.merge(Merger.java:58)
Wed 2011/09/14 14:47:09.489| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:384)
Wed 2011/09/14 14:47:09.490| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
Wed 2011/09/14 14:47:10.323| |Thread-51|StreamGobbler|STDERR: Exception in thread "main" java.io.IOException: Job failed!