AdMax / ADMAX-2839

Cloud Spider: Large Crawls Processing Large Volumes of Data Crash with OutOfMemoryError

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Cannot Reproduce
    • Affects Version/s: Cloud Spider 3.02
    • Fix Version/s: Cloud Spider 3.03
    • Component/s: Cloud Spider
    • Labels:
      None

      Description

      Large crawls (around 2 million URLs) that process large volumes of data (a few hundred GB) fail with an OutOfMemoryError. One such failure was observed in a production crawl (S-1050) with 2 million URLs, which failed at the index creation step while processing 250+ GB of data.

      One cause appears to be the volume of data generated in the index creation step. Nutch dumps the entire crawl content during this step, even though we do not index the page content. Changing this should reduce the size of the intermediate files and should help keep the MapReduce jobs within their memory limits.

      Tweaking Hadoop parameters such as io.sort.mb and mapred.child.java.opts could also be useful.
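
      As a rough sketch, these parameters could be set in whichever Hadoop configuration file the crawl picks up (mapred-site.xml or nutch-site.xml, depending on how Cloud Spider is deployed; that is an assumption here). The values are illustrative placeholders, not tested recommendations, and io.sort.factor is an additional knob that this issue does not name:

      ```xml
      <configuration>
        <!-- Buffer size in MB used while sorting map output. It must fit inside the task heap. Illustrative value. -->
        <property>
          <name>io.sort.mb</name>
          <value>200</value>
        </property>
        <!-- JVM options for child task processes; the Xmx flag sets the maximum heap. Illustrative value. -->
        <property>
          <name>mapred.child.java.opts</name>
          <value>-Xmx2048m</value>
        </property>
        <!-- Assumption, not named in the issue: number of streams merged at once while sorting. -->
        <property>
          <name>io.sort.factor</name>
          <value>10</value>
        </property>
      </configuration>
      ```

      The log below shows 10 segments merged per pass, which appears to match the usual default for io.sort.factor. It also shows the job running under mapred.LocalJobRunner. In local mode the tasks run inside the launching JVM rather than in child processes, so mapred.child.java.opts would likely have no effect there, and the heap of the launching JVM (its own -Xmx setting) would be the limit to raise.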

      ----Log Output----

      Wed 2011/09/14 14:45:23.196| |Thread-51|StreamGobbler|STDERR: 11/09/14 14:45:23 INFO mapred.Merger: Merging 10 intermediate segments out of a total of 54

      Wed 2011/09/14 14:45:28.279| |Thread-51|StreamGobbler|STDERR: 11/09/14 14:45:28 INFO mapred.LocalJobRunner: reduce > sort

      Wed 2011/09/14 14:45:49.289| |Thread-51|StreamGobbler|STDERR: 11/09/14 14:45:49 INFO mapred.LocalJobRunner: reduce > sort

      Wed 2011/09/14 14:45:52.671| |Thread-51|StreamGobbler|STDERR: 11/09/14 14:45:52 INFO mapred.LocalJobRunner: reduce > sort

      Wed 2011/09/14 14:45:55.671| |Thread-51|StreamGobbler|STDERR: 11/09/14 14:45:55 INFO mapred.LocalJobRunner: reduce > sort

      Wed 2011/09/14 14:46:05.138| |Thread-51|StreamGobbler|STDERR: 11/09/14 14:46:05 INFO mapred.LocalJobRunner: reduce > sort

      Wed 2011/09/14 14:47:09.486| |Thread-51|StreamGobbler|STDERR: 11/09/14 14:47:09 WARN mapred.LocalJobRunner: job_local_0048

      Wed 2011/09/14 14:47:09.486| |Thread-51|StreamGobbler|STDERR: java.lang.OutOfMemoryError: Java heap space

      Wed 2011/09/14 14:47:09.487| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:315)

      Wed 2011/09/14 14:47:09.487| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:377)

      Wed 2011/09/14 14:47:09.487| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:174)

      Wed 2011/09/14 14:47:09.488| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:277)

      Wed 2011/09/14 14:47:09.488| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:297)

      Wed 2011/09/14 14:47:09.488| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:110)

      Wed 2011/09/14 14:47:09.488| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:433)

      Wed 2011/09/14 14:47:09.489| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:326)

      Wed 2011/09/14 14:47:09.489| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.Merger.merge(Merger.java:58)

      Wed 2011/09/14 14:47:09.489| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:384)

      Wed 2011/09/14 14:47:09.490| |Thread-51|StreamGobbler|STDERR: at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)

      Wed 2011/09/14 14:47:10.323| |Thread-51|StreamGobbler|STDERR: Exception in thread "main" java.io.IOException: Job failed!

            People

            • Assignee: antony Antony Rajiv (Inactive)
            • Reporter: antony Antony Rajiv (Inactive)
