
Dima throws "Size exceeds Integer.MAX_VALUE" exception when processing the LiveJournal dataset

Open wangzk opened this issue 7 years ago • 0 comments

Hi! Sorry to bother you with another bug report.

Recently, when I tried to run Dima on the com-LiveJournal dataset from SNAP Datasets, I ran into another exception.

The exception is:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4367 in stage 6.0 failed 4 times, most recent failure: Lost task 4367.3 in stage 6.0 (TID 61683, slave004): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
        at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
        at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
        at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
        at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
        at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
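
For reference, the topmost frame points at the mechanism behind the message: Spark 1.x's DiskStore memory-maps the whole on-disk block it reads back, and java.nio's FileChannel.map rejects any mapping longer than Integer.MAX_VALUE bytes (about 2 GB). A minimal Scala sketch of that limit (the file path here is hypothetical):

    import java.nio.channels.FileChannel
    import java.nio.file.{Paths, StandardOpenOption}

    // Open some on-disk block file (hypothetical path) the way DiskStore does.
    val channel = FileChannel.open(Paths.get("/tmp/cached-block.bin"), StandardOpenOption.READ)
    try {
      // FileChannelImpl.map throws IllegalArgumentException("Size exceeds Integer.MAX_VALUE")
      // as soon as the requested length is larger than Int.MaxValue, i.e. the block is over ~2 GB.
      val mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
      println(s"mapped ${mapped.capacity()} bytes")
    } finally {
      channel.close()
    }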

Details of my run:

  • I ran Dima with this Scala script to conduct a Jaccard similarity self-join on the adjacency sets of the LiveJournal graph (a rough sketch of the setup is shown after this list).
  • spark.sql.joins.numSimialrityPartitions is 10240.
  • Threshold: 0.9.
  • The preprocessed dataset file can be downloaded from here with the access password KmSSjq.
  • Spark job configuration:
    • Number of executors: 32;
    • Executor cores: 8;
    • Executor memory: 20 GB;
    • Driver memory: 10 GB;
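
Roughly, the setup corresponds to the sketch below. Only the configuration values listed above are real; the input path and the commented-out join call are placeholders, since the exact Dima API call lives in the linked Scala script:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Cluster resources as listed above (in practice passed via spark-submit).
    val conf = new SparkConf()
      .setAppName("dima-jaccard-selfjoin")
      .set("spark.executor.instances", "32")
      .set("spark.executor.cores", "8")
      .set("spark.executor.memory", "20g")
      .set("spark.driver.memory", "10g")

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Dima partitioning knob used in this run.
    sqlContext.setConf("spark.sql.joins.numSimialrityPartitions", "10240")

    // Placeholder for the actual similarity self-join with threshold 0.9;
    // the real call/SQL syntax is whatever the linked script uses.
    // val records = sqlContext.read.text("hdfs:///path/to/livejournal-adjacency-sets")
    // val result  = records.similarityJoin(records, "JACCARD", 0.9)  // hypothetical API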

Details of the dataset:

  • Number of records: 3997962;
  • Average record length: 17.3.

On average there are only about 390 records per partition (3997962 / 10240 = 390.4), which is not very big. It seems that some RDD partition becomes too large during execution. Do I need to change some parameters to enable Dima to run on this dataset?
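
If it helps with diagnosis, this is a minimal sketch of how I would check whether one partition ends up far larger than the average; resultRdd stands in for whichever RDD is being cached when the failure happens:

    // Count records per partition to spot skew (resultRdd is a placeholder name).
    val perPartition = resultRdd
      .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }
      .collect()
      .sortBy { case (_, n) => -n }

    // The heaviest partitions are the ones whose cached blocks can exceed the 2 GB mmap limit.
    perPartition.take(10).foreach { case (idx, n) => println(s"partition $idx: $n records") }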

Thank you very much for looking into this bug report!

wangzk · Feb 05 '18 09:02