Dima throws "Size exceeds Integer.MAX_VALUE" exception when processing the LiveJournal dataset
Hi! Sorry to bother you with another bug report.
Recently, when I tried to run Dima on the com-LiveJournal dataset from SNAP Datasets, I encountered another exception.
The exception is:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4367 in stage 6.0 failed 4 times, most recent failure: Lost task 4367.3 in stage 6.0 (TID 61683, slave004): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Details of my run:
- I ran Dima with this Scala script to conduct a Jaccard similarity self-join on the adjacency sets of the LiveJournal graph (a minimal sketch of the setup is shown after this list).
- spark.sql.joins.numSimialrityPartitions is 10240.
- Threshold: 0.9.
- The preprocessed dataset file can be downloaded from here with the access password KmSSjq.
- Spark job configuration:
- Number of executors: 32;
- Executor cores: 8;
- Executor memory: 20 GB;
- Driver memory: 10 GB;
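For context, here is a minimal sketch of the kind of setup the script performs. The input path is a placeholder, and the naive pairwise Jaccard filter at the end only illustrates what the join computes; the actual run uses Dima's similarity-join operator, not this brute-force version:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal sketch of the driver setup (Spark 1.x style, matching the stack trace).
val conf = new SparkConf().setAppName("Dima Jaccard self-join on LiveJournal")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Partition count used in the failing run.
sqlContext.setConf("spark.sql.joins.numSimialrityPartitions", "10240")

// Each line of the preprocessed file is one record: the adjacency set of a
// vertex, with tokens separated by whitespace. The path is a placeholder.
val records = sc.textFile("hdfs:///data/livejournal_adjacency_sets.txt")
  .map(_.split("\\s+").toSet)

// What the job computes, written naively for clarity (NOT how Dima evaluates
// it): all pairs of records whose Jaccard similarity is at least 0.9.
def jaccard(a: Set[String], b: Set[String]): Double =
  a.intersect(b).size.toDouble / a.union(b).size

val indexed = records.zipWithIndex().map(_.swap)
val resultPairs = indexed.cartesian(indexed)
  .filter { case ((i, _), (j, _)) => i < j }
  .filter { case ((_, a), (_, b)) => jaccard(a, b) >= 0.9 }
```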
Details of the dataset:
- Number of records: 3997962;
- Average record length: 17.3.
On average there are about 390.4 records per partition (3997962 / 10240), which is not very large. Yet the stack trace shows DiskStore failing to memory-map a cached block, which means some block has grown beyond 2 GB (Integer.MAX_VALUE bytes), so it seems that some RDD partition becomes too large during execution. Do I need to change some parameters to enable Dima to run on this dataset?
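In case it is useful, this is a small diagnostic I can run to check whether the skew is already in the input itself (here records is the RDD from the sketch above; the partition that actually overflows may of course be an internal one produced by Dima's repartitioning):

```scala
// Count the records in each partition to see whether any single partition is
// far above the ~390-record average mentioned above.
val partitionSizes = records
  .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
  .collect()
  .sortBy(-_._2)

partitionSizes.take(10).foreach { case (idx, n) =>
  println(s"partition $idx: $n records")
}
```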
Thank you very much for looking into this bug report!