
Deadlock in LZ4Factory

Open patelh opened this issue 5 years ago • 3 comments

We are using version 1.5.1 and have seen multiple instances of a deadlock in LZ4Factory. The Spark pipeline hangs and we end up killing and restarting it. It doesn't happen every time. In this case, we see 8 threads blocked; two representative stacks:

"shuffle-server-5-4" #183 daemon prio=5 os_prio=0 tid=0x00007f45e5369000 nid=0x1fe6 waiting for monitor entry [0x00007f45910d4000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at net.jpountz.lz4.LZ4Factory.nativeInstance(LZ4Factory.java:83)
	- waiting to lock <0x00000003c2f8ddf8> (a java.lang.Class for net.jpountz.lz4.LZ4Factory)
	at net.jpountz.lz4.LZ4Factory.fastestInstance(LZ4Factory.java:157)
	at net.jpountz.lz4.LZ4BlockOutputStream.<init>(LZ4BlockOutputStream.java:138)
	at org.apache.spark.io.LZ4CompressionCodec.compressedOutputStream(CompressionCodec.scala:117)
	at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:156)
	at org.apache.spark.serializer.SerializerManager.dataSerializeWithExplicitClassTag(SerializerManager.scala:193)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:610)
	at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:585)
	at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:585)
"shuffle-server-5-3" #173 daemon prio=5 os_prio=0 tid=0x00007f45e53a0000 nid=0x1eee waiting for monitor entry [0x00007f45924d9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at net.jpountz.lz4.LZ4Factory.nativeInstance(LZ4Factory.java:83)
	- waiting to lock <0x00000003c2f8ddf8> (a java.lang.Class for net.jpountz.lz4.LZ4Factory)
	at net.jpountz.lz4.LZ4Factory.fastestInstance(LZ4Factory.java:157)
	at net.jpountz.lz4.LZ4BlockOutputStream.<init>(LZ4BlockOutputStream.java:138)
	at org.apache.spark.io.LZ4CompressionCodec.compressedOutputStream(CompressionCodec.scala:117)
	at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:156)
	at org.apache.spark.serializer.SerializerManager.dataSerializeWithExplicitClassTag(SerializerManager.scala:193)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:610)
	at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:585)
	at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:585)
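
All of the blocked threads are waiting on the same class monitor for net.jpountz.lz4.LZ4Factory, which suggests the contention point is the static synchronized lazy initialization inside the factory. For context, the pattern the stack frames point at looks roughly like the sketch below (method and field names mirror the frames above; the bodies are illustrative, not the verbatim lz4-java 1.5.1 source):

    // Simplified sketch of the lazy-singleton pattern the frames above point at.
    // Method and field names mirror the stack trace; the bodies are illustrative.
    public final class LZ4FactorySketch {

        private static LZ4FactorySketch NATIVE_INSTANCE;
        private static LZ4FactorySketch JAVA_UNSAFE_INSTANCE;

        // Static synchronized method: every caller blocks on the class monitor
        // until initialization (e.g. loading the JNI library) has finished.
        public static synchronized LZ4FactorySketch nativeInstance() {
            if (NATIVE_INSTANCE == null) {
                NATIVE_INSTANCE = new LZ4FactorySketch("JNI"); // may load a native library
            }
            return NATIVE_INSTANCE;
        }

        // fastestInstance() prefers the native implementation, so every new
        // LZ4BlockOutputStream is funneled through the synchronized method above.
        public static LZ4FactorySketch fastestInstance() {
            try {
                return nativeInstance();
            } catch (Throwable t) {
                return unsafeInstance();
            }
        }

        public static synchronized LZ4FactorySketch unsafeInstance() {
            if (JAVA_UNSAFE_INSTANCE == null) {
                JAVA_UNSAFE_INSTANCE = new LZ4FactorySketch("unsafe");
            }
            return JAVA_UNSAFE_INSTANCE;
        }

        private final String implName;

        private LZ4FactorySketch(String implName) {
            this.implName = implName;
        }
    }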

patelh avatar Nov 26 '19 22:11 patelh

For the time being, we've worked around this by caching the LZ4Factory instance via a backport of https://github.com/apache/spark/pull/24905/files to Spark 2.3.2. Roughly, the cached-factory approach looks like the sketch below (the wrapper class and helper method are illustrative and ours, not the exact Spark patch; only the lz4-java calls are real):
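
    import java.io.OutputStream;

    import net.jpountz.lz4.LZ4BlockOutputStream;
    import net.jpountz.lz4.LZ4Factory;

    // Illustrative sketch of the workaround: resolve the factory once and reuse
    // it, so LZ4Factory.fastestInstance() (and its class-level lock) is hit a
    // single time instead of once per compressed stream.
    final class CachedLz4Codec {

        // Resolved once at class-initialization time; all later streams reuse it.
        private static final LZ4Factory FACTORY = LZ4Factory.fastestInstance();

        // Hypothetical helper mirroring what compressedOutputStream() does in our
        // patched codec: build each block stream from the cached factory.
        static OutputStream compressedOutputStream(OutputStream out, int blockSize) {
            return new LZ4BlockOutputStream(out, blockSize, FACTORY.fastCompressor());
        }

        private CachedLz4Codec() {}
    }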

patelh avatar Nov 26 '19 22:11 patelh

I have not yet figured out what is happening. Could you share the full thread dump from the time a deadlock occurred?

odaira avatar Nov 27 '19 22:11 odaira

The threads are waiting on the same monitor. We haven't been able to root-cause this yet. It is possible the JVM was in a bad state due to heap corruption. We've upgraded Java to the latest release to see if we get different behavior.

patelh avatar Feb 27 '20 08:02 patelh