datafusion-comet

Spark executors failing occasionally on SIGSEGV

Open mixermt opened this issue 7 months ago • 6 comments

Hi,

We are experiencing occasional failures of Spark executors:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f079663f84e, pid=18, tid=0x00007f07347ff700
#
# JRE version: OpenJDK Runtime Environment (Zulu 8.74.0.17-CA-linux64) (8.0_392-b08) (build 1.8.0_392-b08)
# Java VM: OpenJDK 64-Bit Server VM (25.392-b08 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x8a584e]  MallocSiteTable::malloc_site(unsigned long, unsigned long)+0xe
#
# Core dump written. Default location: /opt/spark/work-dir/core or core.18
#
# An error report file with more information is saved as:
# /opt/spark/work-dir/hs_err_pid18.log
[thread 139669100558080 also had an error]
[thread 139669096355584 also had an error]
#
# If you would like to submit a bug report, please visit:
#   http://www.azul.com/support/
#

From Spark UI

ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: 
The executor with id 7 exited with exit code -1(unexpected).

The API gave the following container statuses:
	 container name: spark-executor
	 container image: OUR_SPARK_DOCKER_IMAGE 
	 container state: terminated
	 container started at: 2025-05-04T12:16:23Z
	 container finished at: 2025-05-04T12:17:14Z
	 exit code: 134
	 termination reason: Error

At first this looked like an OOM, but when I checked the memory graphs of the pods, none of them reached even half of the requested memory. After a number of retries the job succeeded once the work moved to another executor. The input and shuffle sizes are really small compared to the allocated executor memory and off-heap memory (50g and 30g).


Our env: Spark 3.5.4 - Comet version 0.8.0
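
For context, a minimal sketch of how Comet and the off-heap sizes above are wired up. The keys are the standard Spark/Comet settings; the app name and exact values are illustrative, not our actual job config:

import org.apache.spark.sql.SparkSession

// Illustrative configuration; sizes mirror the 50g / 30g mentioned above.
val spark = SparkSession.builder()
  .appName("comet-job")                                     // placeholder name
  .config("spark.plugins", "org.apache.spark.CometPlugin")  // enables Comet
  .config("spark.comet.enabled", "true")
  .config("spark.comet.exec.enabled", "true")
  .config("spark.executor.memory", "50g")
  .config("spark.memory.offHeap.enabled", "true")           // Comet operators use off-heap memory
  .config("spark.memory.offHeap.size", "30g")
  .getOrCreate()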

Any ideas?

mixermt avatar May 04 '25 13:05 mixermt

Now I see that the failure occurred in a specific job from our regression set. I will try to replay it multiple times to narrow down the issue.

mixermt avatar May 04 '25 13:05 mixermt

Another observation: the failure happens while reading data from an Iceberg table.

Either we have some version mismatch, JAR hell, or some other reason still unknown to me. We are using Iceberg version 1.6.1.

Occasionally I see things like

java.lang.IllegalAccessError: tried to access method Å.<init>()V from class org.apache.iceberg.shaded.org.apache.parquet.bytes.SingleBufferInputStream
	at org.apache.iceberg.shaded.org.apache.parquet.bytes.SingleBufferInputStream.<init>(SingleBufferInputStream.java:37)
	at Å.wrap(ByteBufferInputStream.java:38)
	at org.apache.iceberg.shaded.org.apache.parquet.bytes.BytesInput$ByteBufferBytesInput.toInputStream(BytesInput.java:532)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.decompress(CodecFactory.java:112)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader$1.visit(ColumnChunkPageReadStore.java:139)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader$1.visit(ColumnChunkPageReadStore.java:131)
	at org.apache.iceberg.shaded.org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:120)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader.readPage(ColumnChunkPageReadStore.java:131)
	at org.apache.iceberg.parquet.BaseColumnIterator.advance(BaseColumnIterator.java:59)
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator.access$100(VectorizedColumnIterator.java:35)
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$BatchReader.nextBatch(VectorizedColumnIterator.java:75)
	at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:150)
	at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader$ColumnBatchLoader.readDataToColumnVectors(ColumnarBatchReader.java:123)
	at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader$ColumnBatchLoader.loadDataToColumnBatch(ColumnarBatchReader.java:98)
	at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:72)
	at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:44)
	at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:147)
	at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:138)
	at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:120)
	at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:158)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63)
	at scala.Option.exists(Option.scala:257)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.hashAgg_doAggregateWithKeys_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:179)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)

mixermt avatar May 04 '25 18:05 mixermt

at �.wrap(ByteBufferInputStream.java:38)

Seems like memory got overwritten by some unsafe code which would be consistent with getting a SEGV.

Just to eliminate the possibility, what compression type are you using for the table?

parthchandra avatar May 07 '25 20:05 parthchandra

From Iceberg table properties:

format-version: 2
write.format.default: PARQUET
write.parquet.compression-codec: zstd
write.parquet.compression-level: 1
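
(For reference, the properties above can be dumped with something like the following; the catalog and table names here are placeholders, not our real ones:)

// Hypothetical catalog/table names; prints the Iceberg table properties,
// including the Parquet compression codec and level shown above.
spark.sql("SHOW TBLPROPERTIES my_catalog.db.my_table").show(100, false)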

mixermt avatar May 14 '25 11:05 mixermt

Hmm, I am not aware of any issues related to zstd so that's probably not it. This is clearly an issue but hard to address without being able to reproduce. Would it be possible to provide a minimal repro?

parthchandra avatar May 14 '25 17:05 parthchandra

Hmm, I am not aware of any issues related to zstd so that's probably not it. This is clearly an issue but hard to address without being able to reproduce. Would it be possible to provide a minimal repro?

I'll try to reproduce it on a clean Spark distro reading the same data, without any of our internal adjustments.
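
Roughly along these lines, assuming a local Hadoop-type Iceberg catalog and a throwaway table (all names and the warehouse path are placeholders, not our actual setup):

import org.apache.spark.sql.SparkSession

// Rough repro skeleton: stock Spark 3.5.4 + Comet 0.8.0 + Iceberg 1.6.1,
// write a zstd-compressed Iceberg table and read it back with a scan + aggregation.
val spark = SparkSession.builder()
  .appName("comet-iceberg-repro")
  .config("spark.plugins", "org.apache.spark.CometPlugin")
  .config("spark.sql.extensions",
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.local.type", "hadoop")
  .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse") // placeholder path
  .getOrCreate()

// Table uses the same write properties as the production table.
spark.sql("""
  CREATE TABLE IF NOT EXISTS local.db.repro (id BIGINT, payload STRING)
  USING iceberg
  TBLPROPERTIES (
    'write.parquet.compression-codec' = 'zstd',
    'write.parquet.compression-level' = '1')
""")

// Write some synthetic rows, then force a full read, similar to the failing stage.
spark.range(10000000)
  .selectExpr("id", "uuid() AS payload")
  .writeTo("local.db.repro")
  .append()

spark.sql("SELECT payload, count(*) FROM local.db.repro GROUP BY payload").count()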

mixermt avatar May 14 '25 20:05 mixermt

Iceberg 1.6.1 is known to cause segfault on Spark 3.5.4. See https://github.com/apache/iceberg/pull/11731 and https://github.com/apache/iceberg/issues/12178.

You can try bumping the Iceberg version to 1.7.2 and see if the problem still persists.
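
For example, assuming an sbt build with Scala 2.12, the dependency bump would look roughly like this (the artifact name embeds the Spark minor version and Scala binary version):

// build.sbt fragment (illustrative): swap the Iceberg 1.6.1 runtime for 1.7.2.
libraryDependencies += "org.apache.iceberg" % "iceberg-spark-runtime-3.5_2.12" % "1.7.2"

If the jar is supplied at submit time instead, the same coordinates can be passed via --packages.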

Kontinuation avatar May 22 '25 02:05 Kontinuation

@huaxingao, fyi.

parthchandra avatar May 23 '25 16:05 parthchandra

It seems like this is a known issue in Iceberg 1.6.1 and not necessarily a Comet issue, so I will close this.

andygrove avatar Aug 21 '25 20:08 andygrove