[BUG] 25.08.0 com.nvidia.spark.rapids.RapidsHostColumnVector cannot be cast to com.nvidia.spark.rapids.GpuColumnVector
Describe the bug
A customer is seeing the following exception:
Previous exception in task: com.nvidia.spark.rapids.RapidsHostColumnVector cannot be cast to com.nvidia.spark.rapids.GpuColumnVector
    at com.nvidia.spark.rapids.GpuColumnVector.getTotalDeviceMemoryUsed(GpuColumnVector.java:1115)
    at com.nvidia.spark.rapids.spill.SpillableColumnarBatchHandle$.apply(SpillFramework.scala:1835)
    at com.nvidia.spark.rapids.SpillableColumnarBatch$.apply(SpillableColumnarBatch.scala:372)
    at com.nvidia.spark.rapids.GpuSortEachBatchIterator.$anonfun$next$1(GpuSortExec.scala:183)
    at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:98)
    at com.nvidia.spark.rapids.GpuSortEachBatchIterator.next(GpuSortExec.scala:182)
    at com.nvidia.spark.rapids.GpuSortEachBatchIterator.next(GpuSortExec.scala:168)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:392)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:414)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:179)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
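For context, here is a minimal Scala sketch of the failing pattern (illustrative only, not the plugin's actual code): a helper that sums device memory by casting every column of a batch to GpuColumnVector will throw exactly this ClassCastException when the batch is host-backed, e.g. when the columns come back from the table cache as RapidsHostColumnVector.

import org.apache.spark.sql.vectorized.ColumnarBatch
import com.nvidia.spark.rapids.GpuColumnVector

// Illustrative sketch: sums device memory assuming every column is GPU-backed.
def totalDeviceMemoryUsed(batch: ColumnarBatch): Long = {
  var total = 0L
  for (i <- 0 until batch.numCols()) {
    // This cast is the step that fails when the cache hands back a
    // RapidsHostColumnVector instead of a device-backed GpuColumnVector.
    val gpuCol = batch.column(i).asInstanceOf[GpuColumnVector]
    total += gpuCol.getBase.getDeviceMemorySize
  }
  total
}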
Also seeing a lot of:
java.lang.IllegalStateException: Close called too many times HostColumnVector{rows=2, type=STRING, nullCount=Optional[0], offHeap=(ID: 58795)}
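That error typically points at a reference-counting bug: cudf host columns are reference counted, and closing one more times than it was retained throws this exception. A minimal sketch that reproduces the message in isolation (illustrative only; the values mirror the rows=2 STRING column in the log):

import ai.rapids.cudf.HostColumnVector

// Host columns are reference counted; the second close() below throws
// java.lang.IllegalStateException: Close called too many times
val hcv = HostColumnVector.fromStrings("a", "b")
hcv.close() // refcount drops to zero, memory is freed
hcv.close() // throws IllegalStateException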
Steps/Code to reproduce bug
See the local repro under "Additional context" below.
Expected behavior
The Delta write completes without the ClassCastException; batches read back from the table cache should be usable by downstream GPU operators.
Environment details
- Environment location: Kubernetes
- Spark version: 3.5.3
- Spark RAPIDS plugin version: 25.08.0
Additional context
This is occurring in a stage that does: GPU Parquet scan -> InMemoryTableScan -> AdaptiveSparkPlan -> GpuRapidsDeltaWrite -> GpuColumnarToRow.
== Physical Plan ==
GpuColumnarToRow (7)
+- GpuRapidsDeltaWrite (6)
   +- AdaptiveSparkPlan (5)
      +- == Final Plan ==
         TableCacheQueryStage (4), Statistics(sizeInBytes=2046.0 B, rowCount=13)
         +- InMemoryTableScan (1)
            +- InMemoryRelation (2)
               +- GpuScan parquet (3)
      +- == Initial Plan ==
         InMemoryTableScan (1)
         +- InMemoryRelation (2)
            +- GpuScan parquet (3)
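The plan shows the cached InMemoryTableScan feeding GPU operators, which is consistent with host-backed batches from the cache reaching code that expects device-backed columns. As a hedged diagnostic sketch (not part of the plugin), one way to log which backing a batch has before it enters a GPU operator:

import org.apache.spark.sql.vectorized.ColumnarBatch
import com.nvidia.spark.rapids.{GpuColumnVector, RapidsHostColumnVector}

// Reports whether each column of a batch is device- or host-backed.
def describeBatch(batch: ColumnarBatch): String =
  (0 until batch.numCols()).map { i =>
    batch.column(i) match {
      case _: GpuColumnVector        => s"col$i=device"
      case _: RapidsHostColumnVector => s"col$i=host"
      case other                     => s"col$i=${other.getClass.getSimpleName}"
    }
  }.mkString(", ")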
Might be related to https://github.com/NVIDIA/spark-rapids/pull/13434
Here are the steps to repro this locally (run in a spark-shell with the RAPIDS plugin and Delta Lake enabled):

// Write a small Parquet file, read it back, cache it, then write to Delta.
val data = Range(1, 1000).toDF()
data.write.parquet("/tmp/data.parquet")
val df = spark.read.parquet("/tmp/data.parquet")
df.cache.count // materializes the cache (InMemoryRelation)
df.write.format("delta").mode("overwrite").save("/tmp/delta_data") // fails with the ClassCastException