
[BUG] 25.08.0 com.nvidia.spark.rapids.RapidsHostColumnVector cannot be cast to com.nvidia.spark.rapids.GpuColumnVector

Open tgravescs opened this issue 1 month ago • 4 comments

Describe the bug

Customer is seeing the following exception:

2025-11-25T04:56:17.420919734Z stderr F Previous exception in task: com.nvidia.spark.rapids.RapidsHostColumnVector cannot be cast to com.nvidia.spark.rapids.GpuColumnVector
2025-11-25T04:56:17.420921825Z stderr F         com.nvidia.spark.rapids.GpuColumnVector.getTotalDeviceMemoryUsed(GpuColumnVector.java:1115)
2025-11-25T04:56:17.420923882Z stderr F         com.nvidia.spark.rapids.spill.SpillableColumnarBatchHandle$.apply(SpillFramework.scala:1835)
2025-11-25T04:56:17.42092599Z stderr F  com.nvidia.spark.rapids.SpillableColumnarBatch$.apply(SpillableColumnarBatch.scala:372)
2025-11-25T04:56:17.420928009Z stderr F         com.nvidia.spark.rapids.GpuSortEachBatchIterator.$anonfun$next$1(GpuSortExec.scala:183)
2025-11-25T04:56:17.420930048Z stderr F         com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:98)
2025-11-25T04:56:17.420932144Z stderr F         com.nvidia.spark.rapids.GpuSortEachBatchIterator.next(GpuSortExec.scala:182)
2025-11-25T04:56:17.4209341Z stderr F   com.nvidia.spark.rapids.GpuSortEachBatchIterator.next(GpuSortExec.scala:168)
2025-11-25T04:56:17.42093613Z stderr F  org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:392)
2025-11-25T04:56:17.420938193Z stderr F         org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:414)
2025-11-25T04:56:17.420940235Z stderr F         org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:179)
2025-11-25T04:56:17.420942294Z stderr F         org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
2025-11-25T04:56:17.420944249Z stderr F         org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
2025-11-25T04:56:17.420946262Z stderr F         org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
2025-11-25T04:56:17.420948338Z stderr F         org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
2025-11-25T04:56:17.420950388Z stderr F         org.apache.spark.scheduler.Task.run(Task.scala:141)
2025-11-25T04:56:17.420952715Z stderr F         org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
2025-11-25T04:56:17.420954728Z stderr F         org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
2025-11-25T04:56:17.42095679Z stderr F  org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
2025-11-25T04:56:17.42095887Z stderr F  org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
2025-11-25T04:56:17.420970522Z stderr F         org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
2025-11-25T04:56:17.420972734Z stderr F         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2025-11-25T04:56:17.420975621Z stderr F         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

Also seeing a lot of:

2025-11-25T04:56:17.420589836Z stderr F java.lang.IllegalStateException: Close called too many times HostColumnVector{rows=2, type=STRING, nullCount=Optional[0], offHeap=(ID: 58795)}
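
For reference on the cast itself: the stack trace shows the spill path (SpillableColumnarBatch -> SpillFramework) calling GpuColumnVector.getTotalDeviceMemoryUsed, which evidently assumes every column of the incoming ColumnarBatch is a GpuColumnVector, while the batch here carries host-side RapidsHostColumnVector columns instead. A minimal hypothetical sketch of the kind of guard that would surface this earlier with a clearer message; the helper names below are mine, not plugin API:

import org.apache.spark.sql.vectorized.ColumnarBatch
import com.nvidia.spark.rapids.{GpuColumnVector, RapidsHostColumnVector}

object BatchChecks {
  // Hypothetical helper: true only if every column in the batch is device-backed.
  def isDeviceBacked(batch: ColumnarBatch): Boolean =
    (0 until batch.numCols()).forall(i => batch.column(i).isInstanceOf[GpuColumnVector])

  // Fail fast with a descriptive error instead of a ClassCastException deep
  // inside the spill framework.
  def requireDeviceBacked(batch: ColumnarBatch): Unit =
    (0 until batch.numCols()).foreach { i =>
      batch.column(i) match {
        case _: GpuColumnVector => // ok, lives on the GPU
        case _: RapidsHostColumnVector =>
          throw new IllegalStateException(
            s"column $i is host-backed; expected GpuColumnVector before spill accounting")
        case other =>
          throw new IllegalStateException(s"unexpected column type: ${other.getClass}")
      }
    }
}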

Steps/Code to reproduce bug: see the local repro posted in the comments below.

Expected behavior: the query should run without the ClassCastException.

Environment details (please complete the following information)

  • Environment location: Kubernetes
  • Spark configuration settings related to the issue: Spark 3.5.3 with the RAPIDS Accelerator for Apache Spark (spark-rapids) 25.08.0

Additional context

tgravescs, Nov 25 '25 15:11

This is occurring in a stage that does: GPU Parquet scan -> InMemoryTableScan -> AdaptiveSparkPlan -> GpuRapidsDeltaWrite -> GpuColumnarToRow

tgravescs, Nov 25 '25 15:11

== Physical Plan ==
GpuColumnarToRow (7)
+- GpuRapidsDeltaWrite (6)
   +- AdaptiveSparkPlan (5)
      +- == Final Plan ==
         TableCacheQueryStage (4), Statistics(sizeInBytes=2046.0 B, rowCount=13)
         +- InMemoryTableScan (1)
               +- InMemoryRelation (2)
                     +- GpuScan parquet  (3)
      +- == Initial Plan ==
         InMemoryTableScan (1)
            +- InMemoryRelation (2)
                  +- GpuScan parquet  (3)

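For anyone wanting to confirm the plan shape locally, the cached scan can be inspected with stock Spark API before the Delta write is attempted; explain("formatted") is plain Spark, nothing here is plugin-specific, and the path just matches the repro posted below:

val df = spark.read.parquet("/tmp/data.parquet")
df.cache()
df.count()                // materializes the cache (InMemoryRelation)
df.explain("formatted")   // should show InMemoryTableScan over GpuScan parquet when the plugin is active
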
tgravescs, Nov 25 '25 15:11

Might be related to https://github.com/NVIDIA/spark-rapids/pull/13434

razajafri, Nov 25 '25 19:11

Here are the steps to repro this locally:

// Run in spark-shell with the RAPIDS plugin enabled; toDF() needs spark.implicits._
import spark.implicits._
val data = Range(1, 1000).toDF()
data.write.parquet("/tmp/data.parquet")
// Read back, cache, and materialize the cache (the InMemoryRelation in the plan above)
val df = spark.read.parquet("/tmp/data.parquet")
df.cache.count
// Writing the cached DataFrame out as a Delta table hits the cast exception
df.write.format("delta").mode("overwrite").save("/tmp/delta_data")
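
For completeness, a sketch of the session setup I'd assume is needed to hit both the GPU path and the Delta write when running this as a standalone app; in spark-shell these become the equivalent --conf / --jars / --packages flags at launch, and the master and the implied jar/package versions are assumptions, not taken from the customer environment:

import org.apache.spark.sql.SparkSession

// Hypothetical session config: RAPIDS Accelerator plugin plus Delta Lake support.
// Assumes rapids-4-spark_2.12-25.08.0 and a Delta release matching Spark 3.5
// are on the driver/executor classpath.
val spark = SparkSession.builder()
  .appName("rapids-delta-cache-repro")
  .master("local[*]")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.enabled", "true")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()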

razajafri, Dec 05 '25 16:12