
[BUG] Special-character input from a CSV file produces inconsistent results between the CPU and GPU engines

Open · asddfl opened this issue 2 months ago

Describe the bug
Special-character input from a CSV file (a field containing embedded double quotes) is parsed inconsistently by the CPU and GPU engines.

Steps/Code to reproduce bug

import os
os.environ['JAVA_HOME'] = "/usr/lib/jvm/java-17-openjdk-amd64"
os.environ['SPARK_HOME'] = "./spark-4.0.1-bin-hadoop3"
os.environ['PYSPARK_SUBMIT_ARGS'] = "--jars ./rapids-4-spark_2.13-25.10.0.jar,./cudf-25.10.0-cuda12.jar --master local[*] pyspark-shell"

import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName('SparkRAPIDS')
    .config('spark.plugins', 'com.nvidia.spark.SQLPlugin')
    .config("spark.executor.memory", "8g")
    .config("spark.driver.memory", "8g")
    .getOrCreate()
)
spark.sparkContext.addPyFile('./rapids-4-spark_2.13-25.10.0.jar')
spark.sparkContext.addPyFile('./cudf-25.10.0-cuda12.jar')
spark.conf.set('spark.rapids.sql.incompatibleOps.enabled', 'true')
spark.conf.set('spark.rapids.sql.format.csv.read.enabled', 'true')
spark.conf.set('spark.rapids.sql.format.csv.enabled', 'true')
spark.conf.set("spark.executor.resource.gpu.amount", "1")
spark.conf.set("spark.task.resource.gpu.amount", "1")
spark.conf.set("spark.rapids.sql.concurrentGpuTasks", "1")
spark.conf.set("spark.rapids.sql.exec.CollectLimitExec", "true")
spark.conf.set('spark.rapids.sql.enabled', 'false')

df = spark.read.csv("t1.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("t1")

sql_cpu_result = spark.sql("SELECT * FROM t1;")
print("SQL CPU:")
sql_cpu_result.show(truncate=False)

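# Re-enable the RAPIDS plugin so the same query runs on the GPU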
spark.conf.set('spark.rapids.sql.enabled', 'true')
sql_gpu_result = spark.sql("SELECT * FROM t1;")
print("SQL GPU:")
sql_gpu_result.show(truncate=False)

spark.conf.set('spark.rapids.sql.enabled', 'false')
print("API CPU:")
api_cpu_result = spark.table("t1")
api_cpu_result.show(truncate=False)

spark.conf.set('spark.rapids.sql.enabled', 'true')
print("API GPU:")
api_gpu_result = spark.table("t1")
api_gpu_result.show(truncate=False)

t1.csv:

c0
">{FP""`5;"

Result:

SQL CPU:
+-----------+
|c0         |
+-----------+
|">{FP""`5;"|
+-----------+

SQL GPU:
+--------+
|c0      |
+--------+
|>{FP"`5;|
+--------+

API CPU:
+-----------+
|c0         |
+-----------+
|">{FP""`5;"|
+-----------+

API GPU:
+--------+
|c0      |
+--------+
|>{FP"`5;|
+--------+

Expected behavior
The result should be the same whether the query runs on the CPU or the GPU engine.

Environment details (please complete the following information)

  • Environment location: Standalone
  • Spark configuration settings related to the issue: java-17-openjdk-amd64, spark-4.0.1-bin-hadoop3, rapids-4-spark_2.13-25.10.0, cudf-25.10.0-cuda12


asddfl · Nov 06 '25 12:11

Hi @asddfl ( @asdsql ? ), cudf and Spark handle quotes in CSV files differently, which is what you identified. We are working to ensure the RAPIDS Spark plugin matches Apache Spark as closely as possible. How critical of a fix is this for your use case?

sameerz · Nov 18 '25 21:11
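As a rough interim workaround (a sketch, not an official recommendation, assuming the divergence comes only from the GPU CSV reader): the repro above already uses spark.rapids.sql.format.csv.read.enabled, and setting it to false keeps the plugin enabled while forcing CSV scans back onto Apache Spark's CPU parser, so both code paths should then return the CPU result.

# Sketch: keep the RAPIDS plugin on, but fall back to the CPU CSV parser for reads
spark.conf.set('spark.rapids.sql.enabled', 'true')
spark.conf.set('spark.rapids.sql.format.csv.read.enabled', 'false')
df_fallback = spark.read.csv("t1.csv", header=True, inferSchema=True)  # scan falls back to the CPU
df_fallback.show(truncate=False)  # expected to match the CPU output above

Whether this is acceptable in practice depends on how much of the workload's time is spent in the CSV scan itself.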

This is similar to cuDF issue https://github.com/rapidsai/cudf/issues/20812, but the case reported here is not covered by that linked cuDF issue.

res-life · Dec 18 '25 07:12