[BUG] Special characters in CSV input produce inconsistent results between the CPU and GPU engines
Describe the bug Reading a CSV field that contains quote characters produces different results depending on whether the CPU or GPU engine is used.
Steps/Code to reproduce bug

```python
import os
os.environ['JAVA_HOME'] = "/usr/lib/jvm/java-17-openjdk-amd64"
os.environ['SPARK_HOME'] = "./spark-4.0.1-bin-hadoop3"
os.environ['PYSPARK_SUBMIT_ARGS'] = "--jars ./rapids-4-spark_2.13-25.10.0.jar,./cudf-25.10.0-cuda12.jar --master local[*] pyspark-shell"

import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('SparkRAPIDS')
         .config('spark.plugins', 'com.nvidia.spark.SQLPlugin')
         .config('spark.executor.memory', '8g')
         .config('spark.driver.memory', '8g')
         .getOrCreate())
spark.sparkContext.addPyFile('./rapids-4-spark_2.13-25.10.0.jar')
spark.sparkContext.addPyFile('./cudf-25.10.0-cuda12.jar')

# RAPIDS settings: allow incompatible ops and enable CSV reads on the GPU.
spark.conf.set('spark.rapids.sql.incompatibleOps.enabled', 'true')
spark.conf.set('spark.rapids.sql.format.csv.read.enabled', 'true')
spark.conf.set('spark.rapids.sql.format.csv.enabled', 'true')
spark.conf.set('spark.executor.resource.gpu.amount', '1')
spark.conf.set('spark.task.resource.gpu.amount', '1')
spark.conf.set('spark.rapids.sql.concurrentGpuTasks', '1')
spark.conf.set('spark.rapids.sql.exec.CollectLimitExec', 'true')

# Read the file with the GPU plugin disabled so the first queries run on the CPU.
spark.conf.set('spark.rapids.sql.enabled', 'false')
df = spark.read.csv("t1.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("t1")

sql_cpu_result = spark.sql("SELECT * FROM t1;")
print("SQL CPU:")
sql_cpu_result.show(truncate=False)

spark.conf.set('spark.rapids.sql.enabled', 'true')
sql_gpu_result = spark.sql("SELECT * FROM t1;")
print("SQL GPU:")
sql_gpu_result.show(truncate=False)

spark.conf.set('spark.rapids.sql.enabled', 'false')
print("API CPU:")
api_cpu_result = spark.table("t1")
api_cpu_result.show(truncate=False)

spark.conf.set('spark.rapids.sql.enabled', 'true')
print("API GPU:")
api_gpu_result = spark.table("t1")
api_gpu_result.show(truncate=False)
```
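For convenience, the one-row input file can be generated with a short helper script (the repro above assumes `t1.csv` already exists in the working directory):

```python
# Write the exact t1.csv used in the repro: a header line plus one row
# whose single field is the 11-character literal  ">{FP""`5;"
with open("t1.csv", "w", newline="") as f:
    f.write('c0\n')
    f.write('">{FP""`5;"\n')
```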
t1.csv:

```
c0
">{FP""`5;"
```
Result:

```
SQL CPU:
+-----------+
|c0         |
+-----------+
|">{FP""`5;"|
+-----------+

SQL GPU:
+--------+
|c0      |
+--------+
|>{FP"`5;|
+--------+

API CPU:
+-----------+
|c0         |
+-----------+
|">{FP""`5;"|
+-----------+

API GPU:
+--------+
|c0      |
+--------+
|>{FP"`5;|
+--------+
```
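For comparison, Python's built-in `csv` module follows RFC 4180 quoting: the outer quotes are stripped and a doubled quote (`""`) inside a quoted field decodes to a single literal quote. Parsing the raw field that way yields the same 8-character string the GPU engine returns, while the CPU engine keeps the field verbatim:

```python
import csv
import io

# The raw field from t1.csv, exactly as it appears on disk.
raw_line = '">{FP""`5;"\n'

# RFC 4180 parsing: strip the outer quotes and decode each doubled
# quote ("") inside the field to one literal quote character.
row = next(csv.reader(io.StringIO(raw_line)))
print(row[0])  # >{FP"`5;  -- matches the GPU engine's output above
```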
Expected behavior The CPU and GPU engines return the same result for this input.
Environment details
- Environment location: Standalone
- Component versions related to the issue: OpenJDK 17 (java-17-openjdk-amd64), Spark 4.0.1 (spark-4.0.1-bin-hadoop3), rapids-4-spark_2.13-25.10.0, cudf-25.10.0-cuda12
Hi @asddfl (@asdsql?), cuDF and Apache Spark handle quotes in CSV files differently, which is what you have identified here. We are working to make the RAPIDS Spark plugin match Apache Spark as closely as possible. How critical is a fix for your use case?
Similar to cuDF issue https://github.com/rapidsai/cudf/issues/20812, but the case in this issue is not covered by that linked cuDF issue.