
[BUG] spark.rapids.sql.exec.CollectLimitExec=true can mess up the CSV header row

Open viadea opened this issue 3 years ago • 0 comments

If we enable spark.rapids.sql.exec.CollectLimitExec=true on a 2-node cluster, the header row of a CSV file read with header=true may come out wrong.

For example, take this CSV file:

wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_train.csv

The file looks like this:

category,description
Business," Short sellers, Wall Street's dwindling band of ultra cynics, are seeing green again."
Business," Private investment firm Carlyle Group, which has a reputation for making well timed and occasionally controversial plays in the defense industry, has quietly placed its bets on another part of the market."
Business, Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums.

I can reproduce the issue on both Databricks and Dataproc. Here is a minimal repro on Dataproc:

  1. After a 2-node Dataproc cluster is ready, SSH to the master node and put the file on HDFS:
gcloud compute ssh $CLUSTER_NAME-w-0 --project=rapids-spark --zone=$ZONE
wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_train.csv
hadoop fs -put news_category_train.csv /tmp/
  2. In spark-shell:
spark.conf.set("spark.rapids.sql.exec.CollectLimitExec", true)
spark.read.option("header", true).csv("/tmp/news_category_train.csv").show(5, truncate=50)
  3. Run the above command a few times. Sometimes it shows a result like this, with a data row in place of the column names:
scala> spark.read.option("header", true).csv("/tmp/news_category_train.csv").show(5, truncate=50)
+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Sci/Tech|Scot Wingo, author of eBay Strategies: 10 Proven Methods to Maximize Your eBay Business, will answer reader questions about the online marketplace. Wingo is president and chief executive of ChannelAdvisor, an eBay consignment franchise.|
+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Business|                                                                                                                                                                                           Short sellers, Wall Street's dwindling band of...|
|Business|                                                                                                                                                                                           Private investment firm Carlyle Group, which h...|

Sometimes it shows the correct result:

scala> spark.read.option("header", true).csv("/tmp/news_category_train.csv").show(5, truncate=50)
+--------+--------------------------------------------------+
|category|                                       description|
+--------+--------------------------------------------------+
|Business| Short sellers, Wall Street's dwindling band of...|
|Business| Private investment firm Carlyle Group, which h...|
|Business| Soaring crude prices plus worries about the ec...|
|Business| Authorities have halted oil export flows from ...|
|Business| Tearaway world oil prices, toppling records an...|
+--------+--------------------------------------------------+
only showing top 5 rows
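
Since the wrong header only appears on some runs, a small loop may make the repro less manual. The following is a rough sketch, not something taken from the repro above: `countBadHeaders` is a hypothetical helper name, and it assumes the wrong header is also visible in the DataFrame's column names (which is what `show` prints as the header row). The reader is passed in as a function so the counting logic does not depend on Spark.

```scala
// Hypothetical helper (not part of spark-rapids or Spark):
// calls `readColumns` `attempts` times and counts how many times
// the returned column names differ from the expected header.
def countBadHeaders(attempts: Int, expected: Seq[String])(readColumns: () => Seq[String]): Int =
  (1 to attempts).count(_ => readColumns() != expected)

// Intended use in spark-shell, where `spark` is already defined
// (path and column names come from the repro steps):
//
//   spark.conf.set("spark.rapids.sql.exec.CollectLimitExec", true)
//   val bad = countBadHeaders(20, Seq("category", "description")) { () =>
//     spark.read.option("header", true).csv("/tmp/news_category_train.csv").columns.toSeq
//   }
//   println(s"$bad of 20 reads returned a wrong header")
```

A non-zero count would confirm the intermittent behavior without having to inspect each `show` output by eye. The attempt count of 20 is arbitrary.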

Env: I can reproduce this with the latest 22.10 snapshot and also with the 22.06 GA jar.

viadea · Oct 14 '22 23:10