[SPARK-38292][PYTHON] Support na_filter for pyspark.pandas.read_csv
What changes were proposed in this pull request?
An na_filter option is added to pyspark.pandas.read_csv, mirroring the na_filter option in pandas.read_csv: when na_filter=False, empty fields are kept as empty strings instead of being detected as missing values. Example:
data.csv:

```
A,B,C
,val1,val2
val3
```

```python
>>> from pyspark import pandas as ps
>>> import pandas as pd

>>> ps.read_csv("data.csv")
      A     B     C
0  None  val1  val2
1  val3  None  None

>>> pd.read_csv("data.csv")
      A     B     C
0   NaN  val1  val2
1  val3   NaN   NaN

>>> ps.read_csv("data.csv", na_filter=False)
      A     B     C
0        val1  val2
1  val3

>>> pd.read_csv("data.csv", na_filter=False)
      A     B     C
0        val1  val2
1  val3
```
Why are the changes needed?
pandas.read_csv supports the na_filter option, but pyspark.pandas.read_csv does not; this change adds the option for parity.
Does this PR introduce any user-facing change?
Yes. pyspark.pandas.read_csv now accepts an na_filter parameter.
How was this patch tested?
Unit test cases were added.
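For reference, a minimal sketch of the kind of test this covers follows; the test name and the use of pytest's tmp_path fixture are assumptions for illustration, not the actual test in the patch:

```python
import pandas as pd
import pyspark.pandas as ps


def test_read_csv_na_filter(tmp_path):
    # Reproduce the CSV from the description: an empty field and a short row.
    path = str(tmp_path / "data.csv")
    with open(path, "w") as f:
        f.write("A,B,C\n,val1,val2\nval3\n")

    # With na_filter=False, empty fields should stay empty strings,
    # matching the pandas behavior shown in the description.
    expected = pd.read_csv(path, na_filter=False)
    actual = ps.read_csv(path, na_filter=False).to_pandas()
    pd.testing.assert_frame_equal(expected, actual)
```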
Can one of the admins verify this patch?
@HyukjinKwon Please review the PR
@HyukjinKwon Gentle ping
@itholic can you review this one please?
Could you refine the PR description? It seems the formatting is broken somewhere, so it's a bit hard to understand the example and purpose.
And could you also fix the title? I think we can just use the JIRA title, "Support na_filter for pyspark.pandas.read_csv", as is.
@itholic I have made the changes as suggested.
@itholic @HyukjinKwon Gentle ping
Gentle ping, @itholic, please review the PR.
@itholic Please review the PR
Hey, I think the fix here is too hacky. Can we make this work independently of the other options being set?
Hi @HyukjinKwon
Thanks for reviewing. I went through the code again; here is my understanding (the same is noted in the JIRA).
IMHO, setting the nullValue option will not help here. Whatever value we set, the (external) Univocity parser converts an empty field to that value. For example, for the input "A,," with setNullValue("B"), the Univocity parser produces A,B. Spark's UnivocityParser then always converts that value back to null in nullSafeDatum (since datum == options.nullValue), so the output is always A,null, whereas we need A,, (empty strings):
https://github.com/apache/spark/blob/ffa82c219029a7f6f3caf613dd1d0ab56d0c599e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala#L274
So IMHO setting nullValue will not help here, unless options.naFilter is false, which makes sure the condition above is not satisfied.
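A quick way to see this with the plain Spark CSV reader (the file path and the "MISSING" marker below are made up for the demo):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Demo file: the first field of the data row is empty.
with open("/tmp/null_value_demo.csv", "w") as f:
    f.write("A,B,C\n,val1,val2\n")

# Even with a custom nullValue, the empty field still comes back as null:
# the Univocity parser substitutes "MISSING" for the empty field, and
# nullSafeDatum converts it back to null because datum == options.nullValue.
spark.read.option("header", True) \
    .option("nullValue", "MISSING") \
    .csv("/tmp/null_value_demo.csv") \
    .show()
# Expected output, roughly:
# +----+----+----+
# |   A|   B|   C|
# +----+----+----+
# |null|val1|val2|
# +----+----+----+
```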
Now, in the case of missing values at the beginning or end of a line, the current logic in the convert method of UnivocityParser is to go through the exception path:
https://github.com/apache/spark/blob/ffa82c219029a7f6f3caf613dd1d0ab56d0c599e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala#L343
where it fills the row with the default value via row.update(i, requiredSchema.existenceDefaultValues(i)). We do not want those values set to null when options.naFilter is false.
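A small demonstration of that path (file path and contents are again assumed for illustration): a row with fewer fields than the schema goes through the partial-results handling in convert, and the missing trailing columns come back as null.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Demo file: the data row has only one of the three fields.
with open("/tmp/short_row_demo.csv", "w") as f:
    f.write("A,B,C\nval3\n")

# UnivocityParser.convert hits the insufficient-tokens path and fills the
# missing columns from requiredSchema.existenceDefaultValues(i), which is
# null when no default is defined, so B and C come back as null rather
# than as empty strings.
spark.read.option("header", True).csv("/tmp/short_row_demo.csv").show()
# Expected output, roughly:
# +----+----+----+
# |   A|   B|   C|
# +----+----+----+
# |val3|null|null|
# +----+----+----+
```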
Please let me know if you think otherwise, and I'll try to incorporate it.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!