[SPARK-38292][PYTHON] Support na_filter for pyspark.pandas.read_csv
What changes were proposed in this pull request?
An na_filter option is added to pyspark.pandas.read_csv, mirroring the na_filter option in pandas.read_csv: when na_filter=False, empty fields are kept as empty strings instead of being detected as missing values. Example:
data.csv:

```
A,B,C
,val1,val2
val3
```

```python
>>> from pyspark import pandas as ps
>>> import pandas as pd

>>> ps.read_csv("data.csv")
      A     B     C
0  None  val1  val2
1  val3  None  None

>>> pd.read_csv("data.csv")
      A     B     C
0   NaN  val1  val2
1  val3   NaN   NaN

>>> ps.read_csv("data.csv", na_filter=False)
      A     B     C
0        val1  val2
1  val3

>>> pd.read_csv("data.csv", na_filter=False)
      A     B     C
0        val1  val2
1  val3
```
Why are the changes needed?
pandas.read_csv supports the na_filter option, but pyspark.pandas.read_csv does not; this change adds the option for parity.
Does this PR introduce any user-facing change?
Yes. pyspark.pandas.read_csv now accepts an na_filter parameter.
How was this patch tested?
Unit test cases were added.
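For reference, a minimal sketch of the kind of test this covers follows; the test name and the use of pytest's tmp_path fixture are assumptions for illustration, not the actual test in the patch:

```python
import pandas as pd
import pyspark.pandas as ps


def test_read_csv_na_filter(tmp_path):
    # Reproduce the CSV from the description: an empty field and a short row.
    path = str(tmp_path / "data.csv")
    with open(path, "w") as f:
        f.write("A,B,C\n,val1,val2\nval3\n")

    # With na_filter=False, empty fields should stay empty strings,
    # matching the pandas behavior shown in the description.
    expected = pd.read_csv(path, na_filter=False)
    actual = ps.read_csv(path, na_filter=False).to_pandas()
    pd.testing.assert_frame_equal(expected, actual)
```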
Can one of the admins verify this patch?
@HyukjinKwon Please review the PR
@HyukjinKwon Gentle ping
@itholic can you review this one please?
Could you refine the PR description? It seems the formatting is broken somewhere, so it's a bit hard to understand the example and purpose.
And could you also fix the title? I think we can just use the JIRA title, "Support na_filter for pyspark.pandas.read_csv", as is.
@itholic I have made the changes as suggested.
@itholic @HyukjinKwon Gentle ping
Gentle ping, @itholic, please review the PR.
@itholic Please review the PR
Hey, I think the fix here is too hacky. Can we make this work independently of the other options being set?
Hi @HyukjinKwon
Thanks for reviewing. I went through the code again; here is my understanding (the same is noted in the JIRA).
IMHO, setting the nullValue option will not help here. Whatever value we set, the (external) Univocity parser converts an empty field to that value. For example, for the input "A,," with setNullValue("B"), the Univocity parser produces A,B. Spark's UnivocityParser then always converts that value back to null in nullSafeDatum (since datum == options.nullValue), so the output is always A,null, whereas we need A,, (empty strings):
https://github.com/apache/spark/blob/ffa82c219029a7f6f3caf613dd1d0ab56d0c599e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala#L274
So IMHO setting nullValue will not help here, unless options.naFilter is false, which makes sure the condition above is not satisfied.
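A quick way to see this with the plain Spark CSV reader (the file path and the "MISSING" marker below are made up for the demo):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Demo file: the first field of the data row is empty.
with open("/tmp/null_value_demo.csv", "w") as f:
    f.write("A,B,C\n,val1,val2\n")

# Even with a custom nullValue, the empty field still comes back as null:
# the Univocity parser substitutes "MISSING" for the empty field, and
# nullSafeDatum converts it back to null because datum == options.nullValue.
spark.read.option("header", True) \
    .option("nullValue", "MISSING") \
    .csv("/tmp/null_value_demo.csv") \
    .show()
# Expected output, roughly:
# +----+----+----+
# |   A|   B|   C|
# +----+----+----+
# |null|val1|val2|
# +----+----+----+
```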
Now, in the case of missing values at the beginning or end of a line, the current logic in the convert method of UnivocityParser is to go through the exception path:
https://github.com/apache/spark/blob/ffa82c219029a7f6f3caf613dd1d0ab56d0c599e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala#L343
where it fills the row with the default value via row.update(i, requiredSchema.existenceDefaultValues(i)). We do not want those values set to null when options.naFilter is false.
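A small demonstration of that path (file path and contents are again assumed for illustration): a row with fewer fields than the schema goes through the partial-results handling in convert, and the missing trailing columns come back as null.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Demo file: the data row has only one of the three fields.
with open("/tmp/short_row_demo.csv", "w") as f:
    f.write("A,B,C\nval3\n")

# UnivocityParser.convert hits the insufficient-tokens path and fills the
# missing columns from requiredSchema.existenceDefaultValues(i), which is
# null when no default is defined, so B and C come back as null rather
# than as empty strings.
spark.read.option("header", True).csv("/tmp/short_row_demo.csv").show()
# Expected output, roughly:
# +----+----+----+
# |   A|   B|   C|
# +----+----+----+
# |val3|null|null|
# +----+----+----+
```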
Please let me know if you think otherwise, and I'll try to incorporate it.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!