cudf [FEA] JSON reader: Provide option to treat quoted strings as null values

Is your feature request related to a problem? Please describe.

This is part of https://github.com/NVIDIA/spark-rapids/issues/9

In order to be consistent with Spark when reading JSON on the GPU, we would like to ask cuDF to read non-string primitive values as strings and then cast them to the required type. This approach already works well for valid inputs but we do not have a way to treat quoted strings as null to match Spark's behavior.

Here is an example JSON input to demonstrate the problem.

{ "number": true }
{ "number": "true" }

The first entry is a valid JSON boolean value and the second entry is a JSON string. If we ask cuDF to read this attribute as a string then we get the same value in both cases. Spark would treat the second entry as invalid and return null.

Describe the solution you'd like

There are a few possible approaches to this:

Have a way to read the raw value without any parsing, so the resulting column will include the quotes
Add the ability to ask cuDF to read non-string values as strings but to interpret any values that are JSON strings (quoted) as null
Return all the data as strings but also include a bitmask indicating which ones were quoted.
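
To make the three options concrete, here is a minimal sketch in plain Python. It uses the standard `json` module as a stand-in for the reader, not cuDF itself; the function names `raw_values`, `quoted_as_null`, and `values_with_mask` are hypothetical and only illustrate what each option might return for the example input.

```python
import json

# The two example records from the issue description.
LINES = ['{ "number": true }', '{ "number": "true" }']


def raw_values(lines, key):
    """Option 1: return the raw token text, so JSON strings keep their quotes."""
    # json.dumps re-serializes the parsed value, which approximates the raw token.
    return [json.dumps(json.loads(line)[key]) for line in lines]


def quoted_as_null(lines, key):
    """Option 2: non-string values become strings; JSON strings become None."""
    out = []
    for line in lines:
        value = json.loads(line)[key]
        out.append(None if isinstance(value, str) else json.dumps(value))
    return out


def values_with_mask(lines, key):
    """Option 3: return unquoted text plus a mask marking which values were quoted."""
    values, was_quoted = [], []
    for line in lines:
        value = json.loads(line)[key]
        is_string = isinstance(value, str)
        values.append(value if is_string else json.dumps(value))
        was_quoted.append(is_string)
    return values, was_quoted


print(raw_values(LINES, "number"))        # ['true', '"true"']
print(quoted_as_null(LINES, "number"))    # ['true', None]
print(values_with_mask(LINES, "number"))  # (['true', 'true'], [False, True])
```

Option 2 pushes the null decision into the reader, while options 1 and 3 leave it to the caller.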

Describe alternatives you've considered

None

Additional context

None

Feb 14 '22 18:02 andygrove

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Mar 20 '22 19:03 github-actions[bot]

Moved this to P1 as it is a corner case that is not that common. It would be very nice to be able to support this sooner rather than later.

May 16 '22 21:05 revans2

> Have a way to read the raw value without any parsing, so the resulting column will include the quotes

In this case, is it okay to keep the quotes on the values in the actual string columns as well?

Jun 07 '22 19:06 vuule

> > Have a way to read the raw value without any parsing, so the resulting column will include the quotes
>
> In this case, is it okay to keep the quotes on the values in the actual string columns as well?

If we keep the quotes then we will have to perform an additional transformation in the plugin to remove them so this doesn't seem ideal.

If we can get the raw string value (without quotes) and an indication of whether the value was quoted or not then I think we have everything we need.
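
As a rough illustration of why those two pieces would be enough, the sketch below casts a column of unquoted strings to booleans and nulls out every entry that was quoted in the source. The names `parse_bool` and `cast_with_quoted_mask` are hypothetical, and plain Python lists stand in for GPU columns.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def parse_bool(text: Optional[str]) -> Optional[bool]:
    """Parse a JSON boolean literal; anything else is treated as invalid (None)."""
    return {"true": True, "false": False}.get(text)


def cast_with_quoted_mask(
    values: List[Optional[str]],
    was_quoted: List[bool],
    parse: Callable[[Optional[str]], Optional[T]],
) -> List[Optional[T]]:
    """Cast each value with `parse`, returning None wherever the source was quoted."""
    return [None if quoted else parse(value) for value, quoted in zip(values, was_quoted)]


values = ["true", "true", "false"]
was_quoted = [False, True, False]
result = cast_with_quoted_mask(values, was_quoted, parse_bool)
print(result)  # [True, None, False]
```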

Jun 08 '22 14:06 andygrove

Once #11574 is merged, the new nested JSON reader (currently available as experimental) will introduce a keep_quotes option that retains the quote characters on string values. Would this sufficiently address this feature request?
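
For illustration only, here is a sketch of how a caller might post-process a column whose string values still carry their quotes. The input list is an assumed emulation of such a column, not verified cuDF output, and `split_quotes` is a hypothetical helper. It also assumes each retained token is a valid JSON string literal.

```python
import json
from typing import List, Optional, Tuple


def split_quotes(
    column: List[Optional[str]],
) -> Tuple[List[Optional[str]], List[bool]]:
    """Separate a quote-retaining column into unquoted text and a 'was quoted' mask."""
    values: List[Optional[str]] = []
    was_quoted: List[bool] = []
    for token in column:
        quoted = token is not None and token.startswith('"')
        # json.loads removes the surrounding quotes and resolves escape sequences.
        values.append(json.loads(token) if quoted else token)
        was_quoted.append(quoted)
    return values, was_quoted


# Assumed contents of a column read with quotes retained.
column = ["true", '"true"', None]
values, was_quoted = split_quotes(column)
print(values)      # ['true', 'true', None]
print(was_quoted)  # [False, True, False]
```

The mask identifies the entries a non-string target type should treat as null, and the unquoted values serve string columns directly.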

Otherwise, I would need to better understand the expected behaviour.

> Here is an example JSON input to demonstrate the problem.
>
> { "number": true }
> { "number": "true" }
>
> The first entry is a valid JSON boolean value and the second entry is a JSON string. If we ask cuDF to read this attribute as a string then we get the same value in both cases. Spark would treat the second entry as invalid and return null.

Is this a mix-up, or would Spark really treat the second value as null?

I think having a mapping of a tuple (target_type, JSON_type) -> {valid, invalid} would help, where for JSON_type we distinguish between {string value, non-string value}.
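
Such a mapping could be sketched as a small lookup table. Everything here is hypothetical: the `JsonType` enum, the `VALIDITY` table, and in particular the individual entries, which are assumptions based on the behaviour described earlier in this thread rather than a confirmed specification.

```python
from enum import Enum


class JsonType(Enum):
    STRING = "string value"
    NON_STRING = "non-string value"


# (target_type, JSON_type) -> valid? Entries are illustrative assumptions.
VALIDITY = {
    ("string", JsonType.STRING): True,
    ("string", JsonType.NON_STRING): True,    # non-string primitives read as strings
    ("boolean", JsonType.STRING): False,      # quoted "true" should become null
    ("boolean", JsonType.NON_STRING): True,
}


def is_valid(target_type: str, json_type: JsonType) -> bool:
    """Look up whether a JSON value of the given kind is valid for the target type."""
    return VALIDITY[(target_type, json_type)]


print(is_valid("boolean", JsonType.NON_STRING))  # True
print(is_valid("boolean", JsonType.STRING))      # False
```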

Sep 06 '22 12:09 elstehle

Yes, keep_quotes would do what we want.

Sep 08 '22 15:09 revans2

@revans2 the keep_quotes option is now merged. Can we close this issue? We can always reopen if the implementation is not sufficient.

Sep 27 '22 23:09 vuule

Sure

Sep 30 '22 18:09 revans2
