cudf
[FEA] JSON reader: Provide option to treat quoted strings as null values
Is your feature request related to a problem? Please describe.
This is part of https://github.com/NVIDIA/spark-rapids/issues/9
In order to be consistent with Spark when reading JSON on the GPU, we would like to ask cuDF to read non-string primitive values as strings and then cast them to the required type. This approach already works well for valid inputs but we do not have a way to treat quoted strings as null to match Spark's behavior.
Here is an example JSON input to demonstrate the problem.
{ "number": true }
{ "number": "true" }
The first entry is a valid JSON boolean value and the second entry is a JSON string. If we ask cuDF to read this attribute as a string then we get the same value in both cases. Spark would treat the second entry as invalid and return null.
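The loss of information can be reproduced with a small sketch. It uses Python's standard `json` module as a stand-in for the reader, not cuDF itself, and the helper name `read_as_string` is hypothetical.

```python
import json

lines = ['{ "number": true }', '{ "number": "true" }']

def read_as_string(line, key):
    """Emulate a reader that returns every primitive value as a string."""
    value = json.loads(line)[key]
    if isinstance(value, str):
        return value  # JSON string: the quotes are already stripped
    return json.dumps(value)  # non-string primitive: use its JSON spelling

as_strings = [read_as_string(line, "number") for line in lines]
print(as_strings)  # ['true', 'true']
```

Once both entries are plain strings, nothing distinguishes the JSON boolean from the JSON string, so a later cast cannot null out the second one.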
Describe the solution you'd like
There are a few possible approaches to this:
- Have a way to read the raw value without any parsing, so the resulting column will include the quotes
- Add the ability to ask cuDF to read non-string values as strings but to interpret any values that are JSON strings (quoted) as null
- Return all the data as strings but also include a bitmask indicating which ones were quoted.
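The third approach can be sketched as follows. This is an illustration only, again using Python's standard `json` module rather than cuDF. The names `read_with_quoted_mask` and `cast_to_bool` are hypothetical, and a plain list of booleans stands in for a bitmask.

```python
import json

def read_with_quoted_mask(lines, key):
    """Return (values as strings, mask that is True where the value was quoted)."""
    values, quoted = [], []
    for line in lines:
        value = json.loads(line)[key]
        is_string = isinstance(value, str)
        values.append(value if is_string else json.dumps(value))
        quoted.append(is_string)
    return values, quoted

def cast_to_bool(values, quoted):
    """Cast to boolean, treating quoted or unrecognized values as null (None)."""
    lookup = {"true": True, "false": False}
    return [None if q else lookup.get(v) for v, q in zip(values, quoted)]

values, quoted = read_with_quoted_mask(
    ['{ "number": true }', '{ "number": "true" }'], "number"
)
print(values, quoted)                # ['true', 'true'] [False, True]
print(cast_to_bool(values, quoted))  # [True, None]
```

With the mask available, the caller can null out quoted entries before casting, without having to strip quote characters from the string data.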
Describe alternatives you've considered
None
Additional context
None
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
Moved this to P1 as it is a corner case that is not that common. It would be very nice to be able to support this sooner rather than later.
Have a way to read the raw value without any parsing, so the resulting column will include the quotes
In this case, is it okay to keep the quotes on the values in the actual string columns as well?
If we keep the quotes then we will have to perform an additional transformation in the plugin to remove them so this doesn't seem ideal.
If we can get the raw string value (without quotes) and an indication of whether the value was quoted or not then I think we have everything we need.
Once #11574 is merged, the new nested JSON reader (currently available as experimental) will introduce an option, keep_quotes, that will retain the quote characters on string values. Would this sufficiently address this feature request?
Otherwise, I would need to better understand the expected behaviour.
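As a rough sketch of how a caller might use such output: assume the reader hands back every value as a string and leaves the quote characters on values that were JSON strings. The post-processing step below is plain Python for illustration, and the helper name `null_if_quoted` is hypothetical rather than part of cuDF.

```python
def null_if_quoted(raw_values):
    """Replace values that still carry a leading JSON quote with None."""
    return [
        None if value is None or value.startswith('"') else value
        for value in raw_values
    ]

# Assumed shape of the reader output when quotes are retained.
raw = ["true", '"true"', None]
print(null_if_quoted(raw))  # ['true', None, None]
```

The surviving values can then be cast to the target type as before.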
Here is an example JSON input to demonstrate the problem.
{ "number": true }
{ "number": "true" }
The first entry is a valid JSON boolean value and the second entry is a JSON string. If we ask cuDF to read this attribute as a string then we get the same value in both cases. Spark would treat the second entry as invalid and return null.
Is this a mixup, or would Spark really treat the second value as null?
I think it would help to have a mapping from a tuple (target_type, JSON_type) to {valid, invalid}, where for JSON_type we distinguish between {string value, non-string value}.
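A minimal sketch of such a mapping, in plain Python. The two boolean entries follow the example in this issue. The string entry is an assumption, and the names `VALIDITY`, `json_type_of`, and `validate` are hypothetical.

```python
# (target_type, JSON_type) -> True if valid, False if the value should become null.
VALIDITY = {
    ("boolean", "non-string"): True,   # { "number": true }
    ("boolean", "string"): False,      # { "number": "true" } -> null
    ("string", "string"): True,        # assumption: quoted values are valid strings
}

def json_type_of(raw):
    """Classify a raw JSON value by whether it is quoted."""
    return "string" if raw.startswith('"') else "non-string"

def validate(target_type, raw):
    """Return the raw value if the combination is valid, otherwise None."""
    return raw if VALIDITY[(target_type, json_type_of(raw))] else None

print(validate("boolean", "true"))    # true
print(validate("boolean", '"true"'))  # None
```

Combinations missing from the table raise a `KeyError`, which makes undecided cases visible instead of guessing at them.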
Yes keep_quotes would do what we want.
@revans2 the keep_quotes option is now merged. Can we close this issue? We can always reopen if the implementation is not sufficient.
Sure