
[FEA] JSON input support


Is your feature request related to a problem? Please describe.

High Priority

GetJsonObject/get_json_object

get_json_object is being tracked by separate epics: one in the spark-rapids repo and another in spark-rapids-jni. The JNI version is an attempt to write a new parser from scratch that matches what Spark does.

  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10254
  • [ ] https://github.com/NVIDIA/spark-rapids-jni/issues/1823

JsonTuple/json_tuple

For now it is implemented as multiple calls to get_json_object. This is likely to change in the future, but for now it should be tracked together with get_json_object.
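
To illustrate the relationship (a rough Spark SQL sketch, not the plugin's actual implementation), json_tuple over a set of keys behaves much like one get_json_object call per key:

// json_tuple extracts several keys at once...
spark.sql("""SELECT json_tuple('{"a":1,"b":2}', 'a', 'b')""").show()
// ...which is roughly equivalent to one get_json_object call per key.
spark.sql("""
  SELECT get_json_object('{"a":1,"b":2}', '$.a') AS a,
         get_json_object('{"a":1,"b":2}', '$.b') AS b""").show()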

JsonToStructs and ScanJson

JsonToStructs and ScanJson share a common backend, and most bugs in one are reflected in the other, with the exception of a few issues that are specific to how JsonToStructs prepares its input so that the CUDF JSON parser can handle it.
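
For context, the two map to user-facing APIs roughly as follows (a minimal sketch assuming an existing SparkSession; the file path is hypothetical): JsonToStructs is the expression behind from_json, while ScanJson backs the file-based reader.

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

val schema = StructType(Seq(StructField("a", LongType)))

// JsonToStructs: parse an existing string column.
Seq("""{"a": 1}""").toDF("js")
  .select(from_json($"js", schema).as("parsed"))
  .show()

// ScanJson: read JSON files directly.
spark.read.schema(schema).json("/tmp/input.json").show()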

  • [ ] https://github.com/NVIDIA/spark-rapids/issues/4609
    • [ ] https://github.com/NVIDIA/spark-rapids/issues/4940
    • [ ] https://github.com/rapidsai/cudf/issues/15318
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10485
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10453
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/7616
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10483
  • [ ] #4608
    • [x] https://github.com/rapidsai/cudf/issues/8827
    • [ ] https://github.com/rapidsai/cudf/issues/15278
    • [ ] https://github.com/rapidsai/cudf/issues/15260
    • [ ] https://github.com/rapidsai/cudf/issues/14239
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/8558
    • [ ] https://github.com/rapidsai/cudf/issues/15278
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10489
    • [ ] https://github.com/rapidsai/cudf/issues/15277
  • [ ] #4612
    • [ ] https://github.com/rapidsai/cudf/issues/15222
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/7482
  • [x] https://github.com/NVIDIA/spark-rapids/issues/10486
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10479
    • [ ] https://github.com/rapidsai/cudf/issues/15222
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10595
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10596
    • [ ] https://github.com/rapidsai/cudf/issues/15303

Medium Priority

JsonToStructs and ScanJson

  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10488
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/4951
  • [x] #4718
  • [ ] #4616
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/9774
  • [x] https://github.com/NVIDIA/spark-rapids/issues/9750
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10458
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10457
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10534
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10535
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10607
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10573
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10574
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10588

Low Priority

JsonToStructs and ScanJson

  • [ ] #4610
  • [ ] JSON reader: Support lineSep configuration option
  • [ ] JSON reader: Support multiLine configuration option
  • [ ] JSON reader: Support primitivesAsString configuration option
  • [ ] JSON reader: Support prefersDecimal configuration option
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/9664
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/9723
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/9724
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10495
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10493
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10539
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10587
    • [ ] https://github.com/rapidsai/cudf/issues/10266

Testing

JsonToStructs and ScanJson

  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10491
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10044
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10492

revans2 · May 28 '20 17:05

We have had a customer ask about this, so we might bump up the priority. I have been looking at JSON parsing and how it relates to the existing CSV parsing. Just like the CSV parsing it is not great, but I think we could do with JSON what we want to do with CSV: parse all of the atomic types as strings and then handle the casting/parsing ourselves. This would make the code a lot more robust in terms of Spark compatibility.
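
A minimal sketch of that read-as-strings idea (illustrative only; the path and column names are made up):

import org.apache.spark.sql.types._

// Read every atomic field as a string so the JSON parser does not apply
// its own numeric/boolean parsing rules...
val stringSchema = StructType(Seq(
  StructField("a", StringType),
  StructField("b", StringType)))
val raw = spark.read.schema(stringSchema).json("/tmp/input.json")

// ...then cast with Spark semantics, reusing the existing
// cast-from-string handling.
val typed = raw.selectExpr("CAST(a AS INT) AS a", "CAST(b AS DOUBLE) AS b")

But there are still a number of issues that we have to look into and probably address.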

  1. The current CUDF code does not support nested types for parsing JSON. https://github.com/rapidsai/cudf/issues/8827 should fix that, but it is not done yet.
  2. The current CUDF code parses types how it wants to, not how Spark wants them to be parsed. Like I said above, we might be able to ask CUDF to read all of the types as strings and then parse them ourselves. This will not fix everything, because looking at cast there are still a number of types that we do not fully support when casting from a string. But it is a step in the right direction and should let us avoid most of the enable-this-type-for-JSON configs. At a minimum we could reuse the cast-from-string configs that already exist.
  3. "allowSingleQuotes" option. This is set to true by default so Spark allows parsing values with single quotes instead of just double quotes. The CUDF code does not. So we probably want to file an issue with CUDF to ask for this, and also in the short term document that we cannot support this.
  4. There are also a number of configs in Spark that we will need to play around with in CUDF, but I am fairly sure that if they are enabled we will have to fall back to the CPU, or at least document the inconsistencies really well. These include:
    a. "allowComments": C/C++ style comments // and /* */.
    b. "allowUnquotedFieldNames".
    c. "allowNumericLeadingZeros": here the parser strips the leading zeros from numeric JSON types instead of marking them as corrupt entries. So if I had something like {"a": 0001, "b": 01.1000000000}, both entries would cause the row to be marked as corrupt. Even if we provide a schema and ask for both a and b to be strings, they show up as nulls because the values are technically corrupt. If I set the option to true they are parsed, but the leading zeros (and, in the case of the float, the trailing zeros) are stripped from the resulting string.
    d. "allowNonNumericNumbers": the docs say this means INF, -INF, and NaN, but I could not make INF work, just -INF. Not sure how widely used this is.
    e. "allowBackslashEscapingAnyCharacter": typically only a few characters in the JSON standard are allowed. CUDF only supports ", \, \t, \r, and \b. Not sure if there are others in JSON or not. Also not sure what happens in CUDF when others are encountered vs in Spark.
    f. "allowUnquotedControlChars": the docs say ASCII characters with a value less than 32. My guess is CUDF just allows these all the time.
  5. The "parseMode" and "columnNameOfCorruptRecord" parse mode config probably needs to always be lenient and we need to fall back to the CPU like with CSV if we ever see the columnNameOfCorruptRecord in the schema. We just don't support that and I don't know what it would take for CUDF to be able to do it.
  6. "ignoreNullFields" and "dropFieldIfAllNull" are things we might be able to do with post processing, but we need to play around with it a little.
  7. If we see multiLine we have to fall back to the CPU. We just cannot support it.
  8. We need to look at encodings and see if we have to fall back to the CPU or not. I hope that we can normalize things as we read them in, like we do with the line endings in CSV and will have to do with the line endings here, but I am not 100% sure about that.
  9. We need to check the zoneId like with CSV and fall back if it is not UTC.
  10. Need to check the date/timestamp formats too and see if there is anything in there that we cannot support.
  11. We also need to play around with JSON parsing in general and see what falls out. It looks like Spark is fairly strict in parsing, and there can be odd corner cases. For example, it looks like unquoted number values, if they are going to be interpreted as a string, are parsed as numbers first and then cast to a string, so 3.00 becomes 3.0 but "3.00" remains unchanged. I don't see any way we can match this behavior and we will probably just need to document it. A quick repro sketch follows this list.
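
Here is that repro for item 11 (assuming an existing SparkSession; the expected output mirrors the behavior described above):

import org.apache.spark.sql.types._
import spark.implicits._

val schema = StructType(Seq(StructField("v", StringType)))
val lines = Seq("""{"v": 3.00}""", """{"v": "3.00"}""").toDS()
spark.read.schema(schema).json(lines).show()
// +----+
// |   v|
// +----+
// | 3.0|   <- unquoted number is reparsed and recast to a string
// |3.00|   <- quoted string is kept verbatim
// +----+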

revans2 · Nov 03 '21 18:11

Oh, also a single line can contain multiple entries if the top level is an array.

[{"a": 1, "b": 1}, {"a": 2}]
[{"a": 1, "b": 2.0}]
{"a": 1, "b": "inf"}

produces

+---+----+
|  a|   b|
+---+----+
|  1|   1|
|  2|null|
|  1| 2.0|
|  1| inf|
+---+----+

But if there is extra JSON-like content at the end of the line, it is ignored.

{"a": 1, "b": 1}, {"other": 2}
{"a": 1, "b": "inf"} garbage
{"a": 1, "b": 1} {"more": 2}

produces the following with no errors, which feels really odd to me.

+---+---+
|  a|  b|
+---+---+
|  1|  1|
|  1|inf|
|  1|  1|
+---+---+

There may be a lot of other odd cases that we need to look into.
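
Both cases above can be reproduced with something like this (a sketch assuming an existing SparkSession):

import spark.implicits._

// Top-level array: one input line fans out into multiple rows.
val arrayLines = Seq(
  """[{"a": 1, "b": 1}, {"a": 2}]""",
  """[{"a": 1, "b": 2.0}]""",
  """{"a": 1, "b": "inf"}""").toDS()
spark.read.json(arrayLines).show()

// Trailing content after the first JSON value is silently ignored.
val trailingLines = Seq(
  """{"a": 1, "b": 1}, {"other": 2}""",
  """{"a": 1, "b": "inf"} garbage""",
  """{"a": 1, "b": 1} {"more": 2}""").toDS()
spark.read.json(trailingLines).show()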

revans2 · Nov 03 '21 18:11

We might want to look at some of the Spark JSON tests, but they are not that complete.

revans2 · Nov 03 '21 18:11

Oh, that is interesting. The parsing of JSON keys is case sensitive, but auto-detection of the schema is not entirely, so you can get errors if you let Spark detect the schema and there are keys that differ only in case, i.e. A vs a. So we should test whether we can select the keys in a case-sensitive way; one way to run that test is sketched below.
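
A hedged sketch of that test (spark.sql.caseSensitive must be true so both column names are allowed to coexist):

import org.apache.spark.sql.types._
import spark.implicits._

spark.conf.set("spark.sql.caseSensitive", "true")

// Two keys that differ only in case, with an explicit schema so no
// inference is involved.
val schema = StructType(Seq(
  StructField("a", LongType),
  StructField("A", LongType)))
spark.read.schema(schema).json(Seq("""{"a": 1, "A": 2}""").toDS()).show()
// If key selection is case sensitive, this should show a=1 and A=2.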

Also, if there are multiple occurrences of the same key in a record, for Spark it looks like the last one wins. Not sure what CUDF does in those cases.

{"a": 1, "b": 1}
{"a": 1, "a": 2}
{"a": 1, "b": 1}

produces

+---+----+
|  a|   b|
+---+----+
|  1|   1|
|  2|null|
|  1|   1|
+---+----+

with no errors

revans2 · Nov 03 '21 18:11

Added "Needs Triage" back on so we can look at this again, because I think most of the analysis is done.

revans2 · Nov 03 '21 18:11