Prototype get_json_object
Description
This is a prototype of Spark's get_json_object_multiple_paths built on the libcudf JSON reader.
It can read named and indexed paths. WILDCARD is unsupported and will currently give wrong results. Float results will differ from Spark's API.
Depends on #15968 (validation of values)
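To make the supported path grammar concrete, here is a minimal sketch (Python for brevity; `parse_path` and the tuple encoding are made up for illustration, not the actual libcudf/C++ code) of tokenizing a path into named and indexed instructions, with wildcards recognized but unsupported:

```python
import re

# Hypothetical sketch (NOT the actual libcudf implementation): tokenize a
# simplified JSON path into the named / indexed instructions the prototype
# supports. "$" alone parses to an empty instruction list.
TOKEN = re.compile(r"\.(\w+)|\[(\d+)\]|\[\*\]|\.\*")

def parse_path(path):
    if not path.startswith("$"):
        return None
    rest = path[1:]
    pos, instrs = 0, []
    for m in TOKEN.finditer(rest):
        if m.start() != pos:        # unparsable junk between tokens
            return None
        name, index = m.group(1), m.group(2)
        if name is not None:
            instrs.append(("named", name))
        elif index is not None:
            instrs.append(("index", int(index)))
        else:
            instrs.append(("wildcard", None))  # unsupported in the prototype
        pos = m.end()
    return instrs if pos == len(rest) else None
```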
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [ ] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.
If I try to run get_json_object_multiple_paths2 with just $ as the path (which parses to an empty path-instruction vector), I get back what looks like the wrong type, and I don't really understand why. The following has some debug logging added in.
```
scala> Seq("""{"a":100}""","""{"b":200}""").toDF("json").repartition(1).selectExpr("*", "get_json_object(json, '$.a') as a", "get_json_object(json, '$.b') as b", "get_json_object(json, '$') as dollar").show()
...
GPU COLUMN OUTPUT 0 - NC: 1 DATA: DeviceMemoryBufferView{address=0x40a003e00, length=3, id=-1} VAL: DeviceMemoryBufferView{address=0x40a004000, length=64, id=-1}
...
GPU COLUMN OUTPUT 1 - NC: 1 DATA: DeviceMemoryBufferView{address=0x40a002400, length=3, id=-1} VAL: DeviceMemoryBufferView{address=0x40a002600, length=64, id=-1}
...
GPU COLUMN OUTPUT 2 - NC: 0 DATA: null VAL: null
GPU COLUMN OUTPUT 2:CHILD_0 - NC: 1 DATA: DeviceMemoryBufferView{address=0x40a002a00, length=3, id=-1} VAL: DeviceMemoryBufferView{address=0x40a002c00, length=64, id=-1}
GPU COLUMN OUTPUT 2:CHILD_1 - NC: 1 DATA: DeviceMemoryBufferView{address=0x40a003000, length=3, id=-1} VAL: DeviceMemoryBufferView{address=0x40a003200, length=64, id=-1}
```
The column returned for $ is a struct with two string children, each of which matches the output of the previous two columns. The output is supposed to be a normalized and verified version of the input:
```
scala> Seq("""{"a":100}""","""{"b":200}""").toDF("json").repartition(1).selectExpr("*", "get_json_object(json, '$.a') as a", "get_json_object(json, '$.b') as b", "get_json_object(json, '$') as dollar").show()
+---------+----+----+---------+
|     json|   a|   b|   dollar|
+---------+----+----+---------+
|{"a":100}| 100|null|{"a":100}|
|{"b":200}|null| 200|{"b":200}|
+---------+----+----+---------+
```
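For reference, here is a minimal sketch of what "normalized" could mean for the $ path (Python for brevity; `normalize` is a hypothetical stand-in, and the real validation from #15968 is omitted):

```python
# Hypothetical sketch of the expected "$" semantics: return the input value
# itself, normalized (here: insignificant whitespace outside of string
# literals removed). Real validation would also reject malformed input.
def normalize(json_str):
    out = []
    in_string = escaped = False
    for c in json_str:
        if escaped:
            out.append(c)
            escaped = False
        elif in_string:
            out.append(c)
            if c == "\\":
                escaped = True
            elif c == '"':
                in_string = False
        elif c == '"':
            out.append(c)
            in_string = True
        elif not c.isspace():
            out.append(c)
    return "".join(out)
```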
If I try to parse something with an array at the top level, I get an illegal memory access error.
```
scala> Seq("""{"a":100}""","""[{"b":200}]""").toDF("json").repartition(1).selectExpr("*", "get_json_object(json, '$.a') as a", "get_json_object(json, '$.b') as b").show()
...
24/08/15 15:52:14 ERROR Executor: Exception in task 0.0 in stage 17.0 (TID 17)
ai.rapids.cudf.CudaFatalException: parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
	at com.nvidia.spark.rapids.jni.JSONUtils.getJsonObjectMultiplePaths(Native Method)
	at com.nvidia.spark.rapids.jni.JSONUtils.getJsonObjectMultiplePaths(JSONUtils.java:132)
	at com.nvidia.spark.rapids.GpuMultiGetJsonObject.$anonfun$columnarEval$6(GpuGetJsonObject.scala:219)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
```
But it works just fine if there is no array at the top level.
```
scala> Seq("""{"a":100}""","""{"b":200}""").toDF("json").repartition(1).selectExpr("*", "get_json_object(json, '$.a') as a", "get_json_object(json, '$.b') as b").show()
...
+---------+----+----+
|     json|   a|   b|
+---------+----+----+
|{"a":100}| 100|null|
|{"b":200}|null| 200|
+---------+----+----+
```
This is kind of important because I have seen this pattern in real customer data. In fact, I have seen several queries that expect it, `$[0]` for example.
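To illustrate the `$[0]` case those queries rely on, here is a hedged sketch (Python; `first_element` is a made-up helper, not part of the prototype) of extracting the first top-level element of a JSON array by bracket matching, returning None for non-array input the way get_json_object returns null:

```python
# Hypothetical helper illustrating the "$[0]" case seen in customer queries:
# extract the first top-level element of a JSON array. Returns None for
# non-array input, matching get_json_object's null result.
def first_element(json_str):
    t = json_str.strip()
    if not t.startswith("["):
        return None
    depth = 0
    in_string = escaped = False
    body = t[1:]
    for i, c in enumerate(body):
        if escaped:
            escaped = False
        elif in_string:
            if c == "\\":
                escaped = True
            elif c == '"':
                in_string = False
        elif c == '"':
            in_string = True
        elif c in "[{":
            depth += 1
        elif c in "]}" and depth > 0:
            depth -= 1
        elif c == "," and depth == 0:
            return body[:i].strip()        # first element ends at the comma
        elif c == "]" and depth == 0:
            return body[:i].strip() or None  # single-element or empty array
    return None
```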
I think you already know this too, but I found that I could not have two ambiguous paths in the same call: one path that treats an item as an array and another that treats the same item as an object.
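A sketch of that ambiguity check (Python; the `("named", x)` / `("index", i)` tuple encoding and `ambiguous` are assumptions for illustration, not the prototype's actual representation):

```python
# Sketch of the ambiguity described above: two paths conflict if, at some
# shared node, one indexes into an array while the other reads an object
# field. Once the paths diverge to different children, no conflict is
# possible below that point.
def ambiguous(a, b):
    for (kind_a, val_a), (kind_b, val_b) in zip(a, b):
        if {kind_a, kind_b} == {"named", "index"}:
            return True                        # array vs object at same node
        if (kind_a, val_a) != (kind_b, val_b):
            return False                       # paths diverged; no shared node below
    return False
```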
On a happy note, I was able to get a 1.6x speedup on one of my queries with this patch compared to the version in spark-rapids-jni. But I did have to comment out a lot of code, and I have not tried to validate the results yet.
I believe this was just a POC, so I'll close it, since we can always revisit the PR later.