Prototype get_json_object
Description
This is a prototype of Spark's get_json_object_multiple_paths built on the libcudf JSON reader.
It can read named and indexed paths. WILDCARD is unsupported and will currently give wrong results. Float results will differ from Spark's API.
Depends on #15968 (validation of values)
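To make the supported path grammar concrete, here is a minimal sketch (Python for brevity; `parse_path` and the tuple encoding are made up for illustration, not the actual libcudf/C++ code) of tokenizing a path into named and indexed instructions, with wildcards recognized but unsupported:

```python
import re

# Hypothetical sketch (NOT the actual libcudf implementation): tokenize a
# simplified JSON path into the named / indexed instructions the prototype
# supports. "$" alone parses to an empty instruction list.
TOKEN = re.compile(r"\.(\w+)|\[(\d+)\]|\[\*\]|\.\*")

def parse_path(path):
    if not path.startswith("$"):
        return None
    rest = path[1:]
    pos, instrs = 0, []
    for m in TOKEN.finditer(rest):
        if m.start() != pos:        # unparsable junk between tokens
            return None
        name, index = m.group(1), m.group(2)
        if name is not None:
            instrs.append(("named", name))
        elif index is not None:
            instrs.append(("index", int(index)))
        else:
            instrs.append(("wildcard", None))  # unsupported in the prototype
        pos = m.end()
    return instrs if pos == len(rest) else None
```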
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [ ] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.
If I try to run get_json_object_multiple_paths2 with just $ as the path (which parses to an empty path-instruction vector), I get back what looks like the wrong type, and I don't really understand why. The following has some debug logging added in.
```
scala> Seq("""{"a":100}""","""{"b":200}""").toDF("json").repartition(1).selectExpr("*", "get_json_object(json, '$.a') as a", "get_json_object(json, '$.b') as b", "get_json_object(json, '$') as dollar").show()
...
GPU COLUMN OUTPUT 0 - NC: 1 DATA: DeviceMemoryBufferView{address=0x40a003e00, length=3, id=-1} VAL: DeviceMemoryBufferView{address=0x40a004000, length=64, id=-1}
...
GPU COLUMN OUTPUT 1 - NC: 1 DATA: DeviceMemoryBufferView{address=0x40a002400, length=3, id=-1} VAL: DeviceMemoryBufferView{address=0x40a002600, length=64, id=-1}
...
GPU COLUMN OUTPUT 2 - NC: 0 DATA: null VAL: null
GPU COLUMN OUTPUT 2:CHILD_0 - NC: 1 DATA: DeviceMemoryBufferView{address=0x40a002a00, length=3, id=-1} VAL: DeviceMemoryBufferView{address=0x40a002c00, length=64, id=-1}
GPU COLUMN OUTPUT 2:CHILD_1 - NC: 1 DATA: DeviceMemoryBufferView{address=0x40a003000, length=3, id=-1} VAL: DeviceMemoryBufferView{address=0x40a003200, length=64, id=-1}
```
The column returned for $ is a struct with two string children, each of which matches the output of the previous two columns. The output is supposed to be a normalized and verified version of the input:
```
scala> Seq("""{"a":100}""","""{"b":200}""").toDF("json").repartition(1).selectExpr("*", "get_json_object(json, '$.a') as a", "get_json_object(json, '$.b') as b", "get_json_object(json, '$') as dollar").show()
+---------+----+----+---------+
|     json|   a|   b|   dollar|
+---------+----+----+---------+
|{"a":100}| 100|null|{"a":100}|
|{"b":200}|null| 200|{"b":200}|
+---------+----+----+---------+
```
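For reference, here is a minimal sketch of what "normalized" could mean for the $ path (Python for brevity; `normalize` is a hypothetical stand-in, and the real validation from #15968 is omitted):

```python
# Hypothetical sketch of the expected "$" semantics: return the input value
# itself, normalized (here: insignificant whitespace outside of string
# literals removed). Real validation would also reject malformed input.
def normalize(json_str):
    out = []
    in_string = escaped = False
    for c in json_str:
        if escaped:
            out.append(c)
            escaped = False
        elif in_string:
            out.append(c)
            if c == "\\":
                escaped = True
            elif c == '"':
                in_string = False
        elif c == '"':
            out.append(c)
            in_string = True
        elif not c.isspace():
            out.append(c)
    return "".join(out)
```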
If I try to parse something with an array at the top level, I get an illegal memory access error.
```
scala> Seq("""{"a":100}""","""[{"b":200}]""").toDF("json").repartition(1).selectExpr("*", "get_json_object(json, '$.a') as a", "get_json_object(json, '$.b') as b").show()
...
24/08/15 15:52:14 ERROR Executor: Exception in task 0.0 in stage 17.0 (TID 17)
ai.rapids.cudf.CudaFatalException: parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
	at com.nvidia.spark.rapids.jni.JSONUtils.getJsonObjectMultiplePaths(Native Method)
	at com.nvidia.spark.rapids.jni.JSONUtils.getJsonObjectMultiplePaths(JSONUtils.java:132)
	at com.nvidia.spark.rapids.GpuMultiGetJsonObject.$anonfun$columnarEval$6(GpuGetJsonObject.scala:219)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
```
But it works just fine if there is no array at the top level.
```
scala> Seq("""{"a":100}""","""{"b":200}""").toDF("json").repartition(1).selectExpr("*", "get_json_object(json, '$.a') as a", "get_json_object(json, '$.b') as b").show()
...
+---------+----+----+
|     json|   a|   b|
+---------+----+----+
|{"a":100}| 100|null|
|{"b":200}|null| 200|
+---------+----+----+
```
This is kind of important because I have seen this pattern in real customer data. In fact, I have seen several queries that expect it, `$[0]` for example.
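To illustrate the `$[0]` case those queries rely on, here is a hedged sketch (Python; `first_element` is a made-up helper, not part of the prototype) of extracting the first top-level element of a JSON array by bracket matching, returning None for non-array input the way get_json_object returns null:

```python
# Hypothetical helper illustrating the "$[0]" case seen in customer queries:
# extract the first top-level element of a JSON array. Returns None for
# non-array input, matching get_json_object's null result.
def first_element(json_str):
    t = json_str.strip()
    if not t.startswith("["):
        return None
    depth = 0
    in_string = escaped = False
    body = t[1:]
    for i, c in enumerate(body):
        if escaped:
            escaped = False
        elif in_string:
            if c == "\\":
                escaped = True
            elif c == '"':
                in_string = False
        elif c == '"':
            in_string = True
        elif c in "[{":
            depth += 1
        elif c in "]}" and depth > 0:
            depth -= 1
        elif c == "," and depth == 0:
            return body[:i].strip()        # first element ends at the comma
        elif c == "]" and depth == 0:
            return body[:i].strip() or None  # single-element or empty array
    return None
```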
I think you already know this too, but I found that I could not have two ambiguous paths in the same call: one path that treats an item as an array and another that treats the same item as an object.
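A sketch of that ambiguity check (Python; the `("named", x)` / `("index", i)` tuple encoding and `ambiguous` are assumptions for illustration, not the prototype's actual representation):

```python
# Sketch of the ambiguity described above: two paths conflict if, at some
# shared node, one indexes into an array while the other reads an object
# field. Once the paths diverge to different children, no conflict is
# possible below that point.
def ambiguous(a, b):
    for (kind_a, val_a), (kind_b, val_b) in zip(a, b):
        if {kind_a, kind_b} == {"named", "index"}:
            return True                        # array vs object at same node
        if (kind_a, val_a) != (kind_b, val_b):
            return False                       # paths diverged; no shared node below
    return False
```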
On a happy note, I was able to get a 1.6x speedup on one of my queries with this patch compared to the version in spark-rapids-jni. But I did have to comment out a lot of code, and I have not tried to validate the results yet.
I believe this was just a POC, so I'll close it, since we can always revisit the PR later.