
Use new getJsonObject kernel for json_tuple

thirtiseven opened this pull request 1 year ago

This PR updates json_tuple with new getJsonObject kernel.

All currently xfailed cases now pass:

```
./integration_tests/run_pyspark_from_build.sh -s -k json_tuple
......
============= 16 passed, 13 xpassed, 8 warnings in 49.36s ============
```

I expect the performance to be suboptimal because it calls the getJsonObject kernel once per requested field, and that kernel is not very fast by itself.

With the new json_parser in spark-rapids-jni, I think we can implement a dedicated kernel for json_tuple that extracts all requested fields in a single pass, for much higher performance. So even if this PR gets merged, it is a short-term workaround.
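To make the trade-off concrete, here is a minimal, Spark-free sketch (all names hypothetical, not the actual kernel API) of the two strategies over an already-parsed flat object: the per-field approach this PR takes does one full lookup pass per requested field, while a dedicated kernel could collect every field in a single traversal.

```scala
// Toy model only -- the parsed JSON object is stood in for by a Map;
// the point is the number of passes over the input, not the parsing.
object JsonTupleSketch {
  // What this PR does, conceptually: one getJsonObject-style lookup per
  // field, i.e. n independent passes for n requested fields.
  def perField(obj: Map[String, String], fields: Seq[String]): Seq[Option[String]] =
    fields.map(obj.get)

  // What a dedicated json_tuple kernel could do: walk the object once and
  // fill in every requested field during that single traversal.
  def singlePass(obj: Map[String, String], fields: Seq[String]): Seq[Option[String]] = {
    val wanted = fields.zipWithIndex.toMap
    val out = Array.fill[Option[String]](fields.length)(None)
    obj.foreach { case (k, v) => wanted.get(k).foreach(i => out(i) = Some(v)) }
    out.toSeq
  }
}
```

Both return the same answers (missing fields are `None` here, null in Spark); the difference is purely how many times the input is scanned.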

Depends on https://github.com/NVIDIA/spark-rapids-jni/pull/1893

thirtiseven avatar Mar 26 '24 08:03 thirtiseven

a quick perf test:

```scala
val data = Seq.fill(3000000)("""{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"basket":[[1,2,{"b":"y","a":"x"}],[3,4],[5,6]],"book":[{"author":"Nigel Rees","title":"Sayings of the Century","category":"reference","price":8.95},{"author":"Herman Melville","title":"Moby Dick","category":"fiction","price":8.99,"isbn":"0-553-21311-3"},{"author":"J. R. R. Tolkien","title":"The Lord of the Rings","category":"fiction","reader":[{"age":25,"name":"bob"},{"age":26,"name":"jack"}],"price":22.99,"isbn":"0-395-19395-8"}],"bicycle":{"price":19.95,"color":"red"}},"email":"amy@only_for_json_udf_test.net","owner":"amy","zip code":"94025","fb:testid":"1234"}""")

import spark.implicits._
data.toDF("a").write.mode("overwrite").parquet("JSON")

val df = spark.read.parquet("JSON")

spark.conf.set("spark.rapids.sql.expression.JsonTuple", true)

spark.time(df.selectExpr("json_tuple(a, 'store', 'reader', 'bicycle', 'owner', 'zip code', 'email')").count())
```

6 fields: CPU: 65216 ms GPU: 7851 ms

1 field: CPU: 68050 ms GPU: 2070 ms

Wow, so it is actually quite fast. Not sure if I tested it right.

thirtiseven avatar Mar 26 '24 15:03 thirtiseven

> Wow, so it is actually quite fast. Not sure if I tested it right.

A bit of feedback on the quick test.

  1. Looks like you only ran it once. Cold runs are usually a lot slower than hot runs, so a single measurement is not very reliable, but even accounting for that the gap is surprising.
  2. All of your data is the same. Which means that there is no thread divergence in the GPU.
  3. You don't mention what CPU/system was used so it is hard to tell if it is a fair comparison or not.
  4. It would be nice to see how fast the parquet read alone is on the GPU compared to the CPU. All of the performance gains might be in that, just because it is a single string column repeated 3,000,000 times.
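On point 1, a tiny warm-up-and-median helper illustrates the kind of harness that would make the numbers more trustworthy. This is a pure-Scala sketch with hypothetical names, not project code:

```scala
// Minimal timing helper: discard cold/warm-up runs, then report the median
// of several hot runs instead of a single measurement.
object Bench {
  def timeMs[A](body: => A): Double = {
    val t0 = System.nanoTime()
    body
    (System.nanoTime() - t0) / 1e6
  }

  def medianHotMs[A](warmups: Int, runs: Int)(body: => A): Double = {
    (1 to warmups).foreach(_ => body) // cold runs: JIT, caches, kernel load
    val times = (1 to runs).map(_ => timeMs(body)).sorted
    times(times.length / 2)
  }
}
```

In the spark-shell test above, this would wrap the `df.selectExpr(...).count()` call; separately timing a plain `df.count()` would also answer point 4 by isolating the cost of the parquet scan.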

revans2 avatar Mar 26 '24 17:03 revans2

> Might also be nice to have a follow-on issue to see if we can drop the special field name checks.

Updated; the special field name checks turned out to be safe to drop.

thirtiseven avatar Mar 27 '24 06:03 thirtiseven

build

thirtiseven avatar Mar 28 '24 09:03 thirtiseven

build

thirtiseven avatar Apr 01 '24 07:04 thirtiseven

build

thirtiseven avatar Apr 24 '24 14:04 thirtiseven

Verified again to generate the docs. It seems `./build/buildall` only generates docs per shim, not the main ones.

thirtiseven avatar Apr 24 '24 14:04 thirtiseven

build

thirtiseven avatar Apr 24 '24 14:04 thirtiseven

build

thirtiseven avatar Apr 25 '24 02:04 thirtiseven