
Use new jni kernel for getJsonObject

Open · thirtiseven opened this pull request 1 year ago · 8 comments

Fixes #10218 Fixes #10212 Fixes #10194 Fixes #10196 Fixes #10537 Fixes #10216 Fixes #10217 Fixes #9033

This PR uses the new kernel from https://github.com/NVIDIA/spark-rapids-jni/pull/1893 to replace the cudf implementation so that the plugin matches Spark's behavior.

This PR is ready for review, but some docs are out of date and will be updated soon.

Using the new kernel for json_tuple will be handled in a separate PR.
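
For context, a minimal spark-shell sketch of exercising the GPU path once this lands. The config key spark.rapids.sql.expression.GetJsonObject is assumed here from the plugin's usual per-expression enable naming; check the generated configs doc for the exact key and default in your release.

import spark.implicits._

// Assumed per-expression enable flag (standard spark-rapids naming pattern);
// verify the exact key and default for your release.
spark.conf.set("spark.rapids.sql.expression.GetJsonObject", "true")

// A simple query that should now be planned on the GPU and use the new JNI kernel.
val sample = Seq("""{"store":{"bicycle":{"price":19.95,"color":"red"}}}""").toDF("a")
sample.selectExpr("get_json_object(a, '$.store.bicycle.color')").show(false)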

Perf test:

val data = Seq.fill(3000000)("""{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"basket":[[1,2,{"b":"y","a":"x"}],[3,4],[5,6]],"book":[{"author":"Nigel Rees","title":"Sayings of the Century","category":"reference","price":8.95},{"author":"Herman Melville","title":"Moby Dick","category":"fiction","price":8.99,"isbn":"0-553-21311-3"},{"author":"J. R. R. Tolkien","title":"The Lord of the Rings","category":"fiction","reader":[{"age":25,"name":"bob"},{"age":26,"name":"jack"}],"price":22.99,"isbn":"0-395-19395-8"}],"bicycle":{"price":19.95,"color":"red"}},"email":"amy@only_for_json_udf_test.net","owner":"amy","zip code":"94025","fb:testid":"1234"}""")

import spark.implicits._
data.toDF("a").write.mode("overwrite").parquet("JSON")

val df = spark.read.parquet("JSON")

spark.time(df.selectExpr(
  "COUNT(get_json_object(a, '$.store.bicycle')) as pr0",
  "COUNT(get_json_object(a, '$.store.book[0].non_exist_key')) as pr2",
  "COUNT(get_json_object(a, '$.store.basket[0][*].b')) as pr3",
  "COUNT(get_json_object(a, '$.store.book[*].reader')) as pr4",
  "COUNT(get_json_object(a, '$.store.book[*].category')) as pr5",
  "COUNT(get_json_object(a, '$.store.basket[*]')) as pr6",
  "COUNT(get_json_object(a, '$.store.basket[0][*]')) as pr7",
  "COUNT(get_json_object(a, '$.store.basket[0][2].b')) as pr8",
  "COUNT(get_json_object(a, '$')) as pr9").show())

CPU: 10649 ms, JNI new kernel: 4820 ms, cudf (no fallback): 1527 ms

No nested paths, similar to the customer's usage:

spark.time(df.selectExpr(
  "COUNT(get_json_object(a, '$.owner')) as pr0",
  "COUNT(get_json_object(a, '$.owner')) as pr2",
  "COUNT(get_json_object(a, '$.owner')) as pr3",
  "COUNT(get_json_object(a, '$.owner')) as pr4",
  "COUNT(get_json_object(a, '$.owner')) as pr5",
  "COUNT(get_json_object(a, '$.owner')) as pr6",
  "COUNT(get_json_object(a, '$.owner')) as pr7",
  "COUNT(get_json_object(a, '$.owner')) as pr8",
  "COUNT(get_json_object(a, '$.owner')) as pr9").show())

CPU: 1038 ms, JNI new kernel: 626 ms, cudf (no fallback): 381 ms

Also closes https://github.com/NVIDIA/spark-rapids-jni/issues/1894.

thirtiseven · Mar 13 '24

Please test this: in the JNI code, add some printf calls to confirm the JNI interface is being exercised.

res-life · Mar 14 '24

Please help update the compatibility.md doc.

The following is a list of known differences.
  * [No input validation](https://github.com/NVIDIA/spark-rapids/issues/10218). If the input string
    is not valid JSON, Apache Spark returns a null result, but ours will still try to find a match.
  * [Escapes are not properly processed for Strings](https://github.com/NVIDIA/spark-rapids/issues/10196).
    When returning a result for a quoted string, Apache Spark removes the quotes and replaces
    any escape sequences with the proper characters. This escape sequence processing does not happen
    on the GPU.
  * [Invalid JSON paths could throw exceptions](https://github.com/NVIDIA/spark-rapids/issues/10212).
    If a JSON path is not valid, Apache Spark returns a null result, but ours may throw an exception
    and fail the query.
  * [Non-string output is not normalized](https://github.com/NVIDIA/spark-rapids/issues/10218).
    When returning a result for anything other than a string, Apache Spark normalizes the output
    in ways the GPU does not, such as removing unnecessary white space, parsing and then
    serializing floating point numbers, turning single quotes into double quotes, and removing
    unneeded escapes for single quotes.

The following is a list of bugs in either the GPU version or arguably in Apache Spark itself.
  * [Non-matching quotes in quoted strings](https://github.com/NVIDIA/spark-rapids/issues/10219)
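
To make the escape-handling and normalization bullets above concrete, here is a minimal spark-shell sketch. The CPU behavior described in the comments follows from the list above; the exact GPU output depends on the kernel version.

import spark.implicits._

// The JSON contains a \t escape inside a quoted string and extra whitespace between tokens.
val sample = Seq("""{ "a" : "x\ty" , "n" : 1 }""").toDF("json")

// Per the list above, CPU Spark removes the quotes and replaces \t with a real tab;
// the GPU returns the string without processing the escape sequence.
sample.selectExpr("get_json_object(json, '$.a')").show(false)

// Per the list above, CPU Spark re-serializes the matched object and drops the extra
// whitespace; the GPU does not apply the same normalization.
sample.selectExpr("get_json_object(json, '$')").show(false)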

res-life · Mar 25 '24

We should first merge the JNI PR https://github.com/NVIDIA/spark-rapids-jni/pull/1893, then merge this one.

res-life · Mar 25 '24

json_tuple was not updated to use the new get_json_object kernel

revans2 · Mar 25 '24

In my performance tests I see between a 3x and 6x slowdown compared to the old implementation, but it does the right thing most of the time, so I am happy with the results.

revans2 · Mar 25 '24

> json_tuple was not updated to use the new get_json_object kernel

I drafted another PR https://github.com/NVIDIA/spark-rapids/pull/10635 so that it does not block this P0 issue; please take a look. All currently xfailed cases now pass, but I expect the performance will be poor. Will test soon.
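
For reference, json_tuple extracts several top-level fields from one JSON string in a single call, which is what that PR wires up to the new kernel; a minimal sketch of its usage:

import spark.implicits._

val sample = Seq("""{"owner":"amy","zip code":"94025","fb:testid":"1234"}""").toDF("a")

// json_tuple pulls multiple top-level fields out of the JSON string in one pass.
sample.selectExpr("json_tuple(a, 'owner', 'zip code', 'fb:testid')").show(false)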

thirtiseven · Mar 26 '24

Depends on https://github.com/NVIDIA/spark-rapids-jni/pull/1893. All current test cases pass now.

Tested locally that cases from https://github.com/NVIDIA/spark-rapids/pull/10604 also passed.

thirtiseven · Mar 26 '24

Also added some test cases covering get_json_object logic from https://github.com/NVIDIA/spark-rapids-jni/pull/1893.

thirtiseven · Mar 27 '24

build

sameerz · Mar 27 '24

build

thirtiseven · Mar 28 '24

I will file a follow-up issue and address comments soon. Merging it now…

thirtiseven · Mar 28 '24