spark-rapids
Use new jni kernel for getJsonObject
Fixes #10218 Fixes #10212 Fixes #10194 Fixes #10196 Fixes #10537 Fixes #10216 Fixes #10217 Fixes #9033
This PR uses the new kernel from https://github.com/NVIDIA/spark-rapids-jni/pull/1893 to replace the cudf implementation so that it matches Spark's behavior.
This PR is ready for review, but some docs are out of date and will be updated soon.
Using the new kernel for json_tuple will be handled in a separate PR.
Perf test:

```scala
val data = Seq.fill(3000000)("""{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"basket":[[1,2,{"b":"y","a":"x"}],[3,4],[5,6]],"book":[{"author":"Nigel Rees","title":"Sayings of the Century","category":"reference","price":8.95},{"author":"Herman Melville","title":"Moby Dick","category":"fiction","price":8.99,"isbn":"0-553-21311-3"},{"author":"J. R. R. Tolkien","title":"The Lord of the Rings","category":"fiction","reader":[{"age":25,"name":"bob"},{"age":26,"name":"jack"}],"price":22.99,"isbn":"0-395-19395-8"}],"bicycle":{"price":19.95,"color":"red"}},"email":"amy@only_for_json_udf_test.net","owner":"amy","zip code":"94025","fb:testid":"1234"}""")
import spark.implicits._
data.toDF("a").write.mode("overwrite").parquet("JSON")
val df = spark.read.parquet("JSON")
spark.time(df.selectExpr(
  "COUNT(get_json_object(a, '$.store.bicycle')) as pr0",
  "COUNT(get_json_object(a, '$.store.book[0].non_exist_key')) as pr2",
  "COUNT(get_json_object(a, '$.store.basket[0][*].b')) as pr3",
  "COUNT(get_json_object(a, '$.store.book[*].reader')) as pr4",
  "COUNT(get_json_object(a, '$.store.book[*].category')) as pr5",
  "COUNT(get_json_object(a, '$.store.basket[*]')) as pr6",
  "COUNT(get_json_object(a, '$.store.basket[0][*]')) as pr7",
  "COUNT(get_json_object(a, '$.store.basket[0][2].b')) as pr8",
  "COUNT(get_json_object(a, '$')) as pr9"
).show())
```
CPU: 10649 ms, JNI new kernel: 4820 ms, cudf without fallback: 1527 ms
No nested paths, similar to the customer's usage:

```scala
spark.time(df.selectExpr(
  "COUNT(get_json_object(a, '$.owner')) as pr0",
  "COUNT(get_json_object(a, '$.owner')) as pr2",
  "COUNT(get_json_object(a, '$.owner')) as pr3",
  "COUNT(get_json_object(a, '$.owner')) as pr4",
  "COUNT(get_json_object(a, '$.owner')) as pr5",
  "COUNT(get_json_object(a, '$.owner')) as pr6",
  "COUNT(get_json_object(a, '$.owner')) as pr7",
  "COUNT(get_json_object(a, '$.owner')) as pr8",
  "COUNT(get_json_object(a, '$.owner')) as pr9"
).show())
```
CPU: 1038 ms, JNI new kernel: 626 ms, cudf without fallback: 381 ms
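For reference, the speedup ratios implied by the timings above (plain arithmetic on the quoted numbers; the helper function is only for illustration):

```python
# Speedup ratios from the timings quoted above. A value below 1.0 means the
# new JNI kernel is slower than that baseline.
def speedup(baseline_ms: float, new_ms: float) -> float:
    return baseline_ms / new_ms

nested_vs_cpu = speedup(10649, 4820)   # nested-path query, new kernel vs Spark CPU
flat_vs_cpu = speedup(1038, 626)       # flat-path query, new kernel vs Spark CPU
nested_vs_cudf = speedup(1527, 4820)   # new kernel vs old cudf kernel (no fallback)

print(f"nested vs CPU:  {nested_vs_cpu:.2f}x")   # ~2.21x faster than Spark CPU
print(f"flat vs CPU:    {flat_vs_cpu:.2f}x")     # ~1.66x faster than Spark CPU
print(f"nested vs cudf: {nested_vs_cudf:.2f}x")  # ~0.32x, i.e. slower than the old kernel
```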
also closes https://github.com/NVIDIA/spark-rapids-jni/issues/1894
Please test this:
In the JNI code, add some printf calls to ensure the JNI interface is working.
Help update the doc compatibility.md.
The following is a list of known differences.
* [No input validation](https://github.com/NVIDIA/spark-rapids/issues/10218). If the input string
  is not valid JSON, Apache Spark returns a null result, but ours will still try to find a match.
* [Escapes are not properly processed for Strings](https://github.com/NVIDIA/spark-rapids/issues/10196).
  When returning a result for a quoted string, Apache Spark will remove the quotes and replace
  any escape sequences with the proper characters. This escape-sequence processing does not
  happen on the GPU.
* [Invalid JSON paths could throw exceptions](https://github.com/NVIDIA/spark-rapids/issues/10212).
  If a JSON path is not valid, Apache Spark returns a null result, but ours may throw an exception
  and fail the query.
* [Non-string output is not normalized](https://github.com/NVIDIA/spark-rapids/issues/10218).
  When returning a result for anything other than a string, Apache Spark normalizes a number of
  things that the GPU does not: it removes unnecessary white space, parses and then serializes
  floating-point numbers, turns single quotes into double quotes, and removes unneeded escapes
  for single quotes.
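To make the escape-sequence difference above concrete, here is a small Python sketch (illustrative only, not the plugin's code): `json.loads` stands in for the quote stripping and escape decoding that Spark's CPU path performs on a matched string, while the GPU kernel historically returned the raw token untouched.

```python
import json

# Raw JSON string token as it might appear in the input column.
raw_token = '"line1\\nline2"'

# CPU-style result: quotes removed, the \n escape decoded to a real newline.
cpu_style = json.loads(raw_token)

# GPU-style result (per the issue above): the raw token, escapes untouched.
gpu_style = raw_token

print(repr(cpu_style))  # 'line1\nline2'
print(repr(gpu_style))  # '"line1\\nline2"'
```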
The following is a list of bugs in either the GPU version or, arguably, in Apache Spark itself.
* [Non-matching quotes in quoted strings](https://github.com/NVIDIA/spark-rapids/issues/10219)
The JNI PR https://github.com/NVIDIA/spark-rapids-jni/pull/1893 should be merged first, then this one.
json_tuple was not updated to use the new get_json_object kernel
In my performance tests I see a 3x to 6x slowdown compared to the old implementation, but it does the right thing most of the time, so I am happy with the results.
> json_tuple was not updated to use the new get_json_object kernel
I drafted another PR, https://github.com/NVIDIA/spark-rapids/pull/10635, so that it does not block this P0 issue; please take a look. All currently xfailed cases now pass, but I suspect the performance will be bad. Will test soon.
Depends on https://github.com/NVIDIA/spark-rapids-jni/pull/1893. All current test cases pass now.
Tested locally that the cases from https://github.com/NVIDIA/spark-rapids/pull/10604 also pass.
Also added some test cases related to the get_json_object logic from https://github.com/NVIDIA/spark-rapids-jni/pull/1893.
build
build
I will file a follow-up issue and address comments soon. Merging it now…