[BUG] GetJsonObject does not validate the input is JSON in the same way as Spark
Describe the bug
The current GPU implementation of GetJsonObject does not check if the JSON data is valid. The CPU version uses a JSON parser that allows single quotes and unescaped control characters.
https://github.com/apache/spark/blob/a3266b411723310ec10fc1843ddababc15249db0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L108-L114
If there are any errors when parsing the data then the result is converted to a null.
https://github.com/apache/spark/blob/a3266b411723310ec10fc1843ddababc15249db0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L267
We probably need to do at least some validation that the data is correct.
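As a rough illustration of the CPU behavior described above, here is a minimal sketch that builds a Jackson factory with the same two leniency options the linked Spark code enables and reports whether an input tokenizes without error. The object and method names (LenientJsonCheckSketch, parsesCleanly) are hypothetical, it only checks validity and does not evaluate the JSON path, and it assumes Jackson 2.10 or later is on the classpath (as it is in spark-shell).

```scala
import com.fasterxml.jackson.core.{JsonFactoryBuilder, JsonProcessingException}
import com.fasterxml.jackson.core.json.JsonReadFeature

object LenientJsonCheckSketch {
  // The same two leniency options that the linked Spark code enables.
  private val factory = new JsonFactoryBuilder()
    .enable(JsonReadFeature.ALLOW_UNESCAPED_CONTROL_CHARS)
    .enable(JsonReadFeature.ALLOW_SINGLE_QUOTES)
    .build()

  /** Hypothetical helper: true if Jackson can tokenize the whole input without error. */
  def parsesCleanly(json: String): Boolean = {
    val parser = factory.createParser(json)
    try {
      while (parser.nextToken() != null) {} // consume every token
      true
    } catch {
      // This is the kind of error that Spark converts into a null result.
      case _: JsonProcessingException => false
    } finally {
      parser.close()
    }
  }
}
```

Under these assumptions, the single-quoted and tab-containing inputs from the repro below should pass, while the trailing-comma inputs should fail.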
Steps/Code to reproduce bug
scala> val df = Seq("""{"url":"http://test.com"}""","{'url':'http://test.com'}","{'url':'http://test.com\t'}", """{"url":"http://"http://test.com"","something":1234"}""","""{"url":"http://test.com","something":1234, ""","""{"url":"http://test.com",}""","""{"url":"http://test.com",,}""","""[{"url": "http://test.com"}]""").toDF("jsonstr")
df: org.apache.spark.sql.DataFrame = [jsonstr: string]
scala> df.repartition(1).selectExpr("get_json_object(jsonstr, '$.url') as url").show(false)
+---------------+
|url            |
+---------------+
|http://test.com|
|null           |
|null           |
|http://        |
|http://test.com|
|http://test.com|
|http://test.com|
|null           |
+---------------+
scala> spark.conf.set("spark.rapids.sql.enabled", false)
scala> df.repartition(1).selectExpr("get_json_object(jsonstr, '$.url') as url").show(false)
+-----------------+
|url              |
+-----------------+
|http://test.com  |
|http://test.com  |
|http://test.com\t|
|null             |
|null             |
|null             |
|null             |
|null             |
+-----------------+
Expected behavior
We produce the same results as Spark on the CPU.
Be aware that Spark is also a little odd in how it handles escape sequence validation. The only characters that I have found that are accepted after a backslash are ", ', /, \, b, f, n, r, and t; all other characters appear to cause the validation to fail.
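The escape rule above can be sketched as a small check over the contents of a string value. This is only an illustration of the observation in this issue, not the actual parser logic: the names (EscapeCheckSketch, hasOnlyAllowedEscapes) are hypothetical, and the allowed set is copied from the list above, so anything not on that list (including \uXXXX, which the list does not mention) is rejected here by assumption.

```scala
object EscapeCheckSketch {
  // Characters reported in this issue as accepted after a backslash.
  private val allowedEscapes: Set[Char] =
    Set('"', '\'', '/', '\\', 'b', 'f', 'n', 'r', 't')

  /** Returns true if every backslash in `body` is followed by an allowed character. */
  def hasOnlyAllowedEscapes(body: String): Boolean = {
    var i = 0
    while (i < body.length) {
      if (body.charAt(i) == '\\') {
        // A trailing backslash or an unlisted character fails validation.
        if (i + 1 >= body.length || !allowedEscapes.contains(body.charAt(i + 1))) {
          return false
        }
        i += 2 // skip the backslash and the escaped character
      } else {
        i += 1
      }
    }
    true
  }
}
```

For example, hasOnlyAllowedEscapes("a\\nb") accepts the two-character escape \n, while hasOnlyAllowedEscapes("a\\qb") rejects \q.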
PR: https://github.com/NVIDIA/spark-rapids-jni/pull/1868 will fix this issue.