[BUG] GetJsonObject does not validate the input is JSON in the same way as Spark

Open revans2 opened this issue 2 years ago • 2 comments

Describe the bug

The current GPU implementation of GetJsonObject does not check if the JSON data is valid. The CPU version uses a JSON parser that allows single quotes and unescaped control characters.

https://github.com/apache/spark/blob/a3266b411723310ec10fc1843ddababc15249db0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L108-L114

If there are any errors when parsing the data then the result is converted to a null.

https://github.com/apache/spark/blob/a3266b411723310ec10fc1843ddababc15249db0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L267

We probably need to do at least some validation that the data is correct.
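To make the CPU-side rules concrete, here is a minimal sketch of a standalone check built on Jackson with the same two read features that the linked Spark code enables. The object name `LenientJsonCheck` and the helper `parses` are hypothetical and used only for illustration. This is not Spark's actual code path: it only tokenizes the whole string and reports whether Jackson raised an error, and it assumes Jackson 2.10 or later is on the classpath (as it is in spark-shell).

```scala
import com.fasterxml.jackson.core.{JsonFactoryBuilder, JsonProcessingException}
import com.fasterxml.jackson.core.json.JsonReadFeature

object LenientJsonCheck {
  // The two read features enabled on the shared factory in the linked Spark source.
  private val factory = new JsonFactoryBuilder()
    .enable(JsonReadFeature.ALLOW_UNESCAPED_CONTROL_CHARS)
    .enable(JsonReadFeature.ALLOW_SINGLE_QUOTES)
    .build()

  // Hypothetical helper: true if Jackson can tokenize the entire input without an error.
  def parses(json: String): Boolean = {
    val parser = factory.createParser(json)
    try {
      // nextToken() returns null once the input is exhausted.
      while (parser.nextToken() != null) {}
      true
    } catch {
      // Malformed input (including unexpected end of input) surfaces as a parse exception.
      case _: JsonProcessingException => false
    } finally {
      parser.close()
    }
  }
}
```

Under these settings the single-quoted row and the row containing a raw tab should tokenize cleanly, while the row with a trailing comma should not. Note that a string can pass this check and still yield null from get_json_object when the path does not match, as with the array row in the reproduction.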

Steps/Code to reproduce bug

scala> val df = Seq("""{"url":"http://test.com"}""","{'url':'http://test.com'}","{'url':'http://test.com\t'}", """{"url":"http://"http://test.com"","something":1234"}""","""{"url":"http://test.com","something":1234, ""","""{"url":"http://test.com",}""","""{"url":"http://test.com",,}""","""[{"url": "http://test.com"}]""").toDF("jsonstr")
df: org.apache.spark.sql.DataFrame = [jsonstr: string]

scala> df.repartition(1).selectExpr("get_json_object(jsonstr, '$.url') as url").show(false)

+---------------+
|url            |
+---------------+
|http://test.com|
|null           |
|null           |
|http://        |
|http://test.com|
|http://test.com|
|http://test.com|
|null           |
+---------------+


scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> df.repartition(1).selectExpr("get_json_object(jsonstr, '$.url') as url").show(false)
+-----------------+
|url              |
+-----------------+
|http://test.com  |
|http://test.com  |
|http://test.com\t|
|null             |
|null             |
|null             |
|null             |
|null             |
+-----------------+

Expected behavior

We produce the same results as Spark on the CPU.

revans2, Jan 12 '24 17:01

Be aware that this is a little odd in how it handles escape sequence validation too. The only characters that I have found that can follow a backslash in an escape are ", ', /, \, b, f, n, r, and t; all other characters appear to cause the validation to fail.
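The escape rule described above can be sketched as a small scanner. This is only an illustration: `EscapeCheck` and `escapesLookValid` are hypothetical names, the allowed set is copied from the comment as reported rather than verified against the parser, and anything outside that list (for example \u sequences) is simply rejected here.

```scala
object EscapeCheck {
  // Characters reported in the comment above as accepted after a backslash.
  private val allowedAfterBackslash: Set[Char] =
    Set('"', '\'', '/', '\\', 'b', 'f', 'n', 'r', 't')

  // Hypothetical helper: scans the content of a JSON string (without its
  // surrounding quotes) and returns false at the first backslash that is
  // either dangling or followed by a character outside the allowed set.
  def escapesLookValid(content: String): Boolean = {
    var i = 0
    while (i < content.length) {
      if (content.charAt(i) == '\\') {
        val hasNext = i + 1 < content.length
        if (!hasNext || !allowedAfterBackslash(content.charAt(i + 1))) return false
        i += 2 // skip the backslash and the escaped character
      } else {
        i += 1
      }
    }
    true
  }
}
```

For example, `EscapeCheck.escapesLookValid("a\\tb")` accepts a backslash followed by t, while `EscapeCheck.escapesLookValid("a\\qb")` rejects a backslash followed by q.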

revans2, Jan 12 '24 19:01

PR: https://github.com/NVIDIA/spark-rapids-jni/pull/1868 will fix this issue.

res-life, Mar 18 '24 09:03