
[BUG] GetJsonObject does not process escape sequences in returned strings or queries

Open · revans2 opened this issue 2 years ago · 3 comments

Describe the bug

GetJsonObject on the CPU, when returning a string, processes any of the following escape sequences and turns them into the desired output (a minimal sketch of that unescaping follows the list):

  • \" => "
  • \' => '
  • \/ => /
  • \\ => \
  • \b => ASCII CHAR 0x08
  • \f => ASCII CHAR 0x0C
  • \n => ASCII CHAR 0x0A
  • \t => ASCII CHAR 0x09
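
For reference, here is a minimal Scala sketch of that unescaping. It is a hypothetical helper for illustration only, not the actual Spark implementation:

def unescapeJson(s: String): String = {
  val out = new StringBuilder
  var i = 0
  while (i < s.length) {
    if (s.charAt(i) == '\\' && i + 1 < s.length) {
      out.append(s.charAt(i + 1) match {
        case '"'  => '"'
        case '\'' => '\''
        case '/'  => '/'
        case '\\' => '\\'
        case 'b'  => '\b' // ASCII 0x08
        case 'f'  => '\f' // ASCII 0x0C
        case 'n'  => '\n' // ASCII 0x0A
        case 't'  => '\t' // ASCII 0x09
        case c    => c    // pass anything unrecognized through
      })
      i += 2
    } else {
      out.append(s.charAt(i))
      i += 1
    }
  }
  out.toString
}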

GetJsonObject also does not process these escapes for the query string, or for the keys when matching against a query string, and it does not handle unescaped control characters in the keys at all.

Note that it can be a little hard to tell whether this is happening, because show in Spark adds many of the escapes back in. It is also hard to write an escape sequence without Scala processing it first.
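
For example, in the Scala REPL a regular string literal already interprets \t, while a triple-quoted literal keeps the two raw characters:

scala> "\t".getBytes("UTF-8").toList
res0: List[Byte] = List(9)

scala> """\t""".getBytes("UTF-8").toList
res1: List[Byte] = List(92, 116)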

Steps/Code to reproduce bug

For queries and query strings.

scala> Seq("""{"t\t":"t"}""", "{'t\t':'t'}").toDF("jsonstr").repartition(1).selectExpr("get_json_object(jsonstr,'$.t\t') as t1", "get_json_object(jsonstr,'$.t\\t') as t2").show()

+----+----+
|  t1|  t2|
+----+----+
|null|null|
|null|null|
+----+----+


scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> Seq("""{"t\t":"t"}""", "{'t\t':'t'}").toDF("jsonstr").repartition(1).selectExpr("get_json_object(jsonstr,'$.t\t') as t1", "get_json_object(jsonstr,'$.t\\t') as t2").show()
+---+---+
| t1| t2|
+---+---+
|  t|  t|
|  t|  t|
+---+---+
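
Note that in the paths above Scala has already turned \t into a real tab for t1, while \\t leaves a literal backslash-t for t2. The CPU finds the value for both paths on both rows, while the GPU returns null everywhere.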

For escaped values in the result.

scala> val data = Seq("""{"t":"\""}""","""{"t":'\"'}""","""{"t":"\'"}""","""{"t":'\''}""","""{"t":"\/"}""","""{"t":"\\"}""","""{"t":"\b"}""","""{"t":"\f"}""","""{"t":"\n"}""","""{"t":"\t"}""")

scala> data.toDF("jsonstr").repartition(1).selectExpr("get_json_object(jsonstr,'$.t') as t").collect.foreach{ row => System.err.println(Option(row.getString(0)).map(s => s.getBytes("UTF8").toList))}

Some(List(92, 34))
None
None
None
Some(List(92, 47))
Some(List(92, 92))
Some(List(92, 98))
Some(List(92, 102))
Some(List(92, 110))
Some(List(92, 116))

scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> data.toDF("jsonstr").repartition(1).selectExpr("get_json_object(jsonstr,'$.t') as t").collect.foreach{ row => System.err.println(Option(row.getString(0)).map(s => s.getBytes("UTF8").toList))}
Some(List(34))
Some(List(34))
Some(List(39))
Some(List(39))
Some(List(47))
Some(List(92))
Some(List(8))
Some(List(12))
Some(List(10))
Some(List(9))
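
Reading the byte dumps: the CPU produces the single unescaped character in each case (34 = ", 39 = ', 47 = /, 92 = \, 8 = \b, 12 = \f, 10 = \n, 9 = \t), while the GPU output above either keeps the two-byte escape sequence (each list begins with 92, the backslash) or returns null.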

Expected behavior

We match Spark exactly in these cases.

revans2 avatar Jan 12 '24 19:01 revans2

Actually, I have done some more research, and Spark does not support escape sequences at all in JSON path queries. It just keeps the quoted string as-is. So $['\''] is actually an invalid path, because the parser only reads until it sees the next single-quote character.
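
A minimal sketch of that behavior (hypothetical, not Spark's actual path parser): the quoted-name rule simply scans ahead to the next single quote, with no escape handling.

// Hypothetical illustration of how a quoted name in a path like $['name']
// could be read: scanning stops at the first single quote, so a \' inside
// the name ends the token early and the rest of the path fails to parse.
def readQuotedName(path: String, start: Int): Option[(String, Int)] = {
  val end = path.indexOf('\'', start)
  if (end < 0) None
  else Some((path.substring(start, end), end + 1)) // (name, index after the quote)
}

Under this rule $['\''] reads the name as a lone backslash and then hits a stray ' before the closing ], which is why the path is invalid.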

revans2 avatar Jan 18 '24 17:01 revans2

Is this similar to #9033?

sameerz avatar Jan 20 '24 04:01 sameerz

Will be fixed by PR: https://github.com/NVIDIA/spark-rapids-jni/pull/1868

res-life avatar Mar 19 '24 01:03 res-life