[BUG] GetJsonObject does not process escape sequences in returned strings or queries
**Describe the bug**

GetJsonObject on the CPU, when returning a string, processes escape sequences and turns them into the desired output:
- `\"` => `"`
- `\'` => `'`
- `\/` => `/`
- `\\` => `\`
- `\b` => ASCII char 0x08
- `\f` => ASCII char 0x0C
- `\n` => ASCII char 0x0A
- `\t` => ASCII char 0x09
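To make the expected mapping concrete, here is a minimal, illustrative sketch of the unescaping the CPU path performs on returned strings. `unescape` is a hypothetical helper written for this issue, not Spark's actual implementation:

```scala
// Hypothetical sketch of the escape handling described above; this is NOT
// Spark's real code, just an illustration of the expected mapping.
def unescape(s: String): String = {
  val out = new StringBuilder
  var i = 0
  while (i < s.length) {
    if (s.charAt(i) == '\\' && i + 1 < s.length) {
      s.charAt(i + 1) match {
        case '"'  => out += '"'
        case '\'' => out += '\''
        case '/'  => out += '/'
        case '\\' => out += '\\'
        case 'b'  => out += '\b'  // 0x08
        case 'f'  => out += '\f'  // 0x0C
        case 'n'  => out += '\n'  // 0x0A
        case 't'  => out += '\t'  // 0x09
        case c    => out += '\\' += c  // pass unknown escapes through
      }
      i += 2
    } else {
      out += s.charAt(i)
      i += 1
    }
  }
  out.toString
}
```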
The GPU version, by contrast, does not process these escapes in returned strings. It also does not process them in the query string, or in the keys it compares against the query, and it does not handle unescaped control characters in keys at all.
Note that it can be a little hard to tell whether this is happening, because `show` in Spark adds many of the escapes back in. It is also hard to write an escape sequence without Scala processing it first.
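A quick illustration of the Scala-literal pitfall: plain string literals interpret escapes, while triple-quoted literals keep the backslash, which is why the repros below use `"""..."""` to build the JSON input:

```scala
// Scala interprets escapes in plain string literals, but leaves them
// untouched in triple-quoted literals.
val interpreted = "t\tt"      // 3 chars: 't', TAB, 't'
val raw = """t\tt"""          // 4 chars: 't', '\', 't', 't'
```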
**Steps/Code to reproduce bug**

For queries and query strings:
```
scala> Seq("""{"t\t":"t"}""", "{'t\t':'t'}").toDF("jsonstr").repartition(1).selectExpr("get_json_object(jsonstr,'$.t\t') as t1", "get_json_object(jsonstr,'$.t\\t') as t2").show()
+----+----+
|  t1|  t2|
+----+----+
|null|null|
|null|null|
+----+----+

scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> Seq("""{"t\t":"t"}""", "{'t\t':'t'}").toDF("jsonstr").repartition(1).selectExpr("get_json_object(jsonstr,'$.t\t') as t1", "get_json_object(jsonstr,'$.t\\t') as t2").show()
+---+---+
| t1| t2|
+---+---+
|  t|  t|
|  t|  t|
+---+---+
```
For escaped values in the result:
```
scala> val data = Seq("""{"t":"\""}""","""{"t":'\"'}""","""{"t":"\'"}""","""{"t":'\''}""","""{"t":"\/"}""","""{"t":"\\"}""","""{"t":"\b"}""","""{"t":"\f"}""","""{"t":"\n"}""","""{"t":"\t"}""")

scala> data.toDF("jsonstr").repartition(1).selectExpr("get_json_object(jsonstr,'$.t') as t").collect.foreach{ row => System.err.println(Option(row.getString(0)).map(s => s.getBytes("UTF8").toList))}
Some(List(92, 34))
None
None
None
Some(List(92, 47))
Some(List(92, 92))
Some(List(92, 98))
Some(List(92, 102))
Some(List(92, 110))
Some(List(92, 116))

scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> data.toDF("jsonstr").repartition(1).selectExpr("get_json_object(jsonstr,'$.t') as t").collect.foreach{ row => System.err.println(Option(row.getString(0)).map(s => s.getBytes("UTF8").toList))}
Some(List(34))
Some(List(34))
Some(List(39))
Some(List(39))
Some(List(47))
Some(List(92))
Some(List(8))
Some(List(12))
Some(List(10))
Some(List(9))
```
**Expected behavior**

We should match Spark exactly in these cases.
Actually, I have done some more research, and Spark does not support escape sequences at all in JSON path queries. It keeps the quoted string as is, so `$['\'']` is actually an invalid path because Spark only scans until it sees the next single-quote character.
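To illustrate that behavior, here is a hypothetical sketch (not Spark's actual parser) of how a quoted segment in a JSON path is taken verbatim up to the next single quote, with no escape handling, so the key extracted from `$['\'']` would be just the backslash:

```scala
// Illustrative sketch only: scan a quoted path segment verbatim up to the
// next single quote. `parseQuotedSegment` is a hypothetical helper.
// Returns the segment text and the index just past the closing quote,
// or None if the quote is never closed.
def parseQuotedSegment(path: String, start: Int): Option[(String, Int)] = {
  // `start` points at the opening single quote
  val end = path.indexOf('\'', start + 1)
  if (end < 0) None else Some((path.substring(start + 1, end), end + 1))
}
```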
Is this similar to #9033?
Will be fixed by PR: https://github.com/NVIDIA/spark-rapids-jni/pull/1868