[BUG] GpuGetJsonObject does not expand escaped characters
Describe the bug
If I have a string with escaped characters in it, the CPU version of GetJsonObject will interpret those escaped characters, but the GPU version will not.
For example, if I have the file test.tsv:
label json
1 {"A": "B"}
2 {"A": "B\nB"}
3 {"A": "\u7CFB\u7D71"}
4 {"A": "\u7CFB\t\u7D71"}
If I run the following commands, I get different results on the CPU and the GPU.
scala> spark.conf.set("spark.rapids.sql.enabled", true)
scala> spark.read.options(Map("sep"->"\t", "header"->"true")).csv("./test.tsv").selectExpr("*", "get_json_object(json, '$.A')").show(false)
+-----+-----------------------+--------------------------+
|label|json |get_json_object(json, $.A)|
+-----+-----------------------+--------------------------+
|1 |{"A": "B"} |B |
|2 |{"A": "B\nB"} |B\nB |
|3 |{"A": "\u7CFB\u7D71"} |\u7CFB\u7D71 |
|4 |{"A": "\u7CFB\t\u7D71"}|\u7CFB\t\u7D71 |
+-----+-----------------------+--------------------------+
scala> spark.conf.set("spark.rapids.sql.enabled", false)
scala> spark.read.options(Map("sep"->"\t", "header"->"true")).csv("./test.tsv").selectExpr("*", "get_json_object(json, '$.A')").show(false)
+-----+-----------------------+--------------------------+
|label|json |get_json_object(json, $.A)|
+-----+-----------------------+--------------------------+
|1 |{"A": "B"} |B |
|2 |{"A": "B\nB"} |B\nB |
|3 |{"A": "\u7CFB\u7D71"} |系統 |
|4 |{"A": "\u7CFB\t\u7D71"}|系\t統 |
+-----+-----------------------+--------------------------+
But don't let the '\n' and '\t' fool you. That is just show() cleaning things up for us and re-escaping the results.
scala> spark.conf.set("spark.rapids.sql.enabled", true)
scala> spark.read.options(Map("sep"->"\t", "header"->"true")).csv("./test.tsv").selectExpr("get_json_object(json, '$.A') as parsed").collect.foreach(System.out.println)
[B]
[B\nB]
[\u7CFB\u7D71]
[\u7CFB\t\u7D71]
scala> spark.conf.set("spark.rapids.sql.enabled", false)
scala> spark.read.options(Map("sep"->"\t", "header"->"true")).csv("./test.tsv").selectExpr("get_json_object(json, '$.A') as parsed").collect.foreach(System.out.println)
[B]
[B
B]
[系統]
[系 統]
I don't think it would be too hard to post-process the returned data, but it might be better to do it in the get_json_object kernel itself; that way we can tell whether the data is inside quotes or not. We might also need it to properly match the key, if the key is escaped in some way.
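To illustrate the post-processing option, here is a minimal sketch in Scala of a decoder for JSON backslash escapes (the name unescapeJsonString is hypothetical and this is not the actual plugin or kernel code); it mirrors the escape table in the Jackson parser quoted below.

def unescapeJsonString(s: String): String = {
  val sb = new StringBuilder(s.length)
  var i = 0
  while (i < s.length) {
    val c = s.charAt(i)
    if (c == '\\' && i + 1 < s.length) {
      s.charAt(i + 1) match {
        case 'b' => sb.append('\b'); i += 2
        case 't' => sb.append('\t'); i += 2
        case 'n' => sb.append('\n'); i += 2
        case 'f' => sb.append('\f'); i += 2
        case 'r' => sb.append('\r'); i += 2
        case '"' | '/' | '\\' => sb.append(s.charAt(i + 1)); i += 2
        case 'u' if i + 6 <= s.length =>
          // a hex escape: decode the next four hex digits into a single char
          sb.append(Integer.parseInt(s.substring(i + 2, i + 6), 16).toChar); i += 6
        case other =>
          // unrecognized escape: keep both characters untouched
          sb.append(c).append(other); i += 2
      }
    } else {
      sb.append(c); i += 1
    }
  }
  sb.toString
}

Applied to the raw value returned for row 4, this would produce 系 followed by a tab and 統, matching the CPU output above; doing it inside the kernel instead would avoid a second pass over the strings and would let us know whether the value came from a quoted string.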
I found the code that handles escapes in the JSON parser that Spark uses (Jackson). It is not easy to link to directly because GitHub has a rendering size limit and this file is very large.
@Override
protected char _decodeEscaped() throws IOException
{
    if (_inputPtr >= _inputEnd) {
        if (!_loadMore()) {
            _reportInvalidEOF(" in character escape sequence", JsonToken.VALUE_STRING);
        }
    }
    int c = (int) _inputBuffer[_inputPtr++];

    switch (c) {
    // First, ones that are mapped
    case 'b':
        return '\b';
    case 't':
        return '\t';
    case 'n':
        return '\n';
    case 'f':
        return '\f';
    case 'r':
        return '\r';

    // And these are to be returned as they are
    case '"':
    case '/':
    case '\\':
        return (char) c;

    case 'u': // and finally hex-escaped
        break;

    default:
        return _handleUnrecognizedCharacterEscape((char) _decodeCharForError(c));
    }

    // Ok, a hex escape. Need 4 characters
    int value = 0;
    for (int i = 0; i < 4; ++i) {
        if (_inputPtr >= _inputEnd) {
            if (!_loadMore()) {
                _reportInvalidEOF(" in character escape sequence", JsonToken.VALUE_STRING);
            }
        }
        int ch = _inputBuffer[_inputPtr++];
        int digit = CharTypes.charToHex(ch);
        if (digit < 0) {
            _reportUnexpectedChar(ch & 0xFF, "expected a hex-digit for character escape sequence");
        }
        value = (value << 4) | digit;
    }
    return (char) value;
}
It is in https://raw.githubusercontent.com/FasterXML/jackson-core/2ef15d37c75c33b68a1c5325f23b20c3b2a128d3/src/main/java/com/fasterxml/jackson/core/json/UTF8StreamJsonParser.java
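As a quick standalone sanity check of that behavior through Jackson's public API (the object name below is arbitrary; this is only an illustration, not Spark's internal code path), something like this shows the escape being expanded:

import com.fasterxml.jackson.databind.ObjectMapper

object JacksonEscapeCheck {
  def main(args: Array[String]): Unit = {
    val mapper = new ObjectMapper()
    // Raw JSON text containing a backslash followed by 'n', like row 2 of test.tsv.
    val raw = "{\"A\": \"B\\nB\"}"
    // Jackson decodes the escape while reading the string token, so this prints
    // "B" and "B" on two lines; the hex escapes in rows 3 and 4 go through the
    // hex branch shown above and come out as the actual characters.
    println(mapper.readTree(raw).get("A").asText())
  }
}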
Hi @nvdbaranec, I suppose you are not working on this issue, so I am assigning it to Chong. Please let me know if that's not correct.
You are correct.
Will be fixed by PR: https://github.com/NVIDIA/spark-rapids-jni/pull/1868