
[BUG] GpuGetJsonObject does not expand escaped characters

Open revans2 opened this issue 2 years ago • 4 comments

Describe the bug
If I have a string with escaped characters in it, the CPU version of GetJsonObject will interpret those escaped characters, but the GPU version will not.

For example, suppose I have the file test.tsv:

label	json
1	{"A": "B"}
2	{"A": "B\nB"}
3	{"A": "\u7CFB\u7D71"}
4	{"A": "\u7CFB\t\u7D71"}
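The file above can be generated with a short program (my addition, not from the issue); the point worth making explicit is that the backslash sequences must be literal two-character sequences in the file, not already-expanded control characters, so that get_json_object is the one doing the decoding:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class MakeTestTsv {
    public static void main(String[] args) throws Exception {
        // "\\n" and "\\u7CFB" are a literal backslash followed by 'n' or
        // 'u7CFB' in the written file, not a newline or the character 系.
        List<String> lines = List.of(
            "label\tjson",
            "1\t{\"A\": \"B\"}",
            "2\t{\"A\": \"B\\nB\"}",
            "3\t{\"A\": \"\\u7CFB\\u7D71\"}",
            "4\t{\"A\": \"\\u7CFB\\t\\u7D71\"}");
        Files.write(Paths.get("test.tsv"), lines);
    }
}
```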

If I run the following commands I get different results on the CPU and the GPU.

scala> spark.conf.set("spark.rapids.sql.enabled", true)
scala> spark.read.options(Map("sep"->"\t", "header"->"true")).csv("./test.tsv").selectExpr("*", "get_json_object(json, '$.A')").show(false)
+-----+-----------------------+--------------------------+
|label|json                   |get_json_object(json, $.A)|
+-----+-----------------------+--------------------------+
|1    |{"A": "B"}             |B                         |
|2    |{"A": "B\nB"}          |B\nB                      |
|3    |{"A": "\u7CFB\u7D71"}  |\u7CFB\u7D71              |
|4    |{"A": "\u7CFB\t\u7D71"}|\u7CFB\t\u7D71            |
+-----+-----------------------+--------------------------+

scala> spark.conf.set("spark.rapids.sql.enabled", false)
scala> spark.read.options(Map("sep"->"\t", "header"->"true")).csv("./test.tsv").selectExpr("*", "get_json_object(json, '$.A')").show(false)
+-----+-----------------------+--------------------------+
|label|json                   |get_json_object(json, $.A)|
+-----+-----------------------+--------------------------+
|1    |{"A": "B"}             |B                         |
|2    |{"A": "B\nB"}          |B\nB                      |
|3    |{"A": "\u7CFB\u7D71"}  |系統                      |
|4    |{"A": "\u7CFB\t\u7D71"}|系\t統                    |
+-----+-----------------------+--------------------------+

But don't let the '\n' and '\t' fool you. That is just `show` cleaning things up for us and re-escaping the results.

scala> spark.conf.set("spark.rapids.sql.enabled", true)
scala> spark.read.options(Map("sep"->"\t", "header"->"true")).csv("./test.tsv").selectExpr("get_json_object(json, '$.A') as parsed").collect.foreach(System.out.println)
[B]
[B\nB]
[\u7CFB\u7D71]
[\u7CFB\t\u7D71]

scala> spark.conf.set("spark.rapids.sql.enabled", false)
scala> spark.read.options(Map("sep"->"\t", "header"->"true")).csv("./test.tsv").selectExpr("get_json_object(json, '$.A') as parsed").collect.foreach(System.out.println)
[B]
[B
B]
[系統]
[系	統]
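As a standalone sanity check (my addition, not from the issue), the \u escapes in the example data decode to exactly the characters the CPU path produces: U+7CFB is 系 and U+7D71 is 統.

```java
public class HexEscapeCheck {
    public static void main(String[] args) {
        // Parse the four hex digits of each escape and cast to a char,
        // which is what a conforming JSON parser does for \uXXXX.
        char a = (char) Integer.parseInt("7CFB", 16);
        char b = (char) Integer.parseInt("7D71", 16);
        System.out.println("" + a + b); // prints 系統
    }
}
```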

I don't think it would be too hard to post-process the returned data, but it might be better to do it in the get_json_object kernel itself; that way we can tell whether the data is in quotes or not. We might also need it to properly match the key, if the key is escaped in some way.
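A minimal sketch of the post-processing idea above, assuming a hypothetical helper named `unescape` (the name and error handling are my assumptions; per the comment, the real fix would likely live in the kernel instead, where quoting context is available):

```java
public class JsonUnescape {
    // Expand JSON escape sequences in an already extracted string,
    // mirroring the JSON spec: \b \t \n \f \r \" \/ \\ and \uXXXX.
    public static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 >= s.length()) {
                out.append(c);
                continue;
            }
            char e = s.charAt(++i);
            switch (e) {
                case 'b': out.append('\b'); break;
                case 't': out.append('\t'); break;
                case 'n': out.append('\n'); break;
                case 'f': out.append('\f'); break;
                case 'r': out.append('\r'); break;
                case '"': case '/': case '\\': out.append(e); break;
                case 'u':
                    // Exactly four hex digits follow; a robust version
                    // would bounds-check and validate them.
                    out.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                    i += 4;
                    break;
                default:
                    // Unrecognized escape: keep it as-is here
                    // (a strict parser like Jackson reports an error).
                    out.append('\\').append(e);
            }
        }
        return out.toString();
    }
}
```

For example, `unescape("\\u7CFB\\t\\u7D71")` (the literal characters `\u7CFB\t\u7D71`) yields 系, a tab, then 統, matching the CPU output above.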

revans2 · Aug 14 '23

I found the code that handles escapes in the JSON parser that Spark uses (Jackson). It is not easy to link to the exact line because GitHub has a display size limit and this file is very large.

@Override
    protected char _decodeEscaped() throws IOException
    {
        if (_inputPtr >= _inputEnd) {
            if (!_loadMore()) {
                _reportInvalidEOF(" in character escape sequence", JsonToken.VALUE_STRING);
            }
        }
        int c = (int) _inputBuffer[_inputPtr++];

        switch (c) {
            // First, ones that are mapped
        case 'b':
            return '\b';
        case 't':
            return '\t';
        case 'n':
            return '\n';
        case 'f':
            return '\f';
        case 'r':
            return '\r';

            // And these are to be returned as they are
        case '"':
        case '/':
        case '\\':
            return (char) c;

        case 'u': // and finally hex-escaped
            break;

        default:
            return _handleUnrecognizedCharacterEscape((char) _decodeCharForError(c));
        }

        // Ok, a hex escape. Need 4 characters
        int value = 0;
        for (int i = 0; i < 4; ++i) {
            if (_inputPtr >= _inputEnd) {
                if (!_loadMore()) {
                    _reportInvalidEOF(" in character escape sequence", JsonToken.VALUE_STRING);
                }
            }
            int ch = _inputBuffer[_inputPtr++];
            int digit = CharTypes.charToHex(ch);
            if (digit < 0) {
                _reportUnexpectedChar(ch & 0xFF, "expected a hex-digit for character escape sequence");
            }
            value = (value << 4) | digit;
        }
        return (char) value;
    }

It is in https://raw.githubusercontent.com/FasterXML/jackson-core/2ef15d37c75c33b68a1c5325f23b20c3b2a128d3/src/main/java/com/fasterxml/jackson/core/json/UTF8StreamJsonParser.java
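The hex-escape loop in `_decodeEscaped` can be illustrated standalone. This is my own reduction, not Jackson code: each digit shifts the running value left by four bits, so for "7CFB" the accumulator goes 0x7, 0x7C, 0x7CF, 0x7CFB.

```java
public class HexLoop {
    public static void main(String[] args) {
        String digits = "7CFB";
        int value = 0;
        for (int i = 0; i < 4; i++) {
            // Character.digit plays the role of CharTypes.charToHex;
            // it returns -1 for a non-hex character.
            int digit = Character.digit(digits.charAt(i), 16);
            value = (value << 4) | digit;
        }
        System.out.println((char) value); // prints 系 (U+7CFB)
    }
}
```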

revans2 · Oct 04 '23

Hi @nvdbaranec, I assume you aren't working on this issue, so I'm assigning it to Chong. Please let me know if that's not correct.

GaryShen2008 · Mar 05 '24

You are correct.

nvdbaranec · Mar 05 '24

Will be fixed by PR: https://github.com/NVIDIA/spark-rapids-jni/pull/1868

res-life · Mar 20 '24