
[BUG] GpuGetJsonObject does not expand escaped characters

Open revans2 opened this issue 2 years ago • 4 comments

Describe the bug
If I have a string with escaped characters in it, the CPU version of GetJsonObject will interpret those escaped characters, but the GPU version will not.

For example, suppose I have the file test.tsv:

label	json
1	{"A": "B"}
2	{"A": "B\nB"}
3	{"A": "\u7CFB\u7D71"}
4	{"A": "\u7CFB\t\u7D71"}
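The file above can be generated with a short program (my addition, not from the issue); the point worth making explicit is that the backslash sequences must be literal two-character sequences in the file, not already-expanded control characters, so that get_json_object is the one doing the decoding:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class MakeTestTsv {
    public static void main(String[] args) throws Exception {
        // "\\n" and "\\u7CFB" are a literal backslash followed by 'n' or
        // 'u7CFB' in the written file, not a newline or the character 系.
        List<String> lines = List.of(
            "label\tjson",
            "1\t{\"A\": \"B\"}",
            "2\t{\"A\": \"B\\nB\"}",
            "3\t{\"A\": \"\\u7CFB\\u7D71\"}",
            "4\t{\"A\": \"\\u7CFB\\t\\u7D71\"}");
        Files.write(Paths.get("test.tsv"), lines);
    }
}
```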

If I run the following commands I get different results on the CPU and the GPU.

scala> spark.conf.set("spark.rapids.sql.enabled", true)
scala> spark.read.options(Map("sep"->"\t", "header"->"true")).csv("./test.tsv").selectExpr("*", "get_json_object(json, '$.A')").show(false)
+-----+-----------------------+--------------------------+
|label|json                   |get_json_object(json, $.A)|
+-----+-----------------------+--------------------------+
|1    |{"A": "B"}             |B                         |
|2    |{"A": "B\nB"}          |B\nB                      |
|3    |{"A": "\u7CFB\u7D71"}  |\u7CFB\u7D71              |
|4    |{"A": "\u7CFB\t\u7D71"}|\u7CFB\t\u7D71            |
+-----+-----------------------+--------------------------+

scala> spark.conf.set("spark.rapids.sql.enabled", false)
scala> spark.read.options(Map("sep"->"\t", "header"->"true")).csv("./test.tsv").selectExpr("*", "get_json_object(json, '$.A')").show(false)
+-----+-----------------------+--------------------------+
|label|json                   |get_json_object(json, $.A)|
+-----+-----------------------+--------------------------+
|1    |{"A": "B"}             |B                         |
|2    |{"A": "B\nB"}          |B\nB                      |
|3    |{"A": "\u7CFB\u7D71"}  |系統                      |
|4    |{"A": "\u7CFB\t\u7D71"}|系\t統                    |
+-----+-----------------------+--------------------------+

But don't let the '\n' and '\t' fool you. That is just `show` cleaning things up for us and re-escaping the results.

scala> spark.conf.set("spark.rapids.sql.enabled", true)
scala> spark.read.options(Map("sep"->"\t", "header"->"true")).csv("./test.tsv").selectExpr("get_json_object(json, '$.A') as parsed").collect.foreach(System.out.println)
[B]
[B\nB]
[\u7CFB\u7D71]
[\u7CFB\t\u7D71]

scala> spark.conf.set("spark.rapids.sql.enabled", false)
scala> spark.read.options(Map("sep"->"\t", "header"->"true")).csv("./test.tsv").selectExpr("get_json_object(json, '$.A') as parsed").collect.foreach(System.out.println)
[B]
[B
B]
[系統]
[系	統]
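As a standalone sanity check (my addition, not from the issue), the \u escapes in the example data decode to exactly the characters the CPU path produces: U+7CFB is 系 and U+7D71 is 統.

```java
public class HexEscapeCheck {
    public static void main(String[] args) {
        // Parse the four hex digits of each escape and cast to a char,
        // which is what a conforming JSON parser does for \uXXXX.
        char a = (char) Integer.parseInt("7CFB", 16);
        char b = (char) Integer.parseInt("7D71", 16);
        System.out.println("" + a + b); // prints 系統
    }
}
```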

I don't think it would be too hard to post-process the returned data, but it might be better to do it in the get_json_object kernel itself; that way we can tell whether the data is in quotes or not. We might also need it to properly match the key, if the key is escaped in some way.
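A minimal sketch of the post-processing idea above, assuming a hypothetical helper named `unescape` (the name and error handling are my assumptions; per the comment, the real fix would likely live in the kernel instead, where quoting context is available):

```java
public class JsonUnescape {
    // Expand JSON escape sequences in an already extracted string,
    // mirroring the JSON spec: \b \t \n \f \r \" \/ \\ and \uXXXX.
    public static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 >= s.length()) {
                out.append(c);
                continue;
            }
            char e = s.charAt(++i);
            switch (e) {
                case 'b': out.append('\b'); break;
                case 't': out.append('\t'); break;
                case 'n': out.append('\n'); break;
                case 'f': out.append('\f'); break;
                case 'r': out.append('\r'); break;
                case '"': case '/': case '\\': out.append(e); break;
                case 'u':
                    // Exactly four hex digits follow; a robust version
                    // would bounds-check and validate them.
                    out.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                    i += 4;
                    break;
                default:
                    // Unrecognized escape: keep it as-is here
                    // (a strict parser like Jackson reports an error).
                    out.append('\\').append(e);
            }
        }
        return out.toString();
    }
}
```

For example, `unescape("\\u7CFB\\t\\u7D71")` (the literal characters `\u7CFB\t\u7D71`) yields 系, a tab, then 統, matching the CPU output above.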

revans2 · Aug 14 '23

I found the code that handles escapes in the JSON parser that Spark uses (Jackson). It is not easy to link to the exact line because GitHub has a display size limit and this file is very large.

@Override
    protected char _decodeEscaped() throws IOException
    {
        if (_inputPtr >= _inputEnd) {
            if (!_loadMore()) {
                _reportInvalidEOF(" in character escape sequence", JsonToken.VALUE_STRING);
            }
        }
        int c = (int) _inputBuffer[_inputPtr++];

        switch (c) {
            // First, ones that are mapped
        case 'b':
            return '\b';
        case 't':
            return '\t';
        case 'n':
            return '\n';
        case 'f':
            return '\f';
        case 'r':
            return '\r';

            // And these are to be returned as they are
        case '"':
        case '/':
        case '\\':
            return (char) c;

        case 'u': // and finally hex-escaped
            break;

        default:
            return _handleUnrecognizedCharacterEscape((char) _decodeCharForError(c));
        }

        // Ok, a hex escape. Need 4 characters
        int value = 0;
        for (int i = 0; i < 4; ++i) {
            if (_inputPtr >= _inputEnd) {
                if (!_loadMore()) {
                    _reportInvalidEOF(" in character escape sequence", JsonToken.VALUE_STRING);
                }
            }
            int ch = _inputBuffer[_inputPtr++];
            int digit = CharTypes.charToHex(ch);
            if (digit < 0) {
                _reportUnexpectedChar(ch & 0xFF, "expected a hex-digit for character escape sequence");
            }
            value = (value << 4) | digit;
        }
        return (char) value;
    }

It is in https://raw.githubusercontent.com/FasterXML/jackson-core/2ef15d37c75c33b68a1c5325f23b20c3b2a128d3/src/main/java/com/fasterxml/jackson/core/json/UTF8StreamJsonParser.java
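The hex-escape loop in `_decodeEscaped` can be illustrated standalone. This is my own reduction, not Jackson code: each digit shifts the running value left by four bits, so for "7CFB" the accumulator goes 0x7, 0x7C, 0x7CF, 0x7CFB.

```java
public class HexLoop {
    public static void main(String[] args) {
        String digits = "7CFB";
        int value = 0;
        for (int i = 0; i < 4; i++) {
            // Character.digit plays the role of CharTypes.charToHex;
            // it returns -1 for a non-hex character.
            int digit = Character.digit(digits.charAt(i), 16);
            value = (value << 4) | digit;
        }
        System.out.println((char) value); // prints 系 (U+7CFB)
    }
}
```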

revans2 · Oct 04 '23

Hi @nvdbaranec, I assume you aren't working on this issue, so I'm assigning it to Chong. Please let me know if that's not correct.

GaryShen2008 · Mar 05 '24

You are correct.

nvdbaranec · Mar 05 '24

Will be fixed by PR: https://github.com/NVIDIA/spark-rapids-jni/pull/1868

res-life · Mar 20 '24