[BUG] JsonToStructs and ScanJson do not normalize numeric output when read as a string

Open revans2 opened this issue 1 year ago • 2 comments

Describe the bug

This is almost identical to https://github.com/NVIDIA/spark-rapids/issues/10218, but it applies to from_json and to reading JSON Lines formatted files.

Numbers like 1.00000 and -0 are not normalized to match what Apache Spark would do.
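To illustrate the kind of normalization meant here, the following is a minimal sketch that parses a numeric token and re-serializes it, using Python's standard `json` module as a stand-in. The helper name `normalize_number_text` is hypothetical, and this is not Spark's or the plugin's implementation; the exact text Apache Spark produces for a given number may differ from what Python prints.

```python
import json


def normalize_number_text(raw: str) -> str:
    """Parse a JSON numeric token and write it back out as text.

    Illustrative only: Python's json module stands in for a JSON parser
    that re-serializes numbers instead of copying the raw input text.
    """
    return json.dumps(json.loads(raw))


# Compare the raw token text with its re-serialized form.
for raw in ["1.00000", "-0", "12"]:
    print(f"{raw!r} -> {normalize_number_text(raw)!r}")
```

The point of the sketch is only that the raw text and the re-serialized text can differ, which is the mismatch this issue describes when the value is read as a string.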

revans2 • Feb 21 '24 21:02

Another odd example of this is +INF and -INF. Even if allowNonNumericNumbers is disabled, +INF and -INF are treated as valid floats and are normalized to "Infinity" and "-Infinity" respectively, and the quotes come out in the string itself. This is also true for unquoted Infinity, -Infinity, and NaN.
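As a rough sketch of the behavior described in this comment (one reading of it, not code taken from Spark or the plugin), the mapping could be written as below. The table `NON_NUMERIC` and the function `non_numeric_as_string` are hypothetical names used only for illustration.

```python
from typing import Optional

# Hypothetical mapping based only on the behavior described in the comment:
# unquoted non-numeric tokens and the names they are normalized to.
NON_NUMERIC = {
    "+INF": "Infinity",
    "-INF": "-Infinity",
    "Infinity": "Infinity",
    "-Infinity": "-Infinity",
    "NaN": "NaN",
}


def non_numeric_as_string(token: str) -> Optional[str]:
    """Return the string form described for an unquoted non-numeric token.

    The result includes literal double quotes, since the comment notes
    that the quotes come out in the string itself. Returns None for any
    token this sketch does not cover.
    """
    name = NON_NUMERIC.get(token)
    if name is None:
        return None
    return f'"{name}"'


for token in ["+INF", "-INF", "NaN"]:
    print(token, "->", non_numeric_as_string(token))
```

Note that the returned value for `+INF` is the ten-character string `"Infinity"` including the quote characters, not the eight-character string `Infinity`.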

revans2 • Feb 22 '24 20:02

Technically, in Spark 4.0 this was reverted (at least for scan, by default):

https://issues.apache.org/jira/browse/SPARK-48148

https://github.com/apache/spark/pull/46408

This functionality was put under the config spark.sql.json.enableExactStringParsing, which is on by default.

It appears to work for scan, but not for get_json_object. It also no longer removes whitespace or normalizes single quotes, which will make it a lot more interesting to try to make this work.

revans2 • Jun 25 '24 16:06