cudf
cudf copied to clipboard
[BUG] JSON white space normalization removes too much for unquoted values
Describe the bug I don't consider this to be too critical, but it is a regression compared to not turning on white space normalization.
An unquoted JSON value is not allowed to have white space in the middle of it, but it looks like the white space normalization is cleaning it up, and converting it from invalid JSON to valid JSON, which makes some of my tests fail.
Steps/Code to reproduce bug
TEST_F(JsonWSNormalizationTest, Unquoted_Bad_Boolean_Spaces)
{
std::string input = R"({"A": tr ue })";
// but it actually is {"A":true}
std::string output = R"({"A":tr ue})";
run_test(input, output);
}
TEST_F(JsonWSNormalizationTest, Unquoted_Bad_Float_Spaces)
{
std::string input = R"({"A": 1 . 0 })";
// but it actually is {"A":1.0}
std::string output = R"({"A":1 . 0})";
run_test(input, output);
}
TEST_F(JsonWSNormalizationTest, Unquoted_Bad_Sci_Spaces)
{
std::string input = R"({"A": 1 E 1 })";
// but it actually is {"A":1E1}
std::string output = R"({"A":1 E 1})";
run_test(input, output);
}
The above tests fail, because there are too many white space values being removed.
@revans2 thank you for sharing this update - it's hilarious! We will need to brainstorm about our options here.
- Let invalid JSON get corrected. @revans2 are you willing to accept this behavior?
- libcudf could modify the normalization FST to avoid dropping the whitespace that occurs within a JSON value. We would need to add more states to do this, so it would cost more time and memory. @shrshi what do you think?
- Spark-RAPIDS could run a validity check and flag invalid rows, but perhaps this would not work for the cases where Spark needs to recover records from invalid lines (e.g.
{"A":1}{"B":1 2 3}). @revans2 what do you think?
Sorry it took me so long to respond to this. I would want the white space removal to follow what CUDF already does for validation with where it ignored white space.
https://www.json.org/json-en.html
Provides some nice flow charts. So for an object white space after a { can be removed. As can any white space before or after a key.
For an array white space after a [ can be removed (but that is probably already covered by value.
For a value any leasing or trailing white space can be removed, but white space within a number, true, false, null, or string cannot be.
Also white space consists of space, linefeed, carriage return or horizontal tab.
Another option would be to remove the white space afterwards, but we would have to have a way to know that a row that was returned came from a nested object and not a regular value, and we would need a way for the DFA to work on a String column instead of a buffer.