cudf icon indicating copy to clipboard operation
cudf copied to clipboard

[BUG] JSON white space normalization removes too much for unquoted values

Open revans2 opened this issue 1 year ago • 2 comments

Describe the bug I don't consider this to be too critical, but it is a regression compared to not turning on white space normalization.

An unquoted JSON value is not allowed to have white space in the middle of it, but it looks like the white space normalization is cleaning it up, and converting it from invalid JSON to valid JSON, which makes some of my tests fail.

Steps/Code to reproduce bug

TEST_F(JsonWSNormalizationTest, Unquoted_Bad_Boolean_Spaces)
{
  std::string input  = R"({"A": tr ue })";
  // but it actually is {"A":true}
  std::string output = R"({"A":tr ue})";
  run_test(input, output);
}

TEST_F(JsonWSNormalizationTest, Unquoted_Bad_Float_Spaces)
{
  std::string input  = R"({"A": 1 . 0 })";
  // but it actually is {"A":1.0}
  std::string output = R"({"A":1 . 0})";
  run_test(input, output);
}

TEST_F(JsonWSNormalizationTest, Unquoted_Bad_Sci_Spaces)
{
  std::string input  = R"({"A": 1 E 1 })";
  // but it actually is {"A":1E1}
  std::string output = R"({"A":1 E 1})";
  run_test(input, output);
}

The above tests fail, because there are too many white space values being removed.

revans2 avatar Mar 12 '24 18:03 revans2

@revans2 thank you for sharing this update - it's hilarious! We will need to brainstorm about our options here.

  • Let invalid JSON get corrected. @revans2 are you willing to accept this behavior?
  • libcudf could modify the normalization FST to avoid dropping the whitespace that occurs within a JSON value. We would need to add more states to do this, so it would cost more time and memory. @shrshi what do you think?
  • Spark-RAPIDS could run a validity check and flag invalid rows, but perhaps this would not work for the cases where Spark needs to recover records from invalid lines (e.g. {"A":1}{"B":1 2 3}). @revans2 what do you think?

GregoryKimball avatar Mar 12 '24 22:03 GregoryKimball

Sorry it took me so long to respond to this. I would want the white space removal to follow what CUDF already does for validation with where it ignored white space.

https://www.json.org/json-en.html

Provides some nice flow charts. So for an object white space after a { can be removed. As can any white space before or after a key.

For an array white space after a [ can be removed (but that is probably already covered by value.

For a value any leasing or trailing white space can be removed, but white space within a number, true, false, null, or string cannot be.

Also white space consists of space, linefeed, carriage return or horizontal tab.

Another option would be to remove the white space afterwards, but we would have to have a way to know that a row that was returned came from a nested object and not a regular value, and we would need a way for the DFA to work on a String column instead of a buffer.

revans2 avatar Jun 25 '24 15:06 revans2