json icon indicating copy to clipboard operation
json copied to clipboard

UTF-8 invalid characters are not always ignored when dumping with error_handler_t::ignore

Open gentooise opened this issue 1 year ago • 9 comments

Description

According to this: https://json.nlohmann.me/api/basic_json/dump/#parameters , when passing error_handler_t::ignore to dump() function, invalid UTF-8 characters should be ignored and copied as-is into the final string.

However, I'm debugging the following minimal code:

    std::string test = "test\334\005";
    nlohmann::json node{};
    node["test"] = test;
    auto test_dump = node.dump(-1, ' ', false, nlohmann::json::error_handler_t::ignore);

and the final test_dump string contains test\005 (byte \334 is gone).

image

Is this expected? Am I missing something?

Reproduction steps

Just try to run/debug the following:

    std::string test = "test\334\005";
    nlohmann::json node{};
    node["test"] = test;
    auto test_dump = node.dump(-1, ' ', false, nlohmann::json::error_handler_t::ignore);

Expected vs. actual results

Actual: test_dump contains test\005 Expected: test_dump contains test\334\005

Minimal code example

std::string test = "test\334\005";
nlohmann::json node{};
node["test"] = test;
auto test_dump = node.dump(-1, ' ', false, nlohmann::json::error_handler_t::ignore);

Error messages

No response

Compiler and operating system

gcc (Alpine 12.2.1_git20220924-r10) 12.2.1 20220924

Library version

3.11.2

Validation

gentooise avatar Dec 17 '24 15:12 gentooise

Yes, there seems to be a logic error. \334 is written to the string buffer, but then overwritten by \u0005. The reason for this is line

bytes = bytes_after_last_accept;

but bytes_after_last_accept was not incremented after reading \334.

nlohmann avatar Dec 17 '24 21:12 nlohmann

Interesting, I'll take a look perhaps.

jordan-hoang avatar Dec 17 '24 22:12 jordan-hoang

I had another look at the issue and this needs some discussion:

  • Right now, the error_handler_t::ignore option has few test cases, but these in connection with the current implementation all ignore invalid UTF-8 sequences and do not follow the documentation "all bytes are copied to the output unchanged".
  • Therefore, I think fixing the semantics of the error_handler_t::ignore function to meet that sentence from the documentation would be very surprising for existing implementations as after the fix, suddenly more bytes appear in the output.
  • Instead, I would propose adding a new enumerator, maybe error_handler_t::copy or error_handler_t::keep, that implements an approach that meets the sentence in the documentation and actually preserves input even in case of UTF-8 errors.

What do you think?

nlohmann avatar Dec 18 '24 08:12 nlohmann

I think adding a new enumerator would be the best in this case then. Fixing the semantics of error_handler_t::ignore would make anyone who currently depends on it's current behavior unhappy. and we can probably update the docs to make it more accurate.

jordan-hoang avatar Dec 18 '24 09:12 jordan-hoang

Yes I also think adding a new enumerator which preserves invalid UTF8 makes sense.

t-b avatar Dec 18 '24 12:12 t-b

Thanks for the input. Any preference on the name? error_handler_t::keep?

nlohmann avatar Dec 19 '24 07:12 nlohmann

Sounds good!

t-b avatar Dec 19 '24 10:12 t-b

Yeah keep sounds good.

jordan-hoang avatar Dec 19 '24 23:12 jordan-hoang

@gentooise @t-b @jordan-hoang Please take a look at #4555.

nlohmann avatar Dec 22 '24 13:12 nlohmann