UTF-8 invalid characters are not always ignored when dumping with error_handler_t::ignore
Description
According to this: https://json.nlohmann.me/api/basic_json/dump/#parameters , when passing error_handler_t::ignore to dump() function, invalid UTF-8 characters should be ignored and copied as-is into the final string.
However, I'm debugging the following minimal code:
std::string test = "test\334\005";
nlohmann::json node{};
node["test"] = test;
auto test_dump = node.dump(-1, ' ', false, nlohmann::json::error_handler_t::ignore);
and the final test_dump string contains test\005 (byte \334 is gone).
Is this expected? Am I missing something?
Reproduction steps
Just try to run/debug the following:
std::string test = "test\334\005";
nlohmann::json node{};
node["test"] = test;
auto test_dump = node.dump(-1, ' ', false, nlohmann::json::error_handler_t::ignore);
Expected vs. actual results
Actual: test_dump contains test\005
Expected: test_dump contains test\334\005
Minimal code example
std::string test = "test\334\005";
nlohmann::json node{};
node["test"] = test;
auto test_dump = node.dump(-1, ' ', false, nlohmann::json::error_handler_t::ignore);
Error messages
No response
Compiler and operating system
gcc (Alpine 12.2.1_git20220924-r10) 12.2.1 20220924
Library version
3.11.2
Validation
- [ ] The bug also occurs if the latest version from the
developbranch is used. - [ ] I can successfully compile and run the unit tests.
Yes, there seems to be a logic error. \334 is written to the string buffer, but then overwritten by \u0005. The reason for this is line
bytes = bytes_after_last_accept;
but bytes_after_last_accept was not incremented after reading \334.
Interesting, I'll take a look perhaps.
I had another look at the issue and this needs some discussion:
- Right now, the
error_handler_t::ignoreoption has few test cases, but these in connection with the current implementation all ignore invalid UTF-8 sequences and do not follow the documentation "all bytes are copied to the output unchanged". - Therefore, I think fixing the semantics of the
error_handler_t::ignorefunction to meet that sentence from the documentation would be very surprising for existing implementations as after the fix, suddenly more bytes appear in the output. - Instead, I would propose adding a new enumerator, maybe
error_handler_t::copyorerror_handler_t::keep, that implements an approach that meets the sentence in the documentation and actually preserves input even in case of UTF-8 errors.
What do you think?
I think adding a new enumerator would be the best in this case then. Fixing the semantics of error_handler_t::ignore would make anyone who currently depends on it's current behavior unhappy. and we can probably update the docs to make it more accurate.
Yes I also think adding a new enumerator which preserves invalid UTF8 makes sense.
Thanks for the input. Any preference on the name? error_handler_t::keep?
Sounds good!
Yeah keep sounds good.
@gentooise @t-b @jordan-hoang Please take a look at #4555.