Encoding of \_ and \N is inconsistent with \x, \u, \U, \L and \P

Open ExpHP opened this issue 5 years ago • 4 comments

Looking at example 5.15 in https://github.com/jbeder/yaml-cpp/blob/master/test/integration/handler_spec_test.cpp:

      OnScalar(_, "!", 0,
               "Fun with \x5C \x22 \x07 \x08 \x1B \x0C \x0A \x0D \x09 \x0B " +
                   std::string("\x00", 1) +
                   " \x20 \xA0 \x85 \xe2\x80\xa8 \xe2\x80\xa9 A A A"));

It appears that yaml-cpp encodes \L and \P in UTF-8, but encodes \_ (non-breaking space) and \N (next line) as single bytes holding the raw Unicode code point value. Further inspection reveals that \xYY and \uYYYY also produce UTF-8, so the behavior of \_ and \N appears to be a bug.

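| YAML string | Code point | Correct UTF-8 encoding | yaml-cpp std::string output |
|-------------|------------|------------------------|-----------------------------|
| `"\xA0"`    | U+00A0     | 0xC2 0xA0              | 0xC2 0xA0                   |
| `"\u00A0"`  | U+00A0     | 0xC2 0xA0              | 0xC2 0xA0                   |
| `"\_"`      | U+00A0     | 0xC2 0xA0              | 0xA0                        |
| `"\N"`      | U+0085     | 0xC2 0x85              | 0x85                        |
| `"\L"`      | U+2028     | 0xE2 0x80 0xA8         | 0xE2 0x80 0xA8              |
| `"\P"`      | U+2029     | 0xE2 0x80 0xA9         | 0xE2 0x80 0xA9              |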
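
The "correct UTF-8 encoding" column can be reproduced with a small encoder. The following is only a sketch: `encode_utf8` is a hypothetical helper written for this illustration and is not part of yaml-cpp's API.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not yaml-cpp code): encode one Unicode code point
// as a UTF-8 byte sequence.
std::string encode_utf8(std::uint32_t cp) {
  std::string out;
  if (cp <= 0x7F) {
    // 1 byte: 0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp <= 0x7FF) {
    // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp <= 0xFFFF) {
    // Surrogate code points are not valid scalar values.
    if (cp >= 0xD800 && cp <= 0xDFFF) {
      throw std::invalid_argument("surrogate code point");
    }
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp <= 0x10FFFF) {
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    throw std::invalid_argument("code point out of range");
  }
  return out;
}
```

Calling `encode_utf8(0xA0)` yields the two bytes 0xC2 0xA0 and `encode_utf8(0x85)` yields 0xC2 0x85, which is what the \_ and \N rows would be expected to contain.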

ExpHP, Jan 18 '20 00:01

Confirming the same behavior with current master, probably because of https://github.com/jbeder/yaml-cpp/blob/master/src/exp.cpp#L119

@jbeder could you at least comment on whether this is really the desired behavior or an undesired side effect/bug?

WilliamTambellini, Apr 21 '22 23:04

@ExpHP are you aware of a good reason for yamlcpp to escape nbsp?

WilliamTambellini, Apr 23 '22 03:04

Hmm? You mean, a reason for the existing feature in yamlcpp that supports \_ escapes? Without knowing much about the project, I would assume this is because yamlcpp describes itself as "matching the YAML 1.2 spec", and the \_ sequence is listed in the spec under Escaped Characters...

If you're asking "why would anyone use it," I don't remember what I was working on two years ago when I ran into this issue.

If you are asking, "a reason to have a 0xC2 byte," then it's because the output is malformed UTF-8 without it...
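
To make the "malformed UTF-8" point concrete, here is a minimal validity check. It is a sketch under the assumption that only structural validity matters (lead and continuation bytes, overlong forms, surrogates, range); `is_valid_utf8` is a hypothetical name and not a yaml-cpp function.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not yaml-cpp code): returns true if `s` is a
// well-formed UTF-8 byte sequence.
bool is_valid_utf8(const std::string& s) {
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    std::uint32_t cp = 0;
    std::uint32_t min = 0;  // smallest code point allowed for this length
    if (lead <= 0x7F) {
      len = 1; cp = lead; min = 0;
    } else if ((lead & 0xE0) == 0xC0) {
      len = 2; cp = lead & 0x1F; min = 0x80;
    } else if ((lead & 0xF0) == 0xE0) {
      len = 3; cp = lead & 0x0F; min = 0x800;
    } else if ((lead & 0xF8) == 0xF0) {
      len = 4; cp = lead & 0x07; min = 0x10000;
    } else {
      // A continuation byte (0x80..0xBF) with no lead byte, or an invalid
      // lead byte. A lone 0xA0 or 0x85 ends up here.
      return false;
    }
    if (i + len > s.size()) return false;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char c = static_cast<unsigned char>(s[i + k]);
      if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }
    // Reject overlong encodings, surrogates, and out-of-range values.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      return false;
    }
    i += len;
  }
  return true;
}
```

With this check, `is_valid_utf8("\xA0")` is false while `is_valid_utf8("\xC2\xA0")` is true: 0xA0 is a continuation byte and cannot stand alone.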

ExpHP, Apr 23 '22 19:04

Worth noting: the examples in the YAML 1.1 spec suggest that \_ is equivalent to \xA0. This is correct, but they mean \xA0 in YAML (not in C++). \xYY means different things in YAML and in C++ string literals, because YAML strings are sequences of code points while C++ strings are sequences of bytes. This confusion is likely the origin of the bug.
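
A small illustration of that difference, as a sketch only. Both function names are hypothetical and exist just for this example; the second one writes out by hand the UTF-8 bytes for U+00A0.

```cpp
#include <string>

// In a C++ narrow string literal, \xA0 is exactly one raw byte (0xA0).
std::string cpp_hex_escape() {
  return "\xA0";
}

// In YAML, "\xA0" names the code point U+00A0. Stored as UTF-8 in a
// std::string, that code point takes two bytes: 0xC2 0xA0.
std::string yaml_hex_escape_as_utf8() {
  return "\xC2\xA0";
}
```

The first string has size 1 and the second has size 2, so copying the spec's \xA0 verbatim into a C++ literal silently drops the 0xC2 lead byte.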

In the 1.2 spec I notice that these places now use \u00A0, perhaps to head off any such possible confusion.

ExpHP, Apr 23 '22 19:04