yaml-cpp
Encoding of \_ and \N is inconsistent with \x, \u, \U, \L and \P
Looking at example 5.15 in https://github.com/jbeder/yaml-cpp/blob/master/test/integration/handler_spec_test.cpp:
OnScalar(_, "!", 0,
"Fun with \x5C \x22 \x07 \x08 \x1B \x0C \x0A \x0D \x09 \x0B " +
std::string("\x00", 1) +
" \x20 \xA0 \x85 \xe2\x80\xa8 \xe2\x80\xa9 A A A"));
It appears that yaml-cpp encodes `\L` and `\P` in UTF-8, but encodes `\_` (non-breaking space) and `\N` (next line) as single-byte values containing the Unicode code point. Further inspection reveals that `\xYY` and `\uYYYY` also use UTF-8, so the behavior of `\_` and `\N` appears to be a bug.
YAML string | code point | correct UTF-8 encoding | yaml-cpp std::string output |
---|---|---|---|
"\xA0" | U+00A0 | 0xC2 0xA0 | 0xC2 0xA0 |
"\u00A0" | U+00A0 | 0xC2 0xA0 | 0xC2 0xA0 |
"\_" | U+00A0 | 0xC2 0xA0 | 0xA0 |
"\N" | U+0085 | 0xC2 0x85 | 0x85 |
"\L" | U+2028 | 0xE2 0x80 0xA8 | 0xE2 0x80 0xA8 |
"\P" | U+2029 | 0xE2 0x80 0xA9 | 0xE2 0x80 0xA9 |
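For reference, the "correct UTF-8 encoding" column can be reproduced with a small encoder. This is only a sketch to make the expected bytes checkable: `encode_utf8` is a hypothetical helper written for this comment, not a function from yaml-cpp, and it does not reject surrogates or out-of-range values.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of yaml-cpp): encode one Unicode
// code point as a UTF-8 byte sequence. No validation is performed.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx (covers U+00A0 and U+0085)
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (covers U+2028 and U+2029)
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this helper, `encode_utf8(0x00A0)` yields the two bytes 0xC2 0xA0 and `encode_utf8(0x0085)` yields 0xC2 0x85, which is what `\_` and `\N` would be expected to produce if they were handled like `\xA0` and `\u00A0`.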
Confirming the same behavior with current master. Probably because of: https://github.com/jbeder/yaml-cpp/blob/master/src/exp.cpp#L119
@jbeder could you at least comment on whether this is really the desired behavior or an undesired collateral effect/bug?
@ExpHP are you aware of a good reason for yamlcpp to escape nbsp?
Hmm? You mean, a reason for the existing feature in yamlcpp that supports `\_` escapes? Without knowing much about the project, I would assume this is because yamlcpp describes itself as "matching the YAML 1.2 spec", and the `\_` sequence is listed in the spec under Escaped Characters...
If you're asking "why would anyone use it," I don't remember what I was working on two years ago when I ran into this issue.
If you are asking, "a reason to have a 0xC2 byte," then it's because the output is malformed UTF-8 without it...
Worth noting: in the examples in the YAML 1.1 spec, it is suggested that `\_` is equivalent to `\xA0`. This is correct, but they mean `\xA0` in YAML (not in C++). `\xYY` means different things in YAML and in C++ string literals, since YAML strings are strings of code points while C++ strings are strings of bytes. This confusion is likely the origin of the bug.
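To make the difference concrete, here is a small sketch. The function names are made up for this comment and nothing here is yaml-cpp code. In a C++ literal, `"\xA0"` is the single byte 0xA0, which is a UTF-8 continuation byte and so cannot stand on its own; the YAML escape `\xA0` denotes the code point U+00A0, whose UTF-8 form is the two bytes 0xC2 0xA0.

```cpp
#include <string>

// What the C++ compiler produces for the literal "\xA0": one raw byte.
inline std::string cpp_literal_xA0() { return "\xA0"; }

// What a YAML parser emitting UTF-8 should produce for the YAML
// escape "\xA0" (code point U+00A0): the two bytes 0xC2 0xA0.
inline std::string yaml_escape_xA0_as_utf8() { return "\xC2\xA0"; }

// True if the byte has the form 10xxxxxx, i.e. it is a UTF-8
// continuation byte and may not start a sequence.
inline bool is_continuation_byte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}
```

A string that consists only of the byte returned by `cpp_literal_xA0()` is therefore malformed UTF-8, which matches the output observed for `\_`.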
In the 1.2 spec I notice that these places now use `\u00A0`, perhaps to head off any such possible confusion.