UTF-8 BOM causes incorrect indentation error for valid YAML
Summary
When a YAML document starts with a UTF-8 byte order mark (BOM), the parser miscounts column positions on the first line. This causes an indentation mismatch error when it parses the continuation keys of indentless block maps.
Minimal Reproduction
The following valid YAML fails to parse when it includes a UTF-8 BOM:
// Valid YAML with a UTF-8 BOM at the start
"\xef\xbb\xbf- a: 1\n  b: 2\n"
Visually (the BOM is invisible):
- a: 1
  b: 2
Expected Behavior
The desired parse result is equivalent to the JSON [{"a": 1, "b": 2}] (this is what ruamel produces), with an event stream something like:
+STR
+DOC
+SEQ
+MAP
=VAL :a
=VAL :1
=VAL :b
=VAL :2
-MAP
-SEQ
-DOC
-STR
Actual Behavior
The parser throws an error:
parse error: incorrect indentation?
2:1: b: 2 (size=6)
^~~~~~ (cols 1-7)
I believe the error happens because the recorded indentation of the a key is too large by the 3 bytes of the BOM (a single character, U+FEFF, encoded as 3 bytes in UTF-8). When the parser then reaches b at what looks like a lower indentation, it pops the current level and finds nothing at the matching one. Presumably one of the _handle_bom functions needs to account for this, for example by excluding the BOM from the column count.
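To make the hypothesis concrete, here is a small standalone sketch. It is not the parser's code: key_column and its skip_bom flag are hypothetical names made up for illustration. It computes the byte column of the first key on a line, either counting or excluding a leading UTF-8 BOM.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, not taken from any parser. Returns the 0-based byte
// column of the first character that is neither a space nor a '-' sequence
// marker. (Simplification: a key that itself starts with '-' is not handled.)
// When skip_bom is true, a leading UTF-8 BOM (EF BB BF) does not count
// toward the column; when false, its 3 bytes are counted as 3 columns.
std::size_t key_column(std::string_view line, bool skip_bom)
{
    constexpr std::string_view bom = "\xef\xbb\xbf";
    std::size_t start = 0; // byte offset where scanning begins
    std::size_t base = 0;  // byte offset treated as column 0
    if (line.substr(0, bom.size()) == bom) {
        start = bom.size();
        if (skip_bom)
            base = bom.size();
    }
    std::size_t pos = start;
    while (pos < line.size() && (line[pos] == ' ' || line[pos] == '-'))
        ++pos;
    return pos - base;
}
```

With the BOM counted, the a key on the first line lands at column 5, while b on the second line is at column 2. With the BOM excluded, both are at column 2, which is what the indentation check needs.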
Thanks for the detailed report. I was able to reproduce. Hold on while I investigate, and hopefully fix.
Not a blocker for my use case; I can fix my YAML doc at the source. "Reject BOMs" would be fine behavior for me too, I think.
Also, I have a suspicion that multibyte encodings aren't really supported but not sure, haven't tried to parse any. Unless this assertion is because the whole doc is transcoded up front for those encodings?
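For anyone else hitting this in the meantime, the "fix it at the source" workaround can be sketched as a pre-processing step that runs before the buffer reaches the parser. This is only a sketch under my own assumptions: has_utf8_bom and strip_utf8_bom are hypothetical helpers, not part of the library, and they only deal with the UTF-8 BOM.

```cpp
#include <string_view>

// Hypothetical helper: true if the buffer starts with the UTF-8 BOM
// (bytes EF BB BF). Could back a "reject BOMs" policy on the caller side.
bool has_utf8_bom(std::string_view text)
{
    constexpr std::string_view bom = "\xef\xbb\xbf";
    return text.substr(0, bom.size()) == bom;
}

// Hypothetical helper: returns a view of the same buffer with a leading
// UTF-8 BOM removed, if present. No bytes are copied, so the result is
// only valid while the original buffer is alive.
std::string_view strip_utf8_bom(std::string_view text)
{
    if (has_utf8_bom(text))
        text.remove_prefix(3); // length of the UTF-8 BOM in bytes
    return text;
}
```

Passing the stripped view to the parser should sidestep the column miscount, since the first line then starts directly at the "- " marker.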