rapidyaml

UTF-8 BOM causes incorrect indentation error for valid YAML

Open MatthewSteel opened this issue 1 month ago • 2 comments

Summary

When a YAML document starts with a UTF-8 byte order mark (BOM), the parser tracks column positions incorrectly on the first line. This causes indentation mismatch errors when parsing continuation keys in indentless block maps.

Minimal Reproduction

The following valid YAML fails to parse when it includes a UTF-8 BOM:

// Valid YAML with UTF-8 BOM at the start
"\xef\xbb\xbf- a: 1\n  b: 2\n"

Visually (the BOM is invisible), the document looks like this:

- a: 1
  b: 2
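To make the invisible prefix concrete, here is a minimal sketch of a helper that drops a leading UTF-8 BOM from the input. The name `strip_utf8_bom` is hypothetical and is not part of rapidyaml. It only shows that the failing input and the plain input differ by exactly three bytes, and it could double as a caller-side workaround.

```cpp
#include <string_view>

// UTF-8 encoding of U+FEFF, the byte order mark.
constexpr std::string_view utf8_bom = "\xEF\xBB\xBF";

// Hypothetical helper (not a rapidyaml function): returns a view of `src`
// without a leading UTF-8 BOM, or `src` unchanged if there is none.
constexpr std::string_view strip_utf8_bom(std::string_view src) {
    if (src.substr(0, utf8_bom.size()) == utf8_bom) {
        src.remove_prefix(utf8_bom.size());
    }
    return src;
}
```

Passing the stripped view to the parser instead of the raw buffer should sidestep the problem, assuming the BOM is the only difference between the two inputs.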

Expected Behavior

The desired parse result is equivalent to the JSON [{"a": 1, "b": 2}] (this is what ruamel produces), with events something like:

+STR
+DOC
+SEQ
+MAP
=VAL :a
=VAL :1
=VAL :b
=VAL :2
-MAP
-SEQ
-DOC
-STR

Actual Behavior

The parser reports an error:

parse error: incorrect indentation?
2:1:   b: 2  (size=6)
     ^~~~~~  (cols 1-7)

I believe the error happens because the recorded indentation of the a key is "too large" by the BOM's 3 bytes, which get counted as columns. When the parser then sees b at a (wrongly) lower indentation, it pops the current level and doesn't find anything at the matching level. Presumably one of the _handle_bom functions needs to do something slightly different.
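The arithmetic behind this hypothesis can be sketched as follows. This is an illustration only, not rapidyaml's code: `key_column` is a hypothetical function that models a parser deriving a key's column from its byte offset within the line, with or without discounting a leading BOM.

```cpp
#include <cstddef>
#include <string_view>

// UTF-8 encoding of U+FEFF, the byte order mark.
constexpr std::string_view utf8_bom = "\xEF\xBB\xBF";

// Hypothetical model (not rapidyaml's implementation): 0-based column at
// which `key` first appears in `line`. When `bom_aware` is true, a leading
// UTF-8 BOM contributes no columns; otherwise its 3 bytes are counted.
constexpr std::size_t key_column(std::string_view line, char key, bool bom_aware) {
    std::size_t skipped = 0;
    if (bom_aware && line.substr(0, utf8_bom.size()) == utf8_bom) {
        skipped = utf8_bom.size();
    }
    const std::size_t pos = line.find(key, skipped);
    return pos == std::string_view::npos ? pos : pos - skipped;
}
```

Under this model, the first line "\xef\xbb\xbf- a: 1" puts a at column 5 when the BOM is counted and at column 2 when it is not, while b on the second line sits at column 2 either way. Only the BOM-aware count makes the two keys line up.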

MatthewSteel, Dec 03 '25 23:12

Thanks for the detailed report. I was able to reproduce. Hold on while I investigate, and hopefully fix.

biojppm, Dec 04 '25 18:12

Not a blocker for my use case; I can fix my YAML doc at the source. "Reject BOMs" would be fine behavior for me too, I think.

Also, I suspect that multibyte encodings aren't really supported, but I'm not sure; I haven't tried to parse any. Unless this assertion is because the whole doc is transcoded up front for those encodings?
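For callers who would rather reject or transcode such input before it reaches the parser, a pre-check along these lines might help. It is a sketch under assumptions: `BomKind` and `detect_bom` are hypothetical names, not rapidyaml APIs, and the byte patterns are the BOM encodings listed in the YAML 1.2 spec's character-encoding section.

```cpp
#include <string_view>

// Hypothetical classification of a leading byte order mark.
enum class BomKind { none, utf8, utf16_le, utf16_be, utf32_le, utf32_be };

inline bool has_prefix(std::string_view s, std::string_view prefix) {
    return s.substr(0, prefix.size()) == prefix;
}

// Hypothetical pre-check (not a rapidyaml function): inspects the first
// bytes of `src` and reports which BOM, if any, it starts with.
inline BomKind detect_bom(std::string_view src) {
    using namespace std::string_view_literals;  // "..."sv keeps embedded NULs
    // The UTF-32 patterns are tested first: the UTF-32LE BOM begins with
    // the same two bytes as the UTF-16LE BOM.
    if (has_prefix(src, "\x00\x00\xFE\xFF"sv)) return BomKind::utf32_be;
    if (has_prefix(src, "\xFF\xFE\x00\x00"sv)) return BomKind::utf32_le;
    if (has_prefix(src, "\xFE\xFF"sv)) return BomKind::utf16_be;
    if (has_prefix(src, "\xFF\xFE"sv)) return BomKind::utf16_le;
    if (has_prefix(src, "\xEF\xBB\xBF"sv)) return BomKind::utf8;
    return BomKind::none;
}
```

A caller could then refuse anything other than `BomKind::none` or `BomKind::utf8`, which matches the "Reject BOMs" idea for encodings that may not be handled.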

MatthewSteel, Dec 04 '25 19:12