Specify whether files may begin with a UTF-8 BOM

Open brandjon opened this issue 4 years ago • 1 comments

The language spec currently says that files are UTF-8 encoded. Following the FR in bazelbuild/bazel#4551, we should decide whether to allow an optional BOM (EF BB BF) at the beginning of the file, which would be stripped before lexing.
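For concreteness, here is a minimal sketch of what "stripped before lexing" could look like at the byte level. The helper name `strip_utf8_bom` is hypothetical and is not taken from any existing implementation; it only illustrates the behavior under discussion.

```python
# Hypothetical helper, for illustration only: drop one leading UTF-8 BOM
# (bytes EF BB BF) from file contents before they reach the lexer.
UTF8_BOM = b"\xef\xbb\xbf"


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a single leading UTF-8 BOM, if one is present."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data
```

Only a BOM at offset 0 is removed; the same three bytes anywhere else in the file are left for the lexer to handle.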

BOMs are unnecessary and not recommended for UTF-8, but prohibiting them is hostile to some Windows text editors. Conversely, allowing them seems harmless.

From what I can tell, standard UTF-8 passes a decoded BOM through unmodified without stripping. But that doesn't stop plenty of decoders from stripping the BOM, e.g. Python's utf-8-sig codec (as distinct from its utf-8 codec).
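As a quick illustration of the difference between the two Python codecs (the sample file contents here are made up):

```python
# Made-up file contents: a UTF-8 BOM followed by one line of code.
data = b"\xef\xbb\xbfx = 1\n"

# The plain utf-8 codec decodes the BOM to U+FEFF and keeps it in the text.
plain = data.decode("utf-8")

# The utf-8-sig codec drops a leading BOM while decoding.
stripped = data.decode("utf-8-sig")

print(repr(plain[0]))   # the first character is U+FEFF
print(repr(stripped))   # the text starts directly with the code
```

With the plain codec, the lexer would see U+FEFF as the first character of the file.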

Feb 16 '21 18:02 brandjon

> Conversely, allowing them seems harmless.

Not harmless: it has a complexity cost, and the lexer is already complicated.

> From what I can tell, standard UTF-8 passes a decoded BOM through unmodified without stripping.

A BOM is just a special kind of space character that our lexer rejects (outside of a string literal). I would prefer that we teach people to fix their misconfigured editors to stop putting unwanted invisible spaces in text files.

May 06 '21 15:05 adonovan