lark
Skip BOM (Byte order mark)
Perhaps it should be opt-in, but most usage would expect that a BOM is ignored.
Please provide more details
When a stream contains a BOM, it is usually a 'utf-8-sig' stream opened as the default 'utf-8'; typically it is a file created on Windows, or a pipe in PowerShell.
It is reasonable to say that this should be dealt with by the caller, either by opening using 'utf-8-sig', or adding the BOM to the grammar so it is ignored, but most will do neither.
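To illustrate the caller-side workaround mentioned above, here is a minimal sketch: the 'utf-8-sig' codec strips a leading BOM if present and otherwise behaves exactly like 'utf-8', so decoding with it is safe either way (the grammar string here is just an example).

```python
# Caller-side workaround: decode with 'utf-8-sig', which strips a
# leading BOM if one is present and acts like plain 'utf-8' otherwise.
text_with_bom = b'\xef\xbb\xbfstart: "a"'.decode('utf-8-sig')
text_without_bom = b'start: "a"'.decode('utf-8-sig')

# Both decode to the same BOM-free string.
assert text_with_bom == text_without_bom == 'start: "a"'
```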
See comparable recurring issue at https://github.com/antlr/antlr4/issues/175
IMO it makes sense to solve this in the library as a sane default. The BOM is only present at the beginning of a stream, so it shouldn't be a performance problem.
Sorry, I'm not sure I understand. Perhaps you can provide an example?
The Unicode standard has what's called a Byte Order Mark in the different encodings (UTF-8, UTF-16, UTF-32). In UTF-16 and UTF-32 the BOM actually does its job and shows in which endianness the stream is encoded. This does not matter for UTF-8, so the only job of the UTF-8 BOM is to show that the stream is encoded in UTF-8 and not UTF-16/32/latin1/... Some editors create this BOM automatically, and other editors just don't render it, so the user is not always sure whether or not it exists.
As example, the start of a file with UTF-8 BOM might look like this (in hex):
EF BB BF 23 23 20 4C ....
The EF BB BF part is the BOM; the rest is the actual content of the file.
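You can reproduce those three bytes directly in Python: encoding with 'utf-8-sig' prepends EF BB BF, and decoding those same bytes as plain 'utf-8' leaves the BOM in the string as U+FEFF.

```python
# Encoding with 'utf-8-sig' prepends the three BOM bytes EF BB BF.
data = 'parse me'.encode('utf-8-sig')
assert data[:3] == b'\xef\xbb\xbf'

# Decoding those bytes as plain 'utf-8' keeps the BOM as U+FEFF
# at the start of the resulting str.
assert data.decode('utf-8') == '\ufeffparse me'
```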
What @jayvdb now proposes is that this BOM gets ignored automatically in lark, even if only optionally. I am currently working on a PR, but I don't know enough about the way configs are done in Lark to know how to implement the API.
I would propose that a new keyword-only argument gets added to the Lark
class, which then gets somehow passed down to lexer.Lexer,
where approximately the following code gets run at the beginning of the stream:
# global constant: in an already-decoded str, the BOM appears as U+FEFF
_BOM = '\ufeff'
# somewhere, probably in TraditionalLexer.lex and ContextualLexer.lex
if self.ignore_bom and stream.startswith(_BOM):
    stream = stream[len(_BOM):]
The primary problem with this is that the indices will not quite match up from a Python point of view, but will match up from a user/editor point of view (or, depending on the exact implementation, the other way around).
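The index mismatch is easy to demonstrate: once the BOM is stripped before lexing, every position reported against the stripped stream is one short of the position in the original decoded string.

```python
# A str as decoded with plain 'utf-8' from a BOM-prefixed file.
original = '\ufeffstart'

# Stripping the BOM (as the proposal above would do) shifts all
# subsequent indices down by one.
stripped = original[1:] if original.startswith('\ufeff') else original

assert original.index('start') == 1  # user sees the token at offset 1
assert stripped.index('start') == 0  # lexer reports offset 0
```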
I see. I guess I just never encountered it.
Regarding the appropriate place, maybe in Lark.parse, considering that it isn't a parsing issue.
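A sketch of what stripping at that level might look like, kept outside the lexer so token indices inside the lexer are unaffected. This is purely illustrative: the helper name strip_bom is made up here and is not part of Lark's API, and it assumes the input is an already-decoded str.

```python
# Hypothetical helper (the name is an assumption, not Lark API):
# strip a single leading BOM from an already-decoded str before
# handing the text to the lexer/parser.
def strip_bom(text: str) -> str:
    return text[1:] if text.startswith('\ufeff') else text

assert strip_bom('\ufeffstart: "a"') == 'start: "a"'
assert strip_bom('start: "a"') == 'start: "a"'  # no-op without a BOM
```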
The Unicode standard has what's called a Byte Order Mark in the different encodings (UTF-8, UTF-16, UTF-32). In UTF-16 and UTF-32 the BOM actually does its job and shows in which endianness the stream is encoded. This does not matter for UTF-8, so the only job of the UTF-8 BOM is to show that the stream is encoded in UTF-8 and not UTF-16/32/latin1/... Some editors create this BOM automatically, and other editors just don't render it, so the user is not always sure whether or not it exists.
I am pretty sure that the Unicode standard says that the BOM should not be used for UTF-8.
https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-with-bom#:~:text=On%20the%20meaning%20of%20the,is%20encoded%20in%20UTF%2D8.
On the meaning of the BOM and UTF-8: The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8.