Skip BOM (Byte order mark)

Open jayvdb opened this issue 4 years ago • 7 comments

Perhaps it should be opt-in, but most users would expect a BOM to be ignored.

jayvdb avatar Jul 11 '19 02:07 jayvdb

Please provide more details

erezsh avatar Jul 11 '19 09:07 erezsh

When a stream contains a BOM, it is usually 'utf-8-sig' content opened with the default 'utf-8' encoding, and typically it is a file created on Windows or output piped in PowerShell.

It is reasonable to say that this should be dealt with by the caller, either by opening using 'utf-8-sig', or adding the BOM to the grammar so it is ignored, but most will do neither.
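For reference, a minimal sketch of the caller-side option. The file content and temporary file are made up for illustration; only the two codec names come from the discussion above.

```python
import os
import tempfile

# Write a small file the way some Windows editors do: UTF-8 with a leading BOM.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
    f.write(b"\xef\xbb\xbfstart: rule\n")
    path = f.name

try:
    # Plain 'utf-8' keeps the BOM as the first character (U+FEFF).
    with open(path, encoding="utf-8") as f:
        with_bom = f.read()

    # 'utf-8-sig' drops a leading BOM if one is present and otherwise
    # behaves like 'utf-8', so it is safe for files without a BOM too.
    with open(path, encoding="utf-8-sig") as f:
        without_bom = f.read()
finally:
    os.remove(path)

print(repr(with_bom))
print(repr(without_bom))
```

Text read the first way starts with an invisible character that a grammar will most likely reject; text read the second way does not.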

See comparable recurring issue at https://github.com/antlr/antlr4/issues/175

IMO it makes sense to solve this in the library as a sane default, and since the BOM is only present at the beginning of a stream, it shouldn't be a performance problem.

jayvdb avatar Jul 11 '19 13:07 jayvdb

Sorry, I'm not sure I understand. Perhaps you can provide an example?

erezsh avatar Aug 10 '19 12:08 erezsh

The Unicode standard has what's called a Byte-Order-Mark (BOM) in the different encodings (UTF-8, UTF-16, UTF-32). In the other encodings (16, 32) the BOM actually does its job and shows in what endianness the stream is encoded. This does not matter for UTF-8, so the only job of the UTF-8 BOM is to show that the stream is encoded in UTF-8 and not UTF-16/32/latin1/... Some editors create this BOM automatically, and other editors just don't render it, so the user is not always sure whether or not it exists.

As an example, the start of a file with a UTF-8 BOM might look like this (in hex):

EF BB BF 23 23 20 4C ....

The EF BB BF part is the BOM, the rest is the actual content of the file.
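A quick way to check this in Python, as a sketch using only the standard library; the sample is just the seven bytes shown above:

```python
import codecs

# The seven example bytes from the hex dump above.
sample = bytes.fromhex("EF BB BF 23 23 20 4C")

# codecs.BOM_UTF8 is the byte sequence EF BB BF.
has_bom = sample.startswith(codecs.BOM_UTF8)

# Decoded as plain UTF-8, the three BOM bytes become ONE character, U+FEFF.
decoded = sample.decode("utf-8")

# Decoded as 'utf-8-sig', the BOM is consumed and only the content remains.
stripped = sample.decode("utf-8-sig")

print(has_bom, repr(decoded), repr(stripped))
```

Note that after decoding, the BOM is the single code point U+FEFF, not the three characters '\xEF\xBB\xBF'.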

What @jayvdb now proposes is that this BOM gets ignored automatically in lark, even if only optionally. I am currently working on a PR, but I don't know enough about the way configs are done in Lark to know how to implement the API.

I would propose that a new keyword-only argument gets added to the Lark class, which then somehow gets passed down to lexer.Lexer, where approximately the following code runs at the beginning of the stream:

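# global constant; assuming the stream is an already-decoded str, the BOM is
# the single character U+FEFF (the bytes EF BB BF only exist before decoding)
_BOM = '\ufeff'

# somewhere, probably in TraditionalLexer.lex and ContextualLexer.lex
        if self.ignore_bom and stream.startswith(_BOM):
            stream = stream[len(_BOM):]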

The primary problem with this is that the indices will not quite match up from a Python point of view, but will match up from a user/editor point of view (or, depending on the exact implementation, the other way around).
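To make the index mismatch concrete, here is a small sketch. `strip_bom` is a hypothetical helper, not an existing Lark function, and the input string is made up:

```python
BOM = "\ufeff"  # the BOM as it appears in a decoded str

def strip_bom(text):
    """Return (text_without_bom, offset); offset is 1 if a BOM was removed, else 0."""
    if text.startswith(BOM):
        return text[len(BOM):], len(BOM)
    return text, 0

raw = "\ufeffkey = value"          # what Python sees after decoding as 'utf-8'
clean, offset = strip_bom(raw)

pos_in_clean = clean.index("value")  # position as an editor would count it
pos_in_raw = pos_in_clean + offset   # position in the original Python string

print(pos_in_clean, pos_in_raw, raw.index("value"))
```

Whichever position gets reported, it is off by one for one of the two audiences, unless the offset is carried along and added back where needed.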

MegaIng avatar Aug 10 '19 22:08 MegaIng

I see. I guess I just never encountered it.

Regarding the appropriate place, maybe in Lark.parse, considering that it isn't a parsing issue.
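Until something like that exists, the same effect can be had outside the library. This is only a sketch: `parse_ignoring_bom` and its `ignore_bom` flag are hypothetical names, and `fake_parse` is a stand-in so the example runs without Lark installed.

```python
def parse_ignoring_bom(parse, text, ignore_bom=True):
    """Call parse(text), first dropping a leading U+FEFF if ignore_bom is set."""
    if ignore_bom and text.startswith("\ufeff"):
        text = text[1:]
    return parse(text)

def fake_parse(text):
    # Stand-in for a real parser: just splits on whitespace.
    return text.split()

text = "\ufeffa b"
print(parse_ignoring_bom(fake_parse, text))                    # BOM dropped
print(parse_ignoring_bom(fake_parse, text, ignore_bom=False))  # BOM kept
```

With a real parser this would be called as `parse_ignoring_bom(parser.parse, text)`, where `parser` is a `Lark` instance.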

erezsh avatar Aug 11 '19 07:08 erezsh

The Unicode standard has what's called a Byte-Order-Mark (BOM) in the different encodings (UTF-8, UTF-16, UTF-32). In the other encodings (16, 32) the BOM actually does its job and shows in what endianness the stream is encoded. This does not matter for UTF-8, so the only job of the UTF-8 BOM is to show that the stream is encoded in UTF-8 and not UTF-16/32/latin1/... Some editors create this BOM automatically, and other editors just don't render it, so the user is not always sure whether or not it exists.

I am pretty sure that the Unicode standard says that the BOM should not be used for UTF-8.

julie777 avatar Apr 04 '22 17:04 julie777

https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-with-bom#:~:text=On%20the%20meaning%20of%20the,is%20encoded%20in%20UTF%2D8.

On the meaning of the BOM and UTF-8: The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8.

philip-h-dye avatar May 18 '23 18:05 philip-h-dye