`charset` setting should be reconsidered: using `utf-8-bom` instead of `utf-8`

Open julealgon opened this issue 1 year ago • 1 comment

Describe the feature

UTF-8 with BOM is the standard used by Visual Studio. There are comments around advocating for its use, such as this one: https://github.com/dotnet/docs/pull/19794#issuecomment-668619665

I'm aware that some sources recommend against the byte order mark, but I'd like to open a discussion about what the best default would be, considering that this project is supposed to host good defaults for folks to use.

https://github.com/editorconfig/editorconfig/issues/297

Interestingly enough, I've also noticed that when creating a brand new project in VS, and then asking to create a new .editorconfig file, the charset option is not set. I don't understand why that is the case.

Rules

https://spec.editorconfig.org/index.html#:~:text=are%20case%20insensitive.-,charset,-Set%20to%20latin1

Aug 23 '24 14:08 julealgon

The comment in https://github.com/dotnet/docs/pull/19794#issuecomment-668619665, which describes some editors improperly loading UTF-8 files in the default local encoding (often Latin-1 for English speakers), was written in 2020. Now, in 2024, UTF-8 is used in over 98% of all web pages worldwide:

https://w3techs.com/technologies/cross/character_encoding/ranking

These days, any software that encounters a file without a byte-order mark should be assuming that it's UTF-8, and only falling back to try other encodings if parsing the file as UTF-8 encounters invalid sequences. Any software that does not assume UTF-8 by default is buggy and should be fixed. I'd be curious to know how many editors are assuming local encodings rather than UTF-8 in 2024.
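To make the "assume UTF-8, fall back only on invalid sequences" behavior concrete, here is a minimal Python sketch. The function name `decode_text` and the choice of Latin-1 as the fallback are my own assumptions for illustration, not taken from any particular editor:

```python
import codecs
from typing import Tuple


def decode_text(data: bytes, fallback: str = "latin-1") -> Tuple[str, str]:
    """Decode bytes as UTF-8 first; use a legacy encoding only if that fails.

    Returns (text, encoding_used). `fallback` is a hypothetical default;
    real software would pick it from the user's locale or settings.
    """
    try:
        # "utf-8-sig" decodes strict UTF-8 and drops a leading BOM if present.
        text = data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Invalid UTF-8 sequences: only now try the legacy encoding.
        return data.decode(fallback), fallback
    used = "utf-8-sig" if data.startswith(codecs.BOM_UTF8) else "utf-8"
    return text, used
```

For example, `decode_text("café".encode("utf-8"))` is handled as UTF-8 with no BOM needed, while the Latin-1 bytes `b"caf\xe9"` are not valid UTF-8 and therefore take the fallback path.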

The BOM is not necessary in UTF-8 (it has only one byte order, whether on big-endian or little-endian machines) and can cause problems (see here and here, and I've personally encountered problems with Bash scripts where the invisible U+FEFF before the #!/bin/bash caused the #! to not be recognized by the shell). Part of the genius of UTF-8 is how its first 128 characters are encoded identically to 7-bit ASCII, and therefore old software that expects ASCII would run correctly on UTF-8 encoded files (if they do indeed contain only characters from the ASCII set) without change. Adding the 0xEF 0xBB 0xBF sequence at the start of the file breaks that backwards compatibility, and gets rid of one of UTF-8's biggest advantages.
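Here is a small Python sketch of the shebang problem and of the ASCII compatibility point. The helper names are hypothetical, and the two-byte `#!` comparison is only an approximation of what a program loader checks:

```python
import codecs

script = "#!/bin/bash\necho hello\n"

plain = script.encode("utf-8")         # no BOM
with_bom = script.encode("utf-8-sig")  # same text, prefixed with EF BB BF


def starts_with_shebang(data: bytes) -> bool:
    """Approximate the loader's check: the first two bytes must be '#!'."""
    return data[:2] == b"#!"


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM if there is one; otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# ASCII-only text encodes to identical bytes in ASCII and in BOM-less UTF-8.
same_as_ascii = plain == script.encode("ascii")
```

With these definitions, `starts_with_shebang(plain)` is true and `starts_with_shebang(with_bom)` is false, because the file now begins with `0xEF 0xBB 0xBF` rather than `#!`. Running the bytes through `strip_utf8_bom` restores the byte-for-byte ASCII-compatible form.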

Because UTF-8 is finally pretty much everywhere, and can now safely be the default parsing option, it's finally time to kill off the BOM and get rid of the problems it causes.

Sep 25 '24 03:09 rmunn