big-list-of-naughty-strings UTF-8 byte order mark mid-file

I apologize if this doesn't count, but one of my favorite/least favorite bugs is a stray UTF-8 byte order mark (BOM) in the middle of a file. This is often caused by naively concatenating files together.
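
To make the failure mode concrete, here is a minimal Python sketch of how concatenation can leave a BOM mid-file, plus one possible way to detect and avoid it. The helper names (`naive_concat`, `bom_aware_concat`, `find_mid_file_boms`) are made up for illustration and are not from any library; only `codecs.BOM_UTF8` is a real standard-library constant.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def naive_concat(parts):
    """Join byte strings as-is, the way `cat a.txt b.txt` would."""
    return b"".join(parts)


def bom_aware_concat(parts):
    """Hypothetical helper: keep a leading BOM only on the first part."""
    out = []
    for i, part in enumerate(parts):
        if i > 0 and part.startswith(BOM):
            part = part[len(BOM):]  # drop the signature of later parts
        out.append(part)
    return b"".join(out)


def find_mid_file_boms(data):
    """Hypothetical helper: byte offsets of every BOM after offset 0."""
    offsets = []
    pos = data.find(BOM, 1)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(BOM, pos + len(BOM))
    return offsets


# Two files that were each saved "as UTF-8 with BOM".
file_a = BOM + b"a\n"
file_b = BOM + b"b\n"

joined = naive_concat([file_a, file_b])
print(find_mid_file_boms(joined))                              # [5]
print(find_mid_file_boms(bom_aware_concat([file_a, file_b])))  # []
```

Whether the first part's BOM should survive is a policy choice; this sketch keeps it only so the output matches what the first input looked like.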

Aug 18 '15 07:08 AndrewKL

The presence of a BOM anywhere in UTF-8 is a bug.

Aug 19 '15 19:08 stuartpb

@stuartpb To quote the link you just cited (emphasis mine):

The Unicode Standard permits the BOM in UTF-8,[2] but does not require or recommend its use.[3] Byte order has no meaning in UTF-8,[4] so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8. The BOM may also appear when UTF-8 data is converted from other encodings that use a BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.[5][6]

Aug 19 '15 19:08 ssokolow

@ssokolow Sure, I'm not saying it shouldn't be included in the corpus - just that it shouldn't be treated as a bug that occurs "by naively concatenating files together" (or that it shouldn't be stripped by anything that isn't expected to round-trip its exact input between encodings). Its root cause is software that adds a BOM to UTF-8, at any point. See the text immediately following your excerpt:

The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."[7]

Aug 19 '15 20:08 stuartpb

@stuartpb However, that does not apply to many web applications because they accept "plaintext" generated by Microsoft applications or Google Docs.

Even so, Microsoft compilers[9] and interpreters, and many pieces of software on Microsoft Windows such as Notepad will not correctly read UTF-8 text unless it has only ASCII characters or it starts with the BOM, and will add a BOM to the start when saving text as UTF-8. Google Docs will add a BOM when a Microsoft Word document is downloaded as a plain text file.

Aug 19 '15 20:08 ssokolow

@ssokolow Sure - we're talking about two different parts of the RFC, which apply to two different scenarios. If your data round-trips encodings without metadata, stripping the BOM is a bug; if it doesn't, keeping it is a bug.
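
As a rough sketch of those two scenarios in Python: the standard `utf-8-sig` codec drops a single leading BOM on decode and writes one on encode. The function names below are hypothetical and only illustrate the two policies; real code would choose one based on whether the encoding is known out of band.

```python
import codecs


def decode_strip(raw):
    """Encoding is known out of band: treat a leading BOM as noise."""
    return raw.decode("utf-8-sig")


def decode_round_trip(raw):
    """Remember whether a BOM was present so it can be restored later."""
    had_bom = raw.startswith(codecs.BOM_UTF8)
    return had_bom, raw.decode("utf-8-sig")


def encode_round_trip(had_bom, text):
    """Write the BOM back only if the input had one."""
    return text.encode("utf-8-sig" if had_bom else "utf-8")


with_bom = codecs.BOM_UTF8 + "héllo".encode("utf-8")
without_bom = "héllo".encode("utf-8")

print(decode_strip(with_bom) == "héllo")                                # True
print(encode_round_trip(*decode_round_trip(with_bom)) == with_bom)      # True
print(encode_round_trip(*decode_round_trip(without_bom)) == without_bom)  # True
```

Note that `utf-8-sig` only looks at the start of the data, so a BOM in the middle of the file still decodes to U+FEFF and has to be handled separately.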

Aug 19 '15 20:08 stuartpb

@stuartpb My point was that "The presence of a BOM anywhere in UTF-8 is a bug." sounded like you might have been misinterpreting how this case should be tested. When you clarified, it still felt like you might have been misinterpreting something, just in how the standard applies to the applications in question rather than in what the standard says. I no longer think that.

Aug 19 '15 20:08 ssokolow