UTF-8/BOM issues: load fails when a UTF-8 source code file starts with a byte order mark (BOM)

Open dwolter opened this issue 3 years ago • 2 comments

When loading or reading from a file saved in UTF-8 that starts with the UTF-8 BOM (0xEF 0xBB 0xBF), the reader first reads a symbol whose name is a single character with char code 65279, i.e., 0xFEFF (the UTF-16 (BE) BOM). When loading, an unbound-variable error is signalled. This behaviour is both unexpected (a UTF-16 BOM is read instead of a UTF-8 BOM) and problematic (load error).

Possible fix: the manual (chapter 4.5) already notes that "A byte order mark from a UTF-8 encoded input stream is not treated specially and just appears as a normal character from the input stream. It is probably a good idea to skip over this character."

dwolter avatar Apr 19 '22 14:04 dwolter
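
The manual's advice to "skip over this character" can be sketched in portable Common Lisp as below. This is only an illustration: `skip-bom` and `load-skipping-bom` are hypothetical helper names, not part of CCL, and the sketch assumes the implementation accepts `:utf-8` as an external format and that `load` is given an already opened character stream.

```lisp
;; Hypothetical helpers, not part of CCL.
(defun skip-bom (stream)
  "Consume one leading U+FEFF from STREAM if present; return STREAM."
  (let ((ch (peek-char nil stream nil nil)))   ; NIL at end of file
    (when (and ch (= (char-code ch) #xFEFF))
      (read-char stream)))
  stream)

(defun load-skipping-bom (path &key (external-format :utf-8))
  "Open PATH, drop a leading BOM if there is one, then LOAD from the stream."
  (with-open-file (in path :external-format external-format)
    (skip-bom in)
    (load in)))
```

Because `load` receives a stream rather than a pathname here, variables such as `*load-pathname*` may not be set the way they are for a normal file load.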

This occurs even if we specify the correct external-format (:utf-8).

cl-user> ! od -t x1 ~/src-utf-8.lisp
0000000    ef  bb  bf  28  64  65  66  76  61  72  20  2a  66  6f  6f  2a
0000020    20  27  68  69  29  0a  0a                                    
0000027
; No value
cl-user> (load #P"~/src-utf-8.lisp")
; Evaluation aborted on #<unbound-variable #x30200706355D>.
cl-user> (load #P"~/src-utf-8.lisp" :external-format :utf-8)
; Evaluation aborted on #<unbound-variable #x30200708060D>.
cl-user> 

IMO, this should be handled at the level of encoding/decoding, i.e., by the external-format. Understandably, that opens a small can of worms (what to do with BOMs in the middle of files? what about concatenations? etc.), which would suggest a feature request/improvement.

You can deal with it as suggested in the manual, by having:

(defun ignore-bom (stream ch) (declare (ignore stream ch)) (values)) ; return no values, so the reader skips the character instead of reading NIL
(set-macro-character #\U+FEFF 'ignore-bom)
(set-macro-character #\U+FFFE 'ignore-bom)

in your rc file.

informatimago avatar Apr 19 '22 16:04 informatimago
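
A variant of the workaround above that leaves the global readtable alone might look like the following sketch. The names `make-bom-tolerant-readtable` and `load-with-bom-tolerance` are hypothetical, and the macro function returns `(values)` so that the reader produces no object for the BOM rather than yielding NIL. Only U+FEFF is handled here.

```lisp
;; Hypothetical helpers, not part of CCL.
(defun ignore-bom (stream char)
  "Reader macro function that discards the character it was called for."
  (declare (ignore stream char))
  (values))                       ; zero values: the reader just continues

(defun make-bom-tolerant-readtable ()
  "Return a copy of the current readtable in which U+FEFF is skipped."
  (let ((rt (copy-readtable)))
    (set-macro-character (code-char #xFEFF) #'ignore-bom nil rt)
    rt))

(defun load-with-bom-tolerance (path &rest load-args)
  "LOAD PATH with a readtable that skips U+FEFF; other readers are unaffected."
  (let ((*readtable* (make-bom-tolerant-readtable)))
    (apply #'load path load-args)))
```

Since the character is installed as a terminating macro character, a U+FEFF in the middle of a token would split that token, which is one of the "can of worms" cases mentioned above.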

Thanks for the prompt reply and for pointing me to the workaround from the manual that I missed! It would be nice to see the issue resolved or warnings printed (for naive users like me who thought that with UTF we had overcome text encoding issues).

dwolter avatar Apr 19 '22 16:04 dwolter

UTF-8 always has the same byte order, so starting UTF-8 data with a BOM (byte order mark) is not terribly useful.

One major selling point of UTF-8 is that it is ASCII-compatible. Programs expecting ASCII will certainly not know what to do with a BOM.

I think the phrase "It is probably a good idea to skip over this character" is a good example of the manual trying to be humorous in a wry way. :-)

xrme avatar Aug 11 '23 06:08 xrme
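
For reference, the relationship between the bytes in the `od` dump and char code 65279 can be checked with a little arithmetic: EF BB BF is simply the three-octet UTF-8 encoding of the code point U+FEFF, and that octet sequence is the same on every platform. The function name `utf-8-encode-3` below is made up for this sketch and covers only code points that need three octets.

```lisp
;; Hypothetical helper for illustration only.
(defun utf-8-encode-3 (code)
  "Encode CODE (in #x800..#xFFFF) as a list of three UTF-8 octets."
  (assert (<= #x800 code #xFFFF))
  (list (logior #xE0 (ldb (byte 4 12) code))   ; 1110xxxx: top 4 bits
        (logior #x80 (ldb (byte 6 6) code))    ; 10xxxxxx: middle 6 bits
        (logior #x80 (ldb (byte 6 0) code))))  ; 10xxxxxx: low 6 bits

(utf-8-encode-3 #xFEFF)   ; => (239 187 191), i.e. #xEF #xBB #xBF
```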