swipl-devel
`user_input` on Windows can't be set to `encoding(octet)`
There seems to be no way for SWI-Prolog to read exactly one byte from user_input on Windows. This is because user_input's encoding is locked to wchar_t, which on Windows is UTF-16/UCS-2.
1 ?- get_byte(X).
ERROR: No permission to read bytes from TEXT stream `current_input'
ERROR: In:
ERROR: [12] get_byte(_1032)
ERROR: [11] toplevel_call(user:user: ...) at c:/program files/swipl/boot/toplevel.pl:1315
The official advice is to switch the stream to encoding(octet) and then read codes. But this doesn't work either[^1]:
2 ?- set_stream(user_input, encoding(octet)).
ERROR: No permission to encoding stream `user_input'
ERROR: In:
ERROR: [12] set_stream(user_input,encoding(octet))
ERROR: [11] toplevel_call(user:user: ...) at c:/program files/swipl/boot/toplevel.pl:1315
The same happens for encoding(utf8): user_input cannot change its encoding property on Windows.
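For anyone trying to reproduce this, a minimal sketch along the following lines reports what a stream says about itself and whether an encoding change is accepted. The predicate names `stream_info/2` and `try_set_encoding/3` are made up for illustration; only `stream_property/2`, `set_stream/2` and `catch/3` are SWI-Prolog built-ins.

```prolog
%!  stream_info(+Stream, -Info) is det.
%   Hypothetical helper: collect the encoding, type and tty properties
%   that Stream reports (properties that are not reported are skipped).
stream_info(Stream, Info) :-
    findall(P,
            ( member(P, [encoding(_), type(_), tty(_)]),
              stream_property(Stream, P)
            ),
            Info).

%!  try_set_encoding(+Stream, +Enc, -Result) is det.
%   Hypothetical helper: Result is `ok` if set_stream/2 accepts the
%   encoding, or error(Formal) if it raises an ISO-style error.
try_set_encoding(Stream, Enc, Result) :-
    catch(( set_stream(Stream, encoding(Enc)),
            Result = ok
          ),
          error(Formal, _),
          Result = error(Formal)).
```

Calling `stream_info(user_input, Info)` and `try_set_encoding(user_input, octet, Result)` on both Windows and another platform should make the difference in behaviour easy to compare without the toplevel error noise.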
Consequences
This means there is no way to reliably read a UTF-8 encoded file into a SWI-Prolog process via user_input. Any multi-byte UTF-8 sequence sent will be misinterpreted as UTF-16/UCS-2. For example:
$ cat broken_heart.txt
💔
$ od -t x1 -t u1 broken_heart.txt # A UTF-8 encoded file
0000000  f0  9f  92  94  0d  0a
        240 159 146 148  13  10
0000006
% read_codes.pl
:- initialization(main, main).
main :-
    read_stream_to_codes(user_input, Codes),
    format('~k "~s"~n', [Codes, Codes]).
$ cat broken_heart.txt | swipl --quiet read_codes.pl
[240,376,8217,8221,10] "💔
"
(Shell examples were run from the Git Bash shell for Windows.)
Note
In the above example, the codes list [240,376,8217,8221,10] is also being printed incorrectly. For instance, 8217 is U+2019 (Right Single Quotation Mark), not Æ. Neither of these characters appears in the input file, so the text is being garbled. This is probably a combination of my terminal settings (Windows Terminal 1.22.11141.0) and the fact that user_output is also locked to wchar_t. I haven't looked into this yet.
Is this an Issue?
So why not instead read in from a socket or from a file handle?
In some cases you do actually need to read from stdin. For example, one could not correctly implement the Bash command read in SWI-Prolog on Windows if the input is a non-UTF-16 source.
The place I've run into this is in trying to get jamesnvc/lsp_server working on Windows. At least for Neovim integration, an LSP server is normally expected to read UTF-8-encoded JSON-RPC requests through stdin. The project is having to ask Windows users to use a socket-based client instead (see this issue), which brings its own problems.
Fixes
The easiest way to solve UTF-8→UTF-16 encoding issues on stdin would be to allow encoding(octet) for stdin and let the user do the conversion on their own via utf8_codes//1.
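As a rough sketch of what that could look like from the user's side, assuming `set_stream(user_input, encoding(octet))` were permitted on Windows (today it raises the permission error shown earlier): `bytes_to_codes/2` and `main/0` are hypothetical names, while `utf8_codes//1` comes from `library(utf8)` and `read_stream_to_codes/2` from `library(readutil)`.

```prolog
:- use_module(library(utf8)).      % utf8_codes//1
:- use_module(library(readutil)).  % read_stream_to_codes/2

%!  bytes_to_codes(+Bytes, -Codes) is semidet.
%   Hypothetical helper: decode a list of UTF-8 bytes into a list of
%   Unicode code points.
bytes_to_codes(Bytes, Codes) :-
    phrase(utf8_codes(Codes), Bytes).

%   Hypothetical entry point. The set_stream/2 call is the step that
%   currently raises a permission error on Windows.
main :-
    set_stream(user_input, encoding(octet)),
    read_stream_to_codes(user_input, Bytes),   % one code per input byte
    bytes_to_codes(Bytes, Codes),
    format("~w~n", [Codes]).
```

With the six bytes of broken_heart.txt, `bytes_to_codes([0xF0,0x9F,0x92,0x94,0x0D,0x0A], Codes)` should give `Codes = [128148,13,10]`, that is U+1F494 followed by CR LF.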
Eventually, of course, it would be great to support encoding(utf8) on stdin as well, and to solve the same issue for stdout.
[^1]: Prior discussion: https://swi-prolog.discourse.group/t/no-permission-to-encoding-a-user-stream/5368/11