
Cannot access file and directory _names_ that are invalid UTF-8

Open cosmos72 opened this issue 9 months ago • 11 comments

The functions

(current-directory) (cd) (directory-list)
(file-exists?) (file-regular?) (file-directory?) (file-symbolic-link?) (file-access-time) (file-change-time)
(file-modification-time) (mkdir) (delete-file) (delete-directory) (rename-file) (chmod) (get-mode)

described in Section 9.16. File System Interface https://cisco.github.io/ChezScheme/csug10.0/io.html#./io:h16 operate on file names and directory names represented as Scheme strings.

When actually accessing the file system on POSIX systems, such strings are automatically converted from/to UTF-8.

This has the side effect that existing files and directories whose names are invalid UTF-8 cannot be accessed by the functions listed above.

Example:

$ mkdir example
$ cd example
$ touch $(printf 'AAA\xffzzz')
$ ls -l AAA*
-rw-r--r-- 1 user users 0 Feb 20 10:15 'AAA'$'\377''zzz'
$ chezscheme
> (define x (car (sort string<? (directory-list "."))))

> x
"AAA�zzz"

> (char->integer (string-ref x 3))
65533

> (delete-file x)
#f

> (delete-file x #t)
Exception in delete-file: failed for AAA�zzz: no such file or directory
Type (debug) to enter the debugger.

The problem is that byte #xff is not a valid UTF-8 sequence, so (directory-list) converts it to the replacement character U+FFFD (#xFFFD = 65533), as per UTF-8 error-handling rules.

As a consequence, the file created with shell command touch $(printf 'AAA\xffzzz') and all other files or directories whose names are invalid UTF-8 cannot be accessed with Chez Scheme functions listed above.
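The loss of information can be reproduced outside Chez Scheme. The following is a minimal sketch in Python (chosen only because PEP 383, cited below, is Python's definition of UTF-8b); it is not Chez Scheme's actual code path, and the variable names are illustrative. It decodes the raw file name with U+FFFD replacement and shows that encoding the result back does not yield the original bytes.

```python
# Raw bytes of the file name created by: touch $(printf 'AAA\xffzzz')
raw_name = b"AAA\xffzzz"

# UTF-8 decoding with replacement, mirroring the transcript above:
# the invalid byte 0xFF becomes U+FFFD (decimal 65533).
decoded = raw_name.decode("utf-8", errors="replace")
replacement_code = ord(decoded[3])

# Encoding back produces the UTF-8 bytes of U+FFFD (EF BF BD), not 0xFF,
# so the resulting name no longer matches the file on disk.
round_tripped = decoded.encode("utf-8")
is_lossless = round_tripped == raw_name

print(replacement_code, round_tripped, is_lossless)
```

Because the decoded string maps to a different byte sequence, any later system call made with it looks up a name that does not exist, which matches the "no such file or directory" error above.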

Since POSIX file system specifications do not require file or directory names to be valid UTF-8, this leaves the above Chez Scheme functions in the uncomfortable position of failing on some valid POSIX file and directory names.

A solution could be to convert file and directory names from/to UTF-8b (note the 'b') instead of UTF-8, because UTF-8b is an extension of UTF-8 designed exactly to losslessly convert any byte or byte sequence.

For a definition of UTF-8b, see https://peps.python.org/pep-0383 and https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html
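As a rough illustration of the UTF-8b idea, here is a short Python sketch using the `surrogateescape` error handler that PEP 383 defines. It only demonstrates the encoding scheme; how Chez Scheme would implement it is not shown here. Each undecodable byte in the range 0x80..0xFF is mapped to the lone surrogate U+DC00 + byte, and mapped back on encoding.

```python
raw_name = b"AAA\xffzzz"

# Decode as UTF-8b: the invalid byte 0xFF becomes the lone surrogate U+DCFF.
name = raw_name.decode("utf-8", errors="surrogateescape")
escaped_code = ord(name[3])

# Encode as UTF-8b: U+DCFF turns back into the single byte 0xFF.
restored = name.encode("utf-8", errors="surrogateescape")

# Names that are already valid UTF-8 are unaffected by the handler.
valid = "naïve.txt".encode("utf-8")
valid_round_trip = valid.decode("utf-8", errors="surrogateescape").encode(
    "utf-8", errors="surrogateescape"
)

print(hex(escaped_code), restored == raw_name, valid_round_trip == valid)
```

The round trip is lossless for arbitrary byte sequences, which is the property the file system functions would need.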

cosmos72 avatar Feb 20 '25 09:02 cosmos72

[ADDENDUM]

In case it's not explained clearly enough, I am talking about the names of files and directories in a POSIX file system - not their content.

cosmos72 avatar Feb 20 '25 09:02 cosmos72

I don't think UTF-8b as such will help with round-tripping through Chez Scheme strings, as they are UTF-32 internally. But there is of course plenty of room in them to represent invalid UTF-8 bytes. I wonder what happens if someone tries to actually use a string that contains bad bytes, rather than just round-trip it? As far as I can see, the minimum interface is a predicate that tells whether a string is improper, plus a way to decode it to a bytevector; most other string functions would just throw if you try to use them on such a string.

Wouldn't it be more explicit if these file functions alternatively took or returned a bytevector?
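The minimal interface described above (a predicate for "improper" strings plus a conversion to bytes) could look roughly like the following Python sketch. The function names `has_escaped_bytes` and `to_bytes` are hypothetical, and the sketch assumes the UTF-8b convention from PEP 383, where escaped bytes appear as code points U+DC80..U+DCFF.

```python
def has_escaped_bytes(name: str) -> bool:
    """Return True if the string carries bytes escaped as U+DC80..U+DCFF."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)


def to_bytes(name: str) -> bytes:
    """Recover the original byte sequence of a UTF-8b decoded string."""
    return name.encode("utf-8", errors="surrogateescape")


improper = b"AAA\xffzzz".decode("utf-8", errors="surrogateescape")
proper = "AAAzzz"

# A strict operation on the improper string fails instead of silently
# producing wrong output, which is the "just throw" behavior discussed above.
try:
    improper.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

print(has_escaped_bytes(improper), has_escaped_bytes(proper), to_bytes(improper))
```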

melted avatar Feb 24 '25 21:02 melted

In my shell "schemesh" written in Chez Scheme, I am currently converting byte sequences (interpreted as UTF-8b) to Chez Scheme strings (yes, they are UTF-32 internally) and back. It works flawlessly.

The trick I used is a custom C function that calls Schar(0xdc80...0xdcff) and returns to Chez the produced characters, because vanilla (integer->char) intentionally throws for those values.

Only 128 characters need to be produced in this way, so they can be cached in Chez - no need to call C every time.

The other direction, (char->integer), is trivial: it already correctly converts characters in the range #\xdc80 ... #\xdcff
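The caching idea can be sketched as a 128-entry lookup table. This is a Python illustration of the mapping only, not the schemesh implementation: the names `ESCAPED_CHARS`, `byte_to_char` and `char_to_byte` are made up for the example, and the C helper mentioned above is replaced here by Python's `chr`, which accepts surrogate code points.

```python
# One character per byte 0x80..0xFF, i.e. U+DC80..U+DCFF, built once.
ESCAPED_CHARS = [chr(0xDC00 + byte) for byte in range(0x80, 0x100)]


def byte_to_char(byte: int) -> str:
    """Map an undecodable byte (0x80..0xFF) to its cached escape character."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF are escaped")
    return ESCAPED_CHARS[byte - 0x80]


def char_to_byte(ch: str) -> int:
    """Map an escape character U+DC80..U+DCFF back to the original byte."""
    code = ord(ch)
    if not 0xDC80 <= code <= 0xDCFF:
        raise ValueError("not an escaped byte")
    return code - 0xDC00


print(len(ESCAPED_CHARS), hex(ord(byte_to_char(0xFF))), hex(char_to_byte("\udc80")))
```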

[UPDATE] about modifying the functions to also accept bytevectors: I have implemented that too in my schemesh, but in Scheme strings are more convenient. And existing programs would need to be updated to take advantage of them.

Clearly, returning bytevectors instead of strings from (directory-list) would break compatibility with existing programs.

All considered, my proposal to convert file names from/to UTF-8b is equivalent to saying that these functions accept or return UTF-32b Scheme strings: most programs will not even notice the change and will benefit from the fix, while the ones that do notice the UTF-32b characters are currently broken anyway because of this bug.

cosmos72 avatar Feb 25 '25 12:02 cosmos72

[UPDATE 2] This issue is actually more widespread than initially reported.

It also affects Chez Scheme on Windows because filenames there are not required to be valid UTF-16, see https://zaferbalkan.com/surrogates/

And on POSIX systems it also affects all string-based interfaces to the operating system, including at least:

  • (command-line-arguments) : a user or a script may launch Chez Scheme with command line arguments that are not valid UTF-8, for example a file name to be loaded and executed. But (command-line-arguments) will return the arguments after replacing any invalid UTF-8 with the replacement character #\xFFFD - this is internally performed by (utf8->string) - thus the command line arguments will be garbled, and if they refer to a file name, the file will not be found.

  • (open-file-input-port) (open-file-output-port) (open-file-input/output-port) accept file paths as UTF-32 Scheme strings, thus they cannot represent - much less open - any file whose name, when represented in bytes as POSIX file systems do, contains invalid UTF-8.

  • (open-process-ports) and (process) accept arguments as UTF-32 Scheme strings, thus they cannot launch executables whose path contains invalid UTF-8.
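For comparison, Python applies the PEP 383 scheme to all of these operating system interfaces (command line arguments, file names, environment variables) on POSIX systems and exposes the conversion as `os.fsdecode` and `os.fsencode`. The sketch below only shows that another runtime round-trips such strings losslessly. It assumes a POSIX system, and the intermediate string depends on the locale's file system encoding, so only the round trip is checked.

```python
import os

# Bytes as they might arrive in argv or from a directory listing.
raw_arg = b"AAA\xffzzz"

round_trip_ok = None
if os.name == "posix":
    # On POSIX, os.fsdecode uses the file system encoding together with the
    # surrogateescape handler, so undecodable bytes survive as lone surrogates.
    as_text = os.fsdecode(raw_arg)
    # os.fsencode reverses the conversion and restores the original bytes.
    round_trip_ok = os.fsencode(as_text) == raw_arg

print(round_trip_ok)
```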

cosmos72 avatar Feb 26 '25 08:02 cosmos72

UTF-8b doesn’t work for Scheme because it requires unpaired surrogates to be supported as character objects and in strings, which isn’t allowed by R6RS.

The solution devised for R7RS large by John Cowan (a contributor to the Unicode Standard for many years) is described here: https://codeberg.org/scheme/r7rs/wiki/Noncharacter-error-handling

dpk avatar Mar 02 '25 12:03 dpk

Could you kindly help me find the relevant R6RS section that implies unpaired surrogates are not allowed as character objects and in strings?

If that's the case, then strict R6RS compliance is surely difficult to obtain.

On the other hand, the proposal you cited has an unpleasant side effect: it mangles valid UTF-8 file names that happen to decode to valid Unicode "noncharacters" (i.e. codepoints in the range U+FDD0..U+FDEF).

Unicode standard https://www.unicode.org/versions/Unicode15.0.0/ch23.pdf states:

Applications are free to use any of these noncharacter code points internally. They have no standard interpretation when exchanged outside the context of internal use.

If "application" is taken to mean an R6RS-compliant Scheme implementation, then the proposal above is acceptable - although personally I still find it complicated and invasive.

If "application" is taken to mean an R6RS-compliant Scheme program - and I find this interpretation more likely - then in this context the R6RS-compliant Scheme implementation has the role of a library, and should not mangle noncharacters because they are reserved for the application.

cosmos72 avatar Mar 02 '25 15:03 cosmos72

https://www.r6rs.org/final/html/r6rs/r6rs-Z-H-14.html#node_sec_11.11

dpk avatar Mar 02 '25 15:03 dpk

Noncharacter error handling does not mangle anything: mangling implies an irreversible process because some ambiguity would be created. The affected noncharacters (which are not all of the noncharacters in Unicode, only a small and well-defined subset) are safely quoted and unquoted, a reversible and unambiguous process.

I acknowledge there is a type aliasing issue, but this is the best solution available under the constraints that 1. file name objects (and many other strings that come from the operating system, such as environment variables) have to be representable as Scheme strings for backwards compatibility and 2. Scheme strings must be pure sequences of Unicode scalar values

dpk avatar Mar 02 '25 16:03 dpk

I meant mangling as in "C++ function name mangling", which is reversible. Yes, it's a kind of escaping, since noncharacter sequences are reversibly replaced with longer sequences. Still, they are replaced, which means an application can no longer transparently use them.

cosmos72 avatar Mar 02 '25 17:03 cosmos72

under the constraints that 1. file name objects (and many other strings that come from the operating system, such as environment variables) have to be representable as Scheme strings for backwards compatibility

FWIW Racket extends procedures that operate on file paths to accept path-string?s: either strings, treated the usual way, or path? values, which are essentially bytevectors with some invariants.
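The "string or byte path" approach can be sketched as a small normalization step at the boundary of each file procedure. This is a hypothetical Python illustration, not Racket's or Chez Scheme's API; the names `PathLike` and `to_os_bytes` are invented for the example. Strings are encoded strictly as UTF-8, while byte paths are passed through untouched, so names that are not valid UTF-8 remain reachable.

```python
from typing import Union

# A path is either text (treated the usual way) or raw bytes.
PathLike = Union[str, bytes]


def to_os_bytes(path: PathLike) -> bytes:
    """Return the byte sequence to hand to the operating system."""
    if isinstance(path, bytes):
        # Byte paths are used verbatim, whatever their content.
        return path
    # Text paths must be well-formed; strict encoding rejects lone surrogates.
    return path.encode("utf-8")


print(to_os_bytes(b"AAA\xffzzz"), to_os_bytes("naïve.txt"))
```

With this design existing string-based callers keep working, and only callers that need arbitrary byte names have to opt in to the second representation.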

LiberalArtist avatar Apr 01 '25 03:04 LiberalArtist

under the constraints that 1. file name objects (and many other strings that come from the operating system, such as environment variables) have to be representable as Scheme strings for backwards compatibility

FWIW Racket extends procedures that operate on file paths to accept path-string?s: either strings, treated the usual way, or path? values, which are essentially bytevectors with some invariants.

R6RS takes the same approach; filenames can be strings or implementation-defined objects. Chez Scheme does not yet make use of the latter option to represent filenames that cannot be encoded using the "native notation of filesystem paths".

As far as the standard goes, only the command-line procedure is affected by the issue, which should be fixed in some way by #574.

mnieper avatar Apr 23 '25 08:04 mnieper