'cabal check' or 'cabal sdist' ought to warn about non-ASCII characters

Open athas opened this issue 3 months ago • 17 comments

Describe the feature request

Hackage rejects tarballs that contain non-ASCII filenames, but neither cabal check nor cabal sdist complains. I suggest that one of these (probably cabal check) is modified to complain.

Additional context

A user (or bit of automation) can get quite far in a release process before encountering this at the very end.

athas avatar Aug 30 '25 09:08 athas
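
As a rough illustration of the requested check, the sketch below lists tarball members whose names contain non-ASCII characters. It is Python rather than cabal code and does not reflect how cabal check or Hackage is actually implemented; the names `non_ascii_members` and `make_sample_tarball` and the sample file names are hypothetical.

```python
import io
import tarfile


def non_ascii_members(tar_bytes: bytes) -> list[str]:
    """Return member names in the tarball that contain non-ASCII characters."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        return [name for name in tar.getnames() if not name.isascii()]


def make_sample_tarball(names: list[str]) -> bytes:
    """Build an in-memory tarball with one empty file per name (demo helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        for name in names:
            tar.addfile(tarfile.TarInfo(name))
    return buf.getvalue()


sample = make_sample_tarball(["pkg-0.1/src/Main.hs", "pkg-0.1/doc/café.md"])
offending = non_ascii_members(sample)  # names a pre-upload check would flag
```

Running such a check before upload would surface the problem at the start of a release process instead of at the very end.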

While cabal check should certainly check that names are valid for Hackage, this also points up that it's not the early 2000s any more: Hackage should probably accept UTF-8 these days.

geekosaur avatar Aug 30 '25 16:08 geekosaur

I will leave the decision to Hackage admins. It is not as trivial as “UTF-8”; see RFC 9839, plus I guess OS oddities.

ffaf1 avatar Aug 30 '25 16:08 ffaf1

Yes, I figured that part, but plain ASCII is hard to defend in a Unicode world.

geekosaur avatar Aug 30 '25 16:08 geekosaur

I don't think cabal currently uses OsPath, so I'm wondering how you're planning to support UTF-8 correctly. Because base doesn't. It uses a roundtrip format which is invalid UTF-8:

  • https://hackage.haskell.org/package/base-4.21.0.0/docs/GHC-IO-Encoding.html#v:getFileSystemEncoding
  • https://hackage.haskell.org/package/base-4.21.0.0/docs/GHC-IO-Encoding.html#v:mkTextEncoding
  • https://peps.python.org/pep-0383/
  • https://unicode.org/L2/L2009/09236-pep383-problems.html

I'm also generally wondering how you will handle systems that are not UTF-8 (e.g. where you unpack the tarball). Strictly speaking, these issues can also partly happen with ASCII, because there are certainly ASCII incompatible encodings. But it's much more portable than UTF-8. And if your system isn't ASCII compatible, you have far bigger problems.

hasufell avatar Sep 23 '25 08:09 hasufell
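
To make the "roundtrip format" concrete: the GHC documentation linked above describes it as a PEP 383-style escape mechanism, so Python's `surrogateescape` error handler can serve as a stand-in. The sketch below is Python, not GHC, and only illustrates the idea; the sample bytes are made up.

```python
# A file name stored as Latin-1 bytes; a lone 0xE9 is not valid UTF-8.
raw = b"caf\xe9.hs"

# PEP 383 decoding maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
name = raw.decode("utf-8", errors="surrogateescape")

# The escaped string round-trips back to the original bytes...
roundtripped = name.encode("utf-8", errors="surrogateescape")

# ...but it is not valid Unicode text: a strict UTF-8 encoder rejects it.
try:
    name.encode("utf-8")
    strictly_encodable = True
except UnicodeEncodeError:
    strictly_encodable = False
```

The in-memory string therefore cannot be written out as UTF-8 without either failing or applying the same escape convention in reverse.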

I think that change will be so invasive as to piss off a whole bunch of downstreams. I'm surprised we only got one major complaint about https://github.com/haskell/cabal/pull/9718, which was small compared to what this would be.

geekosaur avatar Sep 23 '25 14:09 geekosaur

Converting to OsPath can be done gradually for certain components. Between the components that still use FilePath you can convert with decodeFS/encodeFS... you won't reap the full benefits until everything is converted, but it's a start.

The biggest challenge will probably be the process package, which is not yet OsPath-compatible. There are also a number of unresolved issues with file-io.

hasufell avatar Sep 23 '25 15:09 hasufell

I suspect the biggest problem will be getting it actually done; might require HF funding, will require getting someone to actually do it. And yeh, process will definitely be a problem since we use it heavily.

geekosaur avatar Sep 23 '25 15:09 geekosaur

> I don't think cabal currently uses OsPath, so I'm wondering how you're planning to support UTF-8 correctly. Because base doesn't. It uses a roundtrip format which is invalid UTF-8:

Can you give an example of when this would matter? As long as paths round-trip it seems inconsequential how they are stored in Cabal's memory.

> I'm also generally wondering how you will handle systems that are not UTF-8 (e.g. where you unpack the tarball). Strictly speaking, these issues can also partly happen with ASCII, because there are certainly ASCII incompatible encodings. But it's much more portable than UTF-8. And if your system isn't ASCII compatible, you have far bigger problems.

I too was curious whether there were many systems that POSIX caters to with this ASCII restriction. Couldn't actually find anything except for the first couple variants of FAT.

It also doesn't make the entirety of Hackage unusable on such platforms, if they actually exist. Only packages that contain Unicode file names and are requested to be installed might cause problems.

Restricting to ASCII doesn't actually prevent the most common problem with filesystem compatibility that I've run across, case-insensitivity. All macOS native filesystems still default to being case-insensitive. I've seen this causing problems in Nixpkgs.

toonn avatar Oct 24 '25 16:10 toonn

> Can you give an example of when this would matter? As long as paths round-trip it seems inconsequential how they are stored in Cabal's memory.

Roundtripping doesn't work across systems.

FilePath semantics use the current locale for decoding/encoding, see getFileSystemEncoding:

  • https://hackage.haskell.org/package/ghc-internal-9.1201.0/docs/src/GHC.Internal.IO.Encoding.html#getFileSystemEncoding
  • https://hackage.haskell.org/package/ghc-internal-9.1201.0/docs/src/GHC.Internal.IO.Encoding.Iconv.html#localeEncodingName
  • https://hackage.haskell.org/package/ghc-internal-9.1201.0/docs/src/GHC.Internal.IO.Encoding.Iconv.html#c_localeEncoding
  • https://gitlab.haskell.org/ghc/ghc/-/blob/master/libraries/ghc-internal/cbits/PrelIOUtils.c#L29

It then uses lone surrogates to do roundtripping of undecodable bytes.

Encoding and decoding between different systems is completely undefined. You're very unlikely to receive the original bytes, because the information about the original encoding has been lost.

If you enforce UTF-8 on unix and UTF-16LE on windows, this might work, as in: fail if the current locale is set to anything else or explicitly run setFileSystemEncoding.

But the latter can also have unexpected consequences on systems that, well, aren't actually using those locales for their filepaths.

hasufell avatar Oct 24 '25 20:10 hasufell
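
The cross-system failure can be sketched in a few lines. This Python sketch assumes two hypothetical machines, one with a Latin-1 locale and one with a UTF-8 locale, and stands in for a locale-dependent decode/encode pair like the one FilePath goes through; it is not GHC code.

```python
# Bytes on disk on system A, whose locale is Latin-1.
bytes_on_a = "café.hs".encode("latin-1")

# System A decodes with its own locale and gets the intended code points.
as_string = bytes_on_a.decode("latin-1")

# System B encodes those same code points with *its* locale (UTF-8),
# so the bytes that reach B's disk differ from the ones A started with.
bytes_on_b = as_string.encode("utf-8")

# If B instead receives A's raw bytes and decodes them with its own locale,
# the undecodable byte becomes a lone surrogate rather than "é".
misread_on_b = bytes_on_a.decode("utf-8", errors="surrogateescape")
```

Neither path gives B both the original bytes and the intended characters, because nothing in the bytes records that they were Latin-1.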

What might also be possible is that we enforce UTF-8 at the tar boundary.

E.g. at the sender, we read the filepaths with current locale and receive FilePath (which are unicode codepoints, so sort of a canonical representation). We encode those into UTF-8 (strictly, no roundtripping!) to write the filepath tar bytes. When the receiver gets the tar, they decode the filepath tar bytes via UTF-8 again. Then we have FilePath. When writing the files to disc now via base functions, they're converted from unicode code points to whatever the current locale is.

This does not maintain the original bytes from the sender! But it specifies that the pack/unpack part is enforced unicode.

This might require a patch to the tar package.

hasufell avatar Oct 24 '25 21:10 hasufell
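
The scheme proposed above could be sketched as follows. This is an illustrative Python sketch, not a patch to the tar package; the helper names `to_tar_name` and `from_tar_name` and the example locales are assumptions.

```python
def to_tar_name(disk_bytes: bytes, sender_locale: str) -> bytes:
    """Sender: decode with the local encoding, then encode as strict UTF-8."""
    code_points = disk_bytes.decode(sender_locale)  # strict: no roundtrip escapes
    return code_points.encode("utf-8")              # strict UTF-8 inside the tarball


def from_tar_name(tar_bytes: bytes, receiver_locale: str) -> bytes:
    """Receiver: decode strict UTF-8, then encode with the local encoding."""
    code_points = tar_bytes.decode("utf-8")
    return code_points.encode(receiver_locale)


# Sender uses Latin-1, receiver uses UTF-8.
sender_bytes = "café.hs".encode("latin-1")
in_tarball = to_tar_name(sender_bytes, "latin-1")
receiver_bytes = from_tar_name(in_tarball, "utf-8")

# A receiver whose locale cannot represent "é" fails loudly instead of guessing.
try:
    from_tar_name(in_tarball, "ascii")
    ascii_receiver_ok = True
except UnicodeEncodeError:
    ascii_receiver_ok = False
```

The receiver ends up with different bytes than the sender had, but both sides agree on the code points, and any name that cannot be represented is rejected at a well-defined step.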

I don't see where the different systems come in. The only part that is transferred to another system is the tarball downloaded from Hackage, no? And the package name, I suppose. But neither of those would use Cabal's internal representation?

Why is any encoding into UTF-8 required? The tarball should simply be created without the ASCII (POSIX compliant file names really) limitation.

toonn avatar Oct 28 '25 16:10 toonn

  1. If you're using tar, you're using POSIX.
  2. POSIX limits less than you think: only NUL and / are special. You can even use e.g. BIG5 with it. It's ensuring other systems can make sense of the tarball's contents that brings in e.g. UTF-8 encoding, as an existing and widely implemented and understood standard.

geekosaur avatar Oct 28 '25 17:10 geekosaur

> 1. If you're using tar, you're using POSIX.

Not sure what you mean by that. Both GNU Tar and Libarchive "BSD" Tar support more than POSIX portable file names, as does the Haskell library.

> 2. POSIX limits less than you think: only `NUL` and `/` are special. You can even use e.g. BIG5 with it.

Yes and no, POSIX does have a concept of portable file names, which allows only a subset of ASCII.

But all of this is rather beside the point. Hackage checking that tarballs only contain portable filenames is what prevents this, no? Allowing a broader set of file names on Hackage, ideally opaque byte sequences, doesn't seem like it'd be very difficult.

The interpretation of the byte sequences is a UI choice and is only relevant when showing the file names, be that on Hackage or in Cabal output and a choice will have to be made on what to do with bytes that do not form correct UTF-8. Is this assumption on my part wrong?

> It's ensuring other systems can make sense of the tarball's contents that brings in e.g. UTF-8 encoding, as an existing and widely implemented and understood standard.

This is where the apparent rarity of file systems that only support a small subset of byte sequences comes in. Are there really systems that would not be able to support this in practice? And does it matter because they would only run into the problem when a project depends transitively on a package with problematic file names in the tarball?

Would such file systems not be likely to also have restrictions on something like the length of file names and paths?

toonn avatar Oct 28 '25 20:10 toonn
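
For reference, the POSIX "portable file names" mentioned in the comment above are limited to the characters A-Z, a-z, 0-9, period, underscore and hyphen-minus. The Python sketch below checks a path against that set; the function name `is_portable_path` is hypothetical, and whether Hackage's actual check matches this set exactly is not established in this thread.

```python
import re

# POSIX portable filename character set: A-Z a-z 0-9 . _ -
# POSIX also advises against a leading hyphen-minus, so the first
# character is matched against the set without "-".
_PORTABLE = re.compile(r"[A-Za-z0-9._][A-Za-z0-9._-]*")


def is_portable_path(path: str) -> bool:
    """True if every '/'-separated component uses only portable characters."""
    components = [c for c in path.split("/") if c]
    return bool(components) and all(_PORTABLE.fullmatch(c) for c in components)
```

Note that this set is stricter than plain ASCII: a name containing a space is ASCII but not portable in the POSIX sense.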

> I don't see where the different systems come in. The only part that is transferred to another system is the tarball downloaded from Hackage, no? And the package name, I suppose.

Yes.

You have:

  • system A: the creator of the tarball
  • system B: the unpacker of the tarball

Let's say system A uses UTF-8 and system B uses EUC-JP, which is ASCII-compatible but not UTF-8.

If you enforce UTF-8 in general in cabal, then we have a regression on system B.

If we just unpack without caring about the local encoding on system B, then the end user will potentially get garbled output or decoding errors somewhere down the line.

If OsPath is consistently used, then decoding errors are not possible. But you'll most likely still get garbled output in e.g. your file manager. If we explicitly convert to the local encoding (encoding A -> UTF-8 -> encoding B) then we lose the original bytes, but attempt to preserve the meaning of the Unicode code points.

This is why I'm saying that ASCII is more portable. If we venture out of it, we have to make decisions for those encodings that are ascii compatible, but not UTF-8.

hasufell avatar Oct 29 '25 00:10 hasufell
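
The two options for system B can be compared in a small Python sketch. It assumes a made-up file name and uses Python's `euc_jp` codec to stand in for system B's locale; it does not model cabal or OsPath themselves.

```python
name = "日本語.hs"

# System A (UTF-8) packs the name's bytes into the tarball unchanged.
bytes_from_a = name.encode("utf-8")

# Option 1: system B writes those bytes as-is (opaque handling).
# The bytes survive, but a tool on B that assumes EUC-JP cannot show the name.
try:
    shown_on_b = bytes_from_a.decode("euc_jp")
except UnicodeDecodeError:
    shown_on_b = None  # the tool hits a decoding error instead of showing mojibake

# Option 2: transcode A -> code points -> B. The name displays correctly on B,
# but the bytes on disk are no longer the ones system A had.
bytes_on_b = bytes_from_a.decode("utf-8").encode("euc_jp")
```

Either the bytes or the readable name is preserved on system B, not both, which is the trade-off that staying within ASCII avoids.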

Are there actually file systems that restrict file names to EUC-JP? The contents of files are already not restricted to ASCII, so I assume you're not talking about the encoding for plain text.

toonn avatar Oct 31 '25 19:10 toonn

@toonn I don't really understand your remark. And no, I don't believe there are such filesystems.

hasufell avatar Nov 02 '25 15:11 hasufell

Does Tar not treat paths as opaque byte sequences? How does Cabal's internal handling of UTF-8 come into the issue?

toonn avatar Nov 04 '25 18:11 toonn