team icon indicating copy to clipboard operation
team copied to clipboard

Cross-platform file system abstractions

Open killercup opened this issue 6 years ago • 70 comments

(moderated summary by the WG)

  • .., .., // not being auto- handled
  • windows gotchas, like long file paths (which require extended-length paths aka "verbatim paths" as they're called in std)
    • https://github.com/dherman/verbatim
    • https://github.com/rust-lang/rust/issues/32689
    • https://github.com/BurntSushi/ripgrep/issues/364
    • How do we deal with these and relative paths?
  • absolute paths on Windows (UNC prefices)
    • https://github.com/notion-cli/notion/blob/master/crates/verbatim/src/lib.rs
  • UTF-8 filenames on MacOS (the filesystem normalizes UTF-8 strings)
    • https://github.com/BurntSushi/ripgrep/issues/845
  • atomic file writes
  • Path.join: A method that both concatenates a path or completely replaces the existing one based on the input ... when did one ever have that use-case?
  • "Simple" API that comes at the cost of performance for rapid prototyping / throwaway "scripts", see also https://github.com/rust-lang-nursery/cli-wg/issues/26
    • Priority should be on working out how the normal rust stuff should behave before wrapping it up in the high level, python-pathlib-like API.
  • format! a path for a command line argument or mutating a path can cause problems with the non-UTF8 nature of OsStr

@killercup's original post

In the first meeting, we talked a bit about the pain points of dealing with files and path in a cross-platform manner.

One idea is to create or improve crates that provide higher-level abstractions than std in this area.

killercup avatar Feb 20 '18 19:02 killercup

From the etherpad:

  • file path handling
    • .., .., // not being auto- handled
    • windows gotchas, like long file paths (which require extended-length paths aka "verbatim paths" as they're called in std)
    • absolute paths on Windows (UNC prefices)
      • https://github.com/notion-cli/notion/blob/master/crates/verbatim/src/lib.rs
    • UTF-8 filenames on MacOS (the filesystem normalizes UTF-8 strings)

killercup avatar Feb 20 '18 19:02 killercup

Maybe also relevant (also from the etherpad):

  • streaming
    • how to thread progress computation through chained readers
    • handling different archive formats (this may be specific to me but I bet others will hit it)
      • how to stream zip files (no crates.io crate supports this AFAIK)
      • how to compute progress percentage of decompressing files

killercup avatar Feb 20 '18 19:02 killercup

I want to simply dump this here: I'm the author of imag and we have a rather nice abstraction for FS access... maybe you can learn from that. Our abstraction is rather specific for our use-case (each file has a TOML header and a "content" section), but the basic concept (a "store" which holds the files and they can then be borrowed from that store, so that concurrent access to one file is not possible) could be useful for other CLI apps.

More information about the Store can be found in its source or our (rather not-so-up-to-date, as we are basically a one-dev-project) documentation on the store. I'd say the source is a better source of information, though.

matthiasbeyer avatar Mar 01 '18 22:03 matthiasbeyer

One thing that we talked about at Mozilla which would be useful in Firefox, as well as possibly useful to expose to the web, would be an API for doing atomic file writes in "small" files.

As I understood it, the best way to do truly atomic writes efficiently is to copy the file, while inserting the desired modifications, into a new file. Once the new file has been written issue a rename to the original file name.

Alternatively, if the file is small enough and multiple modifications are expected, read the file contents into memory and each time that a modification should be done write out a new file and rename to the desired file destination.

The nice thing with this approach is that it can be done without complex journaling files, and without needing to call flush() which can be quite a perf bottleneck.

This was something that we were thinking of using for the numerous small configuration/database files that Firefox keeps.

I think something like this would fit well with Rust's focus on performance and safety.

I'm not sure that this is particularly CLI specific though. But since other filesystem stuff was discussed I figured I'd mention it.

sicking avatar Mar 02 '18 09:03 sicking

windows gotchas, like long file paths (which require extended-length paths aka "verbatim paths" as they're called in std)

For some reason I thought the APIs in std did this internally but I was wrong. :( I believe Python's standard library converts all paths to verbatim paths internally nowadays. This is unfortunate in that it's really easy to run into the 256 character limit in practice, but I'm not sure there's a way to square this with the zero copy philosophy behind things like OsStr and Path.

AFAIK it's OK to use verbatim paths in all Windows APIs, so perhaps a crate that provides a small wrapper type would be good enough? Something like VPath { inner: Cow<Path> }, and then VPath::new would take AsRef<Path>, and if it has the \\?\ prefix borrow it, if not prepend it to an owned PathBuf. The real trick would be enforcing that all your APIs take this path type, so that you can ensure you're only ever passing verbatim paths to OS APIs. Without having it in std that's going to be a bit of a pain.

(If someone writes such a crate I would also like it to have an easy to_wide method for getting a wchar_t*-compatible buffer to pass directly to Windows APIs using ffi. I've found OsStrExt::encode_wide clunky to use in practice.)

luser avatar Mar 21 '18 13:03 luser

absolute paths on Windows (UNC prefices) https://github.com/notion-cli/notion/blob/master/crates/verbatim/src/lib.rs

FYI this link 404s now, the code seems to have moved to https://github.com/dherman/verbatim

luser avatar Mar 21 '18 13:03 luser

FYI this link 404s now, the code seems to have moved to https://github.com/dherman/verbatim

...and after actually reading a little, it sounds like that crate is about 90% of what I proposed in my previous comment. :)

luser avatar Mar 21 '18 13:03 luser

See also:

  • https://github.com/rust-lang/rust/issues/32689
  • https://github.com/BurntSushi/ripgrep/issues/364

I would be interested to here more about how other ecosystems solve this. In particular, how do they deal with relative paths?

BurntSushi avatar Mar 21 '18 13:03 BurntSushi

@killercup I believe normalisation on happens on HFS+ on macOS, I don't believe APFS does normalisation on the mac.

XAMPPRocky avatar Mar 21 '18 13:03 XAMPPRocky

On Unicode normalization and HFS: https://github.com/BurntSushi/ripgrep/issues/845

BurntSushi avatar Mar 21 '18 13:03 BurntSushi

AFAIK it's OK to use verbatim paths in all Windows APIs, so perhaps a crate that provides a small wrapper type would be good enough?

My thought was to just start with a "normalize_path" crate/function that people can use regardless of what path API they are using, kind of like normalize-line-endings.

This crate would handle

  • removing extraneous /, ., etc
  • Resolve .. where possible (e.g. path/../file -> file)
  • Convert to verbatim paths

Then on top of this we could look into an "easy paths" api that is more like python's pathlib combined with easy_strings to help with the prototyper case (easy to reach for needed utilities, less concern for borrow checker at cost of memory or cpu time). This would call "normalize_path" on any untrusted input. I've been providing feedback on ergo_fs in hopes of going in this direction.

In particular, how do they deal with relative paths?

What is your concern with relative paths?

epage avatar Mar 21 '18 13:03 epage

What is your concern with relative paths?

From MSDN:

Because you cannot use the "\\?\" prefix with a relative path, relative paths are always limited to a total of MAX_PATH characters.

BurntSushi avatar Mar 21 '18 13:03 BurntSushi

Resolve .. where possible (e.g. path/../file -> file)

This is not generally possible in POSIX because of symlinks: if a/b/c is a symlink to a/d, then a/b/c/../foo is actually a/foo, not a/b/foo. Now that modern versions of Windows support symlinks too, I suspect that's also an issue there.

That said, Rust kind of has this already in the form of std::fs::canonicalize(). It's still kind of clunky, though, because it requires every path component to exist. Compare the Python pathlib Path.resolve() method: "If strict is False, the path is resolved as far as possible and any remainder is appended without checking whether it exists." That means you can normalise your input and output paths while parsing your command-line, without having to create them first.

Screwtapello avatar Mar 21 '18 14:03 Screwtapello

I'm mixed on what I'd expect for that symlink scenario. Either way, I assume that with a strict=False, we would allow that symlink bug to go through.

But for pathlib, apparently, the symlink bug is always there.

I'm surprised though, I thought handling of .. was part of parsing the path and not from an explicit operation (resolve).

epage avatar Mar 21 '18 14:03 epage

But for pathlib, apparently, the symlink bug is always there.

Pathlib should actually be OK: in my example, as it resolves a/b/c/../foo it will read the symlink c producing a/d/../foo, so by the time it processes the .. it will get the right answer.

I thought handling of .. was part of parsing the path

In POSIX, the kernel just takes each path segment and hands it to the filesystem to resolve, and the filesystem physically records a .. entry on disk that points to the parent directory. Since filesystem is the single source of truth, getting the right answer requires actual disk I/O. :(

Plan9 behaves the way you expect, but then it doesn't have symlinks so this whole situation just isn't a problem there.

Screwtapello avatar Mar 21 '18 14:03 Screwtapello

One of the issues that would need to be addressed in a better way is PathBuf.join.

A method that both concatenates a path or completely replaces the existing one based on the input ... when did one ever have that use-case? I'd guess that pretty much every piece of code using join without first checking whether the argument is relative or absolute is subtly broken.

The problem is not that the method exists, but that the individual operations that this method combines do not exist as their own methods.

Considering this and the Windows-related issues, I think this makes a good case for having separate types for absolute and relative paths.

soc avatar Mar 23 '18 18:03 soc

I've tried to do a quick and dirty summary of this. Please point out where I need to expand it!

I know on either users or reddit, I saw complaints about

  • handling of relative paths
  • Handling of paths to/from config files.
    • EDIT: Maybe this? https://github.com/rust-lang-nursery/cli-wg/issues/7#issuecomment-375292628

Anyone know where these were so we can reach out to the people that have concerns? Or add your own thoughts on these topics?

epage avatar Mar 23 '18 19:03 epage

A method that both concatenates a path or completely replaces the existing one based on the input ... when did one ever have that use-case?

I'd expect that to be the majority use-case for dealing with user-specified paths.

  • I take an input file from the command-line, path/to/my/input
  • The input file contains a directive like src = ../some/data, or possibly src = /usr/share/myapp/data
  • My program does let src_path = input_path.parent().unwrap().join(input_data.src) and I get the right answer

The wrinkle in the above is unwrapping the result of .parent(), but as long as my program always makes input_path into an absolute path before working with it, that should be fine.

Screwtapello avatar Mar 24 '18 05:03 Screwtapello

Join is useful to make absolute paths from relative paths, which in turn are useful for nearly all cfg files, for both sharing across OSes and moving cfg files. If you can guarantee some validations of the user made paths on a cfg (is relative, doesn't use any (not just the current) OS forbidden characters except '/' and closed quotes, len(cfgpath.parent().join(relativepath)) isn't larger than MAX_PATH_LENGTH in the current platform ) you can even assure it's somewhat OS portable.

Though i suppose it's better than all those preconditions are specified in a higher level cfg abstraction, join could still be be useful (in spite of being easy to blow up by passing a absolute path to the suffix).

It would have to deal with stuff like join("C:\", "relative/path") to be useful though (and the constructor with new("C:\relative/path"). Much C code assumes the current platform separator, even if it's not the only one when iterating over paths.

I think java solution here was to make 'path' iteration be at directory granularity? Or path is not iterable i don't remember.

i30817 avatar Mar 24 '18 10:03 i30817

@Screwtapello

I'd expect that to be the majority use-case for dealing with user-specified paths.

I think it's highly dangerous that

path.join("foo/").join("bar")
path.join("foo").join("/bar")

do completely different things.

I don't think most people will expect that "joining" one path to an existing path can destroy their existing path.

I think a serious path implementation should clearly separate those operations, e. g.

AbsolutePath::join(&self, path: RelativePath)
AbsolutePath::replace(&self, path: AbsolutePath)
RelativePath::join(&self, path: RelativePath)

"joining" an absolute path should not even compile.

soc avatar Mar 24 '18 11:03 soc

I am just seeing this now.

First, I'd like to announce the release of 0.4.0 of path_abs which now uses "absolute" instead of "canonicalized" paths. This means that you can have a path with symlinks that may or may not exist, but PathAbs will always start with the "canonical" root directory. See the docs for more.

I'm going to just do a checklist of things from this thread and open issues that aren't covered:

File path handling

  • [x] .., ., //not being auto- handled: this is handled in v0.4.0 by the PathAbs type (which all types other than PathArc depend on). Note that .. is handled semantically (a/b/../d is always a/d), the idea being that if you wanted to actually follow the symlink then you would use canonicalize().
  • [x] relative paths are converted to absolute using env::current_dir and then resolving.
  • [x] windows gotchas, like long file paths (which require extended-length paths aka "verbatim paths" as they're called in std): this should be handled by 0.4.0, but more testing is needed. Basically all path roots are forced canonical by calling canonicalize() on their root if they are not canonical.
  • [x] absolute paths on Windows (UNC prefices): done in 0.4.0
  • [x] guaranteed serialization/deserialization of paths(using stfu8)

Edit: I missed some

  • ~~[ ] UTF-8 filenames on MacOS (the filesystem normalizes UTF-8 strings)*: I have no idea about this, I'll have to look at the ripgrep issue more (BurntSushi/ripgrep#845)~~ This looks decidedly won't fix
  • [x] atomic file writes: path_abs::{FileWrite / FileEdit} try to solve this (at least partially) by not permitting writing to a file with only an immutable reference.
  • [ ] Path.join: A method that both concatenates a path or completely replaces the existing one based on the input ... when did one ever have that use-case?: vitiral/path_abs#12

vitiral avatar Mar 24 '18 15:03 vitiral

Also, ergo_fs is related to this discussion.

vitiral avatar Mar 24 '18 15:03 vitiral

Edit: I moved my comment about weird join behavior to a path_abs issue instead:

https://github.com/vitiral/path_abs/issues/12

vitiral avatar Mar 24 '18 15:03 vitiral

The full glory of handling Windows paths: https://googleprojectzero.blogspot.de/2016/02/the-definitive-guide-on-win32-to-nt.html

TL;DR: Depending on how low-level you want the API to be, you have to handle 7 different path types.

soc avatar Mar 24 '18 18:03 soc

@soc those should all be handled by path_abs here.

If you think one is missing please open a ticket!

vitiral avatar Mar 24 '18 18:03 vitiral

I think if one makes some simplifying assumptions, like

  • restricting the valid characters to a common subset
  • not leaking the actual representation of the paths from the AbsolutePath/RelativePath types
  • making sure that invalid operations – like adding an absolute path to a relative path – are rejected by the typesystem
  • using a consistent "platform-independent" format in the struct and to serialize and deserialize such paths
  • only converting paths to the required representation (e. g. Windows) at the last possible moment when interacting with the actual Win APIs

then it's possible to get away with something that might not work for 100% of all legacy applications, but might work very well for 95% of applications to be written, i. e. saying "this crate might not be for you, if you need paths like \\asd:\...\?\.//NUL"

soc avatar Mar 24 '18 19:03 soc

restricting the valid characters to a common subset

What happens if you do this in your CLI tool and an end user files a bug report that says something like, "I tried using your tool to work with file paths like {foo}, but it returned some error about it being an invalid path, but it is a valid path on my system." (Where {foo} contains a character not in the common subset but is otherwise a valid file path on the user's system.) What do you say to this user? Or have I misunderstood your proposed restriction?

BurntSushi avatar Mar 24 '18 19:03 BurntSushi

"Use a supported name or go back to std::Path".

The point is that the use case is largely around files and paths whose name and placement is within control of the author. This is very different from tools that need to deal with every random path/file name that might exist in the file system, or is arbitrarily chosen by users.

Especially in the case of config files its desirable to have an API that sorts out interop issues from the beginning, and does not fail mysteriously when the tool is run on a different platform.

For config files it's not interesting whether the path would be permissible on one platform, because config files are probably the prime example where you absolutely want to move config files between operating systems without rewriting them.

I think it should actually be one of the core requirements of a config system to enforce that the chosen paths and file names work on all platforms.

soc avatar Mar 24 '18 19:03 soc

This is very different from tools that need to deal with every random path/file name that might exist in the file system, or is arbitrarily chosen by users.

I don't think this is a tenable requirement. A very large % of CLI applications have to deal with an arbitrary path by the user. I would hazard that most CLI applications at least accept paths as arguments (via the command line or config file), and often they directly interact with files in some way or the other.

vitiral avatar Mar 24 '18 19:03 vitiral

@soc This isn't isn't about config files? I'm confused.

BurntSushi avatar Mar 24 '18 20:03 BurntSushi