wasi-filesystem
Supporting an "initial current directory"
Many applications today (or at least many CLI apps) rely on the idea of a "current directory" of the calling process (e.g. getcwd and chdir). Currently, though, WASI doesn't define what it means to have a current directory and implementations like wasi-libc don't have a getcwd symbol. For ease of porting applications, however, I think it might be good to support this concept.
I'm not entirely sure if this actually needs to manifest itself as new WASI APIs, however. They're even more low level than libc typically is and it may be possible to get away with having the concept of a current directory being entirely within wasi-libc. I wanted to open the issue here, though, to see if others felt the same and have some discussion with respect to WASI itself rather than just wasi-libc.
One idea I've got is that wasi-libc could interpret preopened paths as either absolute (starting with /) or relative to the root (those that don't start with /). Next a new syscall would be added:
(module $wasi_ephemeral_proc
  ;;; Returns the initial current directory of the application
  (@interface func (export "current_dir")
    (result $path string)
  )
)
And then wasi-libc would contain emulation of getcwd and chdir as necessary. This would allow applications which want to print paths relative to the current directory to be able to print values appropriately and applications could also be started in arbitrary locations as decided by the embedder.
In any case, I'm curious whether others have thought about this too, and whether there's interest in supporting it.
Can we treat this as two separate issues:
1. Can we add a fake/userspace concept of PWD to wasi-libc?
2. Can we seed this PWD on startup from the environment?
I think just fixing (1) will have a lot of benefits on its own. TBH I'm not sure why we don't already have this concept as part of the pre-open code which emulates relative paths.
We can consider (2) separately. We could just use $PWD, although we have avoided environment variables where possible so this seems contrary to our design guidelines. I wonder if we could avoid adding a new syscall by defining some kind of convention based on the existing pre-open API. Maybe the first pre-open is always the initial PWD? Could we just start doing this in wasi-libc without any WASI API changes?
We do have a very minimal concept of a PWD. If you have a relative path, the preopen code in wasi-libc will look for a preopen named "." and use the associated handle as the base. One simple (though not strictly POSIX-conforming) implementation of getcwd that would likely work in a lot of cases would be to just return ".".
If we want chdir too, I also agree that we could probably get quite far with an emulated version in libc. I imagine libc would maintain global state holding a string and a file descriptor, which would default to "." and -1, and then when you call chdir with an absolute path, it does a preopen lookup and stores the resulting relative path and file descriptor in them. chdir on a relative path would update the relative path accordingly and leave the file descriptor unmodified. We could even support fchdir, by setting the state to "." and the given file descriptor.
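The state machine described above can be sketched in a few lines. This is only an illustration written in Python for brevity, not wasi-libc's actual code: the class name `EmulatedCwd`, the shape of the preopen table, and the descriptor numbers are all assumptions made for the example.

```python
import posixpath


class EmulatedCwd:
    """Hypothetical userspace cwd state: a base descriptor plus a relative path."""

    def __init__(self, preopens):
        # preopens maps absolute guest prefixes to descriptors (illustrative values).
        self.preopens = preopens
        self.fd = -1          # -1: no explicit base yet, fall back to the "." preopen
        self.rel_path = "."

    def _lookup(self, abs_path):
        # Longest-prefix match of an absolute path against the preopen table.
        best = None
        for prefix, fd in self.preopens.items():
            p = prefix.rstrip("/")
            if abs_path == p or abs_path.startswith(p + "/"):
                if best is None or len(p) > len(best[0]):
                    best = (p, fd)
        if best is None:
            raise FileNotFoundError(abs_path)
        rel = abs_path[len(best[0]):].lstrip("/") or "."
        return best[1], rel

    def chdir(self, path):
        if path.startswith("/"):
            # Absolute path: preopen lookup replaces both descriptor and path.
            self.fd, self.rel_path = self._lookup(posixpath.normpath(path))
        else:
            # Relative path: lexical update only. Symlinks and ".." escaping
            # the base descriptor are deliberately not handled in this sketch.
            self.rel_path = posixpath.normpath(posixpath.join(self.rel_path, path))

    def fchdir(self, fd):
        self.fd, self.rel_path = fd, "."
```

With a table such as `{"/": 3, "/data": 4}`, calling `chdir("/data/logs")` leaves the state at descriptor 4 and path `logs`, and a following `chdir("2020")` only changes the path to `logs/2020`.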
AFAIK most usage in Rust of the current directory is used to make relative paths absolute, so I don't think we can return "." from this function (apparently the online specification also says it returns an absolute path). This is where I think we'll have to default it to something (and / I think may be the best bet?). If we add the concept of a current directory though we'll also need, I think, to define what it means for preopened paths to be relative, especially if you can chdir around.
I mostly wanted to open this issue to see if others had different ideas for how to implement this. I agree that if we pick a strategy where it's "mostly emulated" then we can split this into a possible new API addition and a wasi-libc issue. I do think we want to put as much of this as we can into wasi-libc, though.
WASI is attempting to hide host absolute paths from applications. People often do expose them today, because it's one of the most convenient ways to use preopens in practice today, though it isn't necessary -- you can use things like --mapdir= in some wasm engines to hide host paths.
Often the reason programs canonicalize paths to absolute is to send those paths to other programs with different cwds. For those use cases, just using / or making up a fake path like /cwd or so wouldn't be sufficient, because it wouldn't be the real host path.
Eventually, we'll have ways for WASI instances to spawn new WASI instances and pass them files and directories and such, so the question is, can we get by without requiring programs to know the actual host cwd?
Hm so my main point is that applications today rely on getcwd and such, and the lack of support for this today is a hurdle to overcome when porting programs to WASI. Without a standardized solution each application will end up making specific solutions that aren't necessarily compatible, so I think given the prevalence of this it would be worthwhile to standardize something.
I don't think there's any need to expose absolute host paths or even the real host cwd to applications. I imagine that when setting up a WASI execution you'll basically always set up a virtual filesystem with mappings from the host to the guest, and there's no need for any of them to actually match.
maybe it would make sense to establish cwd as some sort of standardized preopen (not required to be present, of course)
Ah, if you're doing WASI parent to WASI child, and sharing the same virtual mapping, then yes, something like this could make sense. And I agree with @devsnek and other comments above, that this does sound closely related to preopens, so that's a reasonable place to start.
Isn't step 1 just making getcwd(), chdir() work as expected and having that be honored by open/stat/etc when passed relative paths? This can be done with the default starting PWD just being "/" (in the virtual filesystem).
Then step 2 would be to decide how we might want to specify the default starting directory? It might be that most users are happy with just step 1.
I agree. The algorithm I sketched out above would be a good place to start, if anyone's interested. Adding in an ability to have a starting current directory other than "." would be natural to add on top of that.
I've taken an initial stab at emulation at https://github.com/WebAssembly/wasi-libc/pull/214 for wasi-libc.
One use-case for WASI is for shipping CLI tools. Imagine an existing cross-platform C++ codebase. We can ship native binary packages for various Linux-es, OS X, Windows and what else. Or we can build and ship a single WASM file. Looks like a clear win, doesn't it?
For this to work, the behaviour of a WASI build should be as close as possible to the native build. The proposed emulation approach creates several behavioural differences:
- native programs have a working directory regardless of PWD;
- getcwd reflects the up-to-date path even if the directory was renamed in the meantime;
- getcwd returns a path with symbolic links expanded; emulation will suffer from TOCTOU.
Therefore I propose to extend WASI with explicit chdir and getcwd. path_open could use AT_FDCWD (or similar special value) in place of a file descriptor to refer to the current directory (as openat does).
Alternatively, an API to retrieve the current file path from a directory descriptor could work, but it is tricky to implement on some platforms.
One use-case for WASI is for shipping CLI tools. Imagine an existing cross-platform C++ codebase. We can ship native binary packages for various Linux-es, OS X, Windows and what else. Or we can build and ship a single WASM file. Looks like a clear win, doesn't it?
Yep! We have a bunch of work to do for it to be a clear win in practical terms, but we want WASI to support great CLI tools.
WebAssembly/wasi-libc#214 is a first step to chdir and getcwd support. It doesn't yet resolve symlinks, and doesn't have AT_FDCWD yet, and there may be some subtleties about how chdir works that we'll need to iterate on, but we can do those in followup steps. I think we'll also be able to add realpath support.
@sunfishcode It's great that wasi-libc is taking steps towards being more POSIX-compatible. I am also happy to learn that CLI tool support is considered important.
While an incremental approach is usually a good thing, I am concerned that it might result in hard-to-debug problems for the end users. CLI tool authors, especially when coming from the context of native development, are unlikely to be aware of the limitations in wasi-libc. They are going to find that chdir is supported and appears to be working. They are unlikely to bother checking obscure corner cases.
I'm afraid that the difference between native filesystem semantics and WASI is going to cause pain for end users, developers and will ultimately hurt the adoption.
Realising that WASI is work in progress, I'd like to raise awareness early on that the current WASI API might need extensions in order to better match the native filesystem semantics.
Below I highlight challenges in a 'pure' libc implementation on top of the current WASI.
Current working directory
A native program has a working directory regardless of PWD; getcwd might yield something very different from PWD as the snippet below demonstrates:
$ mkdir -p /tmp/a/b/c &&
ln -fs a/b/c /tmp/d &&
cd /tmp/d &&
mv /tmp/d /tmp/e &&
python -c 'import os;print(os.getenv("PWD"),os.getcwd())' &&
stat /tmp/d
('/tmp/d', '/tmp/a/b/c')
stat: cannot stat '/tmp/d': No such file or directory
The path stored in PWD might even no longer be valid.
I believe that WASI should provide a robust method to get the current working directory.
It is necessary to define what happens if the current working directory is not within a mapped subtree.
Concurrent modifications to the filesystem
Filesystem is a shared resource therefore unrelated programs could make changes to the filesystem concurrently, including renaming our current working directory. Native getcwd yields an up to date path.
A getcwd emulation returning the current working directory path stored internally (updated on chdir) will break the semantics.
Another option is to compute the current working directory path dynamically. The idea is to iteratively open the parent directory, iterate dentries, and match name by inode number. This is less efficient of course; dirent lacks device id (filestat has it, hence path_filestat_get is necessary).
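The parent-walking idea can be sketched against a native filesystem as follows. This is a minimal Python illustration, not a WASI implementation: the function name `path_by_inode_walk` and the `root` parameter (standing in for the root of a mapped subtree) are assumptions for the example, and it relies on the host reporting real device and inode numbers.

```python
import os
import tempfile


def path_by_inode_walk(start, root):
    """Rebuild the path of `start` below `root` by matching (st_dev, st_ino)."""
    rst = os.stat(root)
    root_id = (rst.st_dev, rst.st_ino)
    st = os.stat(start)
    here = (st.st_dev, st.st_ino)
    path, parts = start, []
    while here != root_id:
        parent_path = os.path.join(path, "..")
        pst = os.stat(parent_path)
        parent = (pst.st_dev, pst.st_ino)
        if parent == here:
            # ".." is the directory itself: we hit the real root without meeting `root`.
            raise FileNotFoundError("start is not inside root")
        name = None
        with os.scandir(parent_path) as entries:
            for entry in entries:
                try:
                    # A full stat per entry, since the listing alone lacks the device id.
                    est = entry.stat(follow_symlinks=False)
                except OSError:
                    continue  # entry vanished while we were scanning
                if (est.st_dev, est.st_ino) == here:
                    name = entry.name
                    break
        if name is None:
            raise FileNotFoundError("directory is no longer linked from its parent")
        parts.append(name)
        path, here = parent_path, parent
    return "/" + "/".join(reversed(parts))


def demo_inode_walk():
    # Build tmp/a/b and recover "/a/b" relative to tmp.
    with tempfile.TemporaryDirectory() as tmp:
        leaf = os.path.join(tmp, "a", "b")
        os.makedirs(leaf)
        return path_by_inode_walk(leaf, tmp)
```

The per-entry stat call is where the cost comes from, and it is also the step that stops working if a runtime reports 0 for every device id and inode number.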
Wasmer currently returns 0 for device id/inode number. I'm unsure if exposing host device id/inode number is acceptable privacy-wise. Scrambling these IDs in a WASM runtime is definitely non-trivial.
To conclude, it looks like we can't currently emulate getcwd to return an up to date path.
Symlinks in current working directory path
Native getcwd returns path with symlinks expanded. We don't have this issue if the path is uncached.
If we choose to cache the path and update it on chdir, it is necessary to expand symlinks manually. This is feasible by processing one path component at a time, opening next directory with O_NOFOLLOW. If open fails then we know that a path component is a symlink, perform readlink and recurse. It is possible to make it robust even when filesystem is modified concurrently, but the implementation is definitely involved.
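A rough sketch of that component-by-component expansion, written in Python against a native POSIX host. The function name `expand_symlinks` and the link limit are assumptions for illustration; the error codes checked are those I would expect from Linux and the BSDs when O_NOFOLLOW meets a symlink, and the sketch makes no attempt at the concurrency hardening mentioned above.

```python
import errno
import os
import tempfile

_DIR_FLAGS = os.O_RDONLY | os.O_DIRECTORY


def expand_symlinks(path, max_links=40):
    """Resolve an absolute directory path one component at a time."""
    # Stack of components still to process; the next one is at the end.
    pending = [c for c in path.split("/") if c not in ("", ".")][::-1]
    resolved, links = [], 0
    dir_fd = os.open("/", _DIR_FLAGS)
    try:
        while pending:
            name = pending.pop()
            try:
                next_fd = os.open(name, _DIR_FLAGS | os.O_NOFOLLOW, dir_fd=dir_fd)
            except OSError as e:
                if e.errno not in (errno.ELOOP, errno.ENOTDIR, errno.EMLINK):
                    raise
                # Probably a symlink; readlink raises if it is some other file type.
                target = os.readlink(name, dir_fd=dir_fd)
                links += 1
                if links > max_links:
                    raise OSError(errno.ELOOP, "too many symbolic links", path)
                if target.startswith("/"):
                    # Absolute target: restart resolution from the root.
                    root_fd = os.open("/", _DIR_FLAGS)
                    os.close(dir_fd)
                    dir_fd, resolved = root_fd, []
                pending.extend(c for c in target.split("/")[::-1] if c not in ("", "."))
                continue
            os.close(dir_fd)
            dir_fd = next_fd
            if name == "..":
                if resolved:
                    resolved.pop()
            else:
                resolved.append(name)
    finally:
        os.close(dir_fd)
    return "/" + "/".join(resolved)


def demo_expand():
    # tmp/link -> real, so tmp/link/dir should expand to tmp/real/dir.
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, "real", "dir"))
        os.symlink("real", os.path.join(tmp, "link"))
        got = expand_symlinks(os.path.join(tmp, "link", "dir"))
        return got == os.path.realpath(os.path.join(tmp, "real", "dir"))
```

Even this simplified version needs an open and a close per component plus a link counter, which gives a feel for the code-size concern.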
It will increase the binary size for virtually any CLI tools working with the filesystem. Hence WASI-level support is definitely desired.
To summarise, I believe that WASI needs extensions to support current working directory in a way compatible with native filesystem semantics. This is important for WASI to be a compelling target for building CLI tools. Filesystem is a basic service which most CLI tools need.
Filesystem is a shared resource therefore unrelated programs could make changes to the filesystem concurrently, including renaming our current working directory. Native getcwd yields an up to date path.
One underlying observation here is that any program depending on getcwd to yield an up to date path already has a TOCTOU problem: if other programs can rename the current working directory at any time, it means they could also rename it between a call to getcwd and any use of the resulting path.
POSIX itself recognizes the limitations of APIs like this, and added the openat family of functions to provide a robust alternative: open the directory, get a file descriptor, and use that, because file descriptors avoid TOCTOU errors in ways that string-based APIs like getcwd don't. With this, and the fact that "current working directory" implies process-wide mutable state, WASI's current approach is to work towards supporting chdir and getcwd in userspace, because they're important for compatibility, but not to bake them into the underlying system interface.
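To make the descriptor-based alternative concrete, here is a small Python sketch using the `dir_fd` parameter of `os.open`, which maps to openat on POSIX hosts. The helper names and the file layout are made up for the example.

```python
import os
import tempfile


def read_via_dir_fd(dir_fd, name):
    """Read a file relative to an already-open directory (openat-style)."""
    fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
    with os.fdopen(fd, "rb") as f:
        return f.read()


def demo_rename_safety():
    with tempfile.TemporaryDirectory() as tmp:
        work = os.path.join(tmp, "work")
        os.mkdir(work)
        with open(os.path.join(work, "f.txt"), "wb") as f:
            f.write(b"hi")
        dir_fd = os.open(work, os.O_RDONLY | os.O_DIRECTORY)
        try:
            # Someone renames the directory after we opened it.
            os.rename(work, os.path.join(tmp, "moved"))
            # The old string path is now stale, but the descriptor still works.
            stale = not os.path.exists(os.path.join(work, "f.txt"))
            return stale, read_via_dir_fd(dir_fd, "f.txt")
        finally:
            os.close(dir_fd)
```

The descriptor keeps referring to the same directory after the rename, whereas any path string captured earlier has to be revalidated before use.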
When using a shell, renaming the current directory should show the new path the next time the prompt is printed. The prompt may temporarily be outdated, but the next time it is drawn, the new name prevents potential confusion for the user. When deleting the current directory, this would also automatically append "(deleted)" to the shown path.
tl;dr: While it doesn't help much with TOCTOU errors, it does help a lot with visual presentation to the user.
@bjorn3
My sense is that shells are special. The assumption that directories-as-capabilities roughly lines up with how applications tend to utilize filesystems doesn't hold for shells, which want to be able to roam freely about the namespace. Shells also want to interact with the host in very particular ways, e.g. in the handling of SIGHUP. So my suggestion here is that we think about shells separately.
FWIW, I just tried bash, zsh, fish, and nu, and none of them had that behavior, at least out of the box :-}.
@bjorn3 @sunfishcode Thank you for the comments.
I've been checking other threads (ex. https://github.com/WebAssembly/WASI/issues/109). My takeaway was that
- WASI is fond of capabilities-based access to the filesystem;
- WASI should make implementing POSIX possible, but other semantics are also useful and this is something for the embedder to decide. Therefore WASI doesn't specify how WASI-flavoured POSIX should behave. It doesn't specifically optimise for making implementing POSIX easy. Runtimes are free to figure out reasonable semantics. While standardisation in this area is desired, it is not a WASI goal. (Ex: In Wasmer executables installed by the package manager come with a manifest; the author can request / and . to be mapped, creating a POSIX-like environment.)
The following proposal might be better aligned with the project goals:
- If the embedder finds the concept of the current working directory useful, she provides it via a pre-opened directory. How the fd number is discovered is left unspecified (could be via environment);
- New fd_path_get() API is used to discover the file path given any open directory descriptor.
Pros: in line with WASI design goals, makes implementing chdir, fchdir and getcwd easy.
Thoughts?
If some embedders choose to offer baked-in current-working-directory features, and every program that calls getcwd is compiled into code that depends on them, or if programs come to depend on finding a preopen for /, then programs won't work in embedders that don't choose to offer them, and it won't end up being a realistic choice. WASI isn't meant to be just a framework from which embedders create their own ABIs, it's also meant to be a common ABI across embedders.
It would be very helpful if you could describe specific functionality that depends on getcwd, chdir, and fchdir that you need here. If you need TOCTOU safety, then the observation is that getcwd isn't a robust solution anyway, and it's worth looking into openat, which WASI already supports. If you have many wasm modules and are concerned about code being duplicated across all of them, then when dynamic linking is available it'll provide a much more comprehensive solution.
@sunfishcode
It would be very helpful if you could describe specific functionality that depends on getcwd, chdir, and fchdir that you need here.
I want to share a C compiler with custom language extensions with the widest audience possible (https://github.com/rapidlua/barebone-c). In order to have any adoption, I need to ship prebuilt packages. WASI makes it significantly more convenient for me. Ideally, the WASI build of the tool should behave exactly the same as the native one does. Current working directory must be defined, relative and absolute paths must work. People don't typically rename directories from under a compiler running, therefore this particular difference might be insignificant after all.
I'd like to take a step back and generalise a bit. Supposedly, WASI is a compelling target for parties shipping CLI tools and unwilling to build multiple packages for a plethora of OS-es and hardware architectures. The major reason for Emscripten's success was that little to no changes were required to the source code. People shouldn't need to rewrite their code in order to benefit from WASI. In case of complex projects like LLVM, it's hard to judge what effects the slightly different semantics in the platform API will cause.
Therefore I feel that it is important to have WASI libc being as POSIX-ly compatible as possible. Personally, I don't need fchdir. But it is in POSIX hence should be eventually supported. With the current state of WASI it is complicated (consider getcwd after fchdir).
I am not particularly concerned with the code size. The reason I was linking to the other thread was to show that I'm actually making effort to understand WASI agenda.
WASI isn't meant to be just a framework from which embedders create their own ABIs, it's also meant to be a common ABI across embedders.
This complicates matters. What if the host's current working directory is not mapped?
You've mentioned that WASI already has openat. So what if I have /a and /a/b/c mapped from different host directories, and the file descriptor refers to a. Should openat(fd,"b/c") succeed? Imagine the amount of extra bookkeeping necessary to make it work if directories could get renamed (even by the program itself).
It would be very helpful if you could describe specific functionality that depends on getcwd, chdir, and fchdir that you need here.
I want to share a C compiler with custom language extensions with the widest audience possible (https://github.com/rapidlua/barebone-c). In order to have any adoption, I need to ship prebuilt packages. WASI makes it significantly more convenient for me. Ideally, the WASI build of the tool should behave exactly the same as the native one does. Current working directory must be defined, relative and absolute paths must work. People don't typically rename directories from under a compiler running, therefore this particular difference might be insignificant after all.
Thanks! Current working directory is being worked on, and relative and absolute paths already work in C/C++ APIs. One missing area if you want to run clang is fork/exec. I expect we will add APIs to spawn child processes to WASI, but they're not available yet.
I'd like to take a step back and generalise a bit. Supposedly, WASI is a compelling target for parties shipping CLI tools and unwilling to build multiple packages for a plethora of OS-es and hardware architectures. The major reason for Emscripten's success was that little to no changes were required to the source code. People shouldn't need to rewrite their code in order to benefit from WASI. In case of complex projects like LLVM, it's hard to judge what effects the slightly different semantics in the platform API will cause.
LLVM is an interesting example; it makes extensive use of #ifdefs to customize its behavior for a plethora of OS's. Compiling to Wasm throws away all of this hand-tuned battle-tested porting and optimization work, and puts the responsibility on WASI to do the same work with far less information about the program's intent. We can (and do) smooth over many OS differences, but without knowing the program's intent, full emulation can be prohibitively expensive. If slightly differing filesystem semantics are something you're concerned about, there's plenty to be concerned about in any Wasm-based approach.
It really helps us to hear real-world use cases, to help us make decisions about how best to support various features. If you want to do something and it doesn't work, isn't efficient enough in some setting, or isn't robust enough, we'd like to hear about it.
Therefore I feel that it is important to have WASI libc being as POSIX-ly compatible as possible. Personally, I don't need fchdir. But it is in POSIX hence should be eventually supported. With the current state of WASI it is complicated (consider getcwd after fchdir).
I expect it will be possible to add fchdir support, including fchdir followed by getcwd. There are some tradeoffs involved, so we're interested to hear from real-world use cases to guide these decisions.
What if the host's current working directory is not mapped?
POSIX says that getcwd sets ENOENT or EACCES if the directory is removed or not accessible.
You've mentioned that WASI already has openat. So what if I have /a and /a/b/c mapped from different host directories, and the file descriptor refers to a. Should openat(fd,"b/c") succeed? Imagine the amount of extra bookkeeping necessary to make it work if directories could get renamed (even by the program itself).
Yes, path resolution can cross "mount points".
We won't need any extra bookkeeping to support directories being renamed by other programs, because we hold file descriptors for our open directories which are stable across renames. I have a pretty good idea of what bookkeeping we'll need if we need to support programs renaming directories that they themselves have open and then chdir'ing into them.
With https://github.com/WebAssembly/wasi-libc/pull/214, wasi-libc now has basic emulation of getcwd/chdir.
I believe all of the questions here have been answered, but feel free to open up follow-up issues if there are other things to address.
Oh, if I knew this was planned, I would not have added this workaround to boost::filesystem. (Mainly commenting for the benefit of people who look at both of these issues in the future.)
I got a feature request in https://github.com/GoogleChromeLabs/wasi-fs-access/issues/2 to use the current working directory functionality.
In general, https://github.com/WebAssembly/wasi-libc/pull/214/files looks promising, but my understanding is that the current directory emulation is purely internal to wasi-libc?
The problem is, on https://wasi.rreverser.com/ I'm running each command in a separate short-lived Wasm instance. This allows using coreutils as-is (a single command per invocation is how it works on other platforms too), and it also allows controlling the terminal from JS rather than giving control over to the coreutils binary and handling terminal sequences from the Wasm side.
This means that, even if some command changes a current directory, I have no way of reading it back from a Wasm instance, nor any way of setting it as a current directory for the next one (as it always starts out with /).
So, I guess, the request for WASI itself to support "current directory" still stands - can we add syscalls that would allow saving and reusing cwd as part of the implementors' global state, rather than keep it limited to a Wasm instance?
@RReverser I can understand wanting to set the initial working directory to something other than "/" for new modules.
However reading the current working directory back out of a child process I don't think is needed for POSIX-like environments. There are no situations that I know of where a chdir in a child process has any effect on a parent process.
However reading the current working directory back out of a child process I don't think is needed for POSIX-like environments.
I guess that's fair, if we consider only POSIX-like use-cases.
I'd be actually okay with not having this part - I already special-case some commands in the emulator, and, indeed, would have to do the same for cd anyway, since it's not part of the coreutils but more of a shell command.
However, I can't think of a way around this part:
@RReverser I can understand wanting to set the initial working directory to something other than "/" for new modules.
You mentioned this yourself, above, but just to resurface - maybe wasi-libc could read PWD from environment table upon startup and chdir to that (or keep / if PWD is unset)?
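As a sketch of what that startup rule could look like (in Python for brevity; the function name `initial_cwd` is hypothetical, and rejecting relative PWD values is an extra assumption not stated in the suggestion):

```python
import posixpath


def initial_cwd(environ):
    """Pick the starting directory: PWD if set and absolute, otherwise "/"."""
    pwd = environ.get("PWD")
    if pwd and pwd.startswith("/"):
        # Lexical cleanup only; nothing is checked against the filesystem here.
        return posixpath.normpath(pwd)
    return "/"
```

An embedder that launches one short-lived instance per command could then carry the directory across invocations by setting PWD in the environment table of the next instance.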
That sounds like it would work. We have been trying not to rely on the environment as much as possible for core functionality. If we could find some way to make it opt-in so that not all libc-based programs would end up depending on getenv I think it could be acceptable.
Alternatively perhaps we could add a new preopen type. Right now we only have one: __WASI_PREOPENTYPE_DIR. We could perhaps add __WASI_PREOPENTYPE_PWD?
(regarding cd, yes that is one of those things like set and unset that are impossible to write (on UNIX anyway) as separate programs and are required to be shell builtins)
Alternatively perhaps we could add a new preopen type. Right now we only have one: __WASI_PREOPENTYPE_DIR. We could perhaps add __WASI_PREOPENTYPE_PWD?
That could work nicely, yeah, and we could even reuse existing types and functions for getting dir length & contents.
Should we reopen this issue for tracking & discussion for now?
Could you describe the use cases for this in more detail? https://github.com/GoogleChromeLabs/wasi-fs-access/issues/2 doesn't have much detail, and I'd like to understand how you envision programs would use this.