libs-team
std::os::unix::env::{argc, argv}
Proposal
Problem statement
When making FFI calls from Rust on UNIX targets, it's common to need NUL-terminated UTF-8 strings. The same is true of NUL-terminated wide-character strings for Windows FFI calls. If these strings come from environment variables or process arguments, then on both UNIX and Windows targets they already exist in the required format in the process's memory. Unfortunately, in today's Rust, there is no way in `std` to access them in their original format without paying for heap allocations, traversals, and/or syscalls.
Today in `std`, the only ways to access these values are via `VarsOs` and `ArgsOs`, both of which are iterators over `OsString` values. These strings are not in the original format; they have been reallocated and had their NUL terminators dropped, meaning that further allocations and conversions are necessary to get them back into their original form.
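To make that cost concrete, here is a minimal sketch of the round trip an FFI caller goes through today on UNIX. The helper name `to_c_strings` is made up for illustration; only the `std` calls inside it are real.

```rust
use std::ffi::{CString, OsString};
use std::os::unix::ffi::OsStringExt;

/// Hypothetical helper: rebuilds NUL-terminated strings from `OsString`
/// values, which is what FFI callers have to do today.
fn to_c_strings<I>(args: I) -> Vec<CString>
where
    I: IntoIterator<Item = OsString>,
{
    args.into_iter()
        // `into_vec` hands over the OsString's buffer. `CString::new` then
        // scans it for interior NULs and appends the terminator, which may
        // reallocate.
        .map(|arg| CString::new(arg.into_vec()).expect("argument contains an interior NUL"))
        .collect()
}
```

A caller would write something like `to_c_strings(std::env::args_os())` and then pass each `CString::as_ptr()` to the C API, even though the same bytes already sit NUL-terminated in the process's memory.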
On Windows, these allocations and conversions can be avoided through an `unsafe` direct FFI call to `GetCommandLineW`. There is an equivalent on some UNIX systems (e.g. macOS), but on others there is no direct FFI call which exposes these values. The only way to access them there is through syscalls, such as reading `/proc/self/cmdline` on Linux or calling `sysctlbyname` on FreeBSD.
Motivating examples or use cases
I have a command-line application which:
- Reads command-line arguments and environment variables to decide what operation to do
- Also commonly reads filenames from both command-line arguments and environment variables
- Also commonly passes those filenames directly to OS API calls which expect them in the original format (NUL-terminated strings in the OS's native encoding)
Today, there is no zero-cost way to access these in Rust; the lowest-cost way that's available on each of these OSes is:
- `GetCommandLineW` on Windows
- `_NSGetArgc` and `_NSGetArgv` on macOS
- On Linux, read `/proc/self/cmdline`; this requires a syscall to access
- On FreeBSD, use `sysctlbyname`, which also requires a syscall
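For reference, the Linux workaround from the list above might look roughly like this. The function names `read_own_cmdline` and `split_cmdline` are hypothetical; the sketch assumes the usual `/proc/<pid>/cmdline` layout, where every argument is followed by a NUL byte.

```rust
use std::fs;
use std::io;

/// Reads the raw argument block of the current process. This opens and
/// reads a procfs file, so it costs syscalls plus a heap allocation.
fn read_own_cmdline() -> io::Result<Vec<u8>> {
    fs::read("/proc/self/cmdline")
}

/// Splits a raw `cmdline` buffer into one byte slice per argument.
fn split_cmdline(raw: &[u8]) -> Vec<&[u8]> {
    if raw.is_empty() {
        return Vec::new();
    }
    // Drop the final terminator so it does not yield a trailing empty item.
    let body = match raw.split_last() {
        Some((&0, rest)) => rest,
        _ => raw,
    };
    // Empty arguments (e.g. `prog ""`) are kept as empty slices.
    body.split(|&b| b == 0).collect()
}
```

Even in this best case the program pays for a file read and a copy to obtain data that it was handed at startup.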
Solution sketch
Introduce these OS-specific functions in a new module, `std::os::unix::env`:
fn argc() -> usize;
fn argv() -> *const *const c_char;
These functions would read from the atomics in which `std` already stores `argc` and `argv` on UNIX, which is why they do not need to take `&self`. Today, these atomics are not exposed, and there is no direct FFI-based workaround to access the values they hold. That's in part because they rely on non-standard `link_section` extensions. So there's no way to implement these functions in a crate outside `std` today.
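As a rough illustration of the shape this could take, the sketch below stores the two values in atomics and reads them back through free functions. It is not the actual `std` code: the static names, the `record_args` hook, and the `Relaxed` orderings are assumptions made for the example.

```rust
use std::ffi::c_char;
use std::ptr;
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

// Hypothetical process-wide storage, filled in once at startup.
static ARGC: AtomicUsize = AtomicUsize::new(0);
static ARGV: AtomicPtr<*const c_char> = AtomicPtr::new(ptr::null_mut());

/// Hypothetical startup hook: records the values handed to the entry point.
fn record_args(argc: usize, argv: *const *const c_char) {
    ARGC.store(argc, Ordering::Relaxed);
    ARGV.store(argv as *mut *const c_char, Ordering::Relaxed);
}

/// Returns the recorded argument count (0 if nothing was recorded).
fn argc() -> usize {
    ARGC.load(Ordering::Relaxed)
}

/// Returns the recorded `argv` pointer (null if nothing was recorded).
fn argv() -> *const *const c_char {
    ARGV.load(Ordering::Relaxed) as *const *const c_char
}
```

Because the state is global, the accessors are plain functions with no receiver, and reading them is just two atomic loads.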
For symmetry, it would seem reasonable to introduce this function in a new OS-specific module, `std::os::windows::env`:
fn args_widechar() -> *const *const u16;
This would be implemented as a call to `GetCommandLineW`, and would only be there for symmetry with the proposed `std::os::unix::env`, so that Windows programs wouldn't need FFI to do something that UNIX programs could do using `std`.
Alternatives
These functions could use `CStr` instead of `*const c_char`, but then they would have to be `unsafe` because `CStr` requires that the pointers be non-null, which is not a guarantee in this case. Additionally, since the motivation for this is FFI, the `CStr`s would likely need to be converted into `*const c_char`s anyway, so overall `CStr` seems both unsafe and unhelpful here.
It might sound reasonable to have a function which returns a slice instead of separate functions for `argc` and `argv`. However, as a comment in the current UNIX args implementation notes, `argc` is not necessarily an accurate length for `argv`, meaning that building a safe slice would require traversing `argv` until a null pointer is encountered, which would be undesirable given that the motivation for this use case is to avoid overhead.
As an alternative, it could make sense to have an `Iterator` which iterates over `argv` until it encounters a null, and uses `argc` for a `size_hint` only. Another alternative would be to use `Option<NonNull<...>>` instead of `*const`, to emphasize that any of the pointers could be null. However, in FFI use cases, the FFI APIs will be asking for raw pointers, so having access to the raw pointers is more helpful than having an `Option<NonNull<...>>`, and especially more helpful than an iterator.
So it seems like the minimal proposal here would be to expose the raw pointers, and then optionally an iterator convenience method could be discussed on top of that.
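For discussion purposes, the iterator convenience could be sketched as follows. `ArgvIter` is a hypothetical name and not part of this proposal; the sketch only shows that iteration stops at the null terminator and that `argc` feeds nothing but `size_hint`.

```rust
use std::ffi::c_char;

/// Hypothetical iterator over a null-terminated `argv` array.
struct ArgvIter {
    next: *const *const c_char,
    hint: usize,
}

impl ArgvIter {
    /// # Safety
    /// `argv` must be null or point to an array of pointers that ends with a
    /// null pointer, and the array must stay valid and unmodified while the
    /// iterator is in use.
    unsafe fn new(argc: usize, argv: *const *const c_char) -> Self {
        ArgvIter { next: argv, hint: argc }
    }
}

impl Iterator for ArgvIter {
    type Item = *const c_char;

    fn next(&mut self) -> Option<Self::Item> {
        if self.next.is_null() {
            return None;
        }
        // SAFETY: `new` requires a valid, null-terminated array, and the
        // cursor never moves past the terminator.
        let arg = unsafe { *self.next };
        if arg.is_null() {
            return None;
        }
        // SAFETY: the current slot was not the terminator, so the next slot
        // is still inside the array.
        self.next = unsafe { self.next.add(1) };
        self.hint = self.hint.saturating_sub(1);
        Some(arg)
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // `argc` is unverified, so this is only a best-effort estimate for
        // preallocation; termination never depends on it.
        (self.hint, None)
    }
}
```

Note that the constructor still has to be `unsafe`, which is one reason the raw-pointer functions look like the smaller first step.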
Links and related work
There are various OS-specific functions in `std::os` already, like `std::os::unix::fs::chown`.
Related threads:
- https://users.rust-lang.org/t/direct-access-to-argc-and-argv/99475/1
- https://internals.rust-lang.org/t/pre-rfc-std-argc-argv/20086
Related: on Linux at least, common programs (e.g. `sshd`) are known to write over their `argv` strings, since that's how they change the name they show up as in the list of processes.
Note that if `argv` is mutated while Rust is collecting the arguments into a `Vec`, then bad things can happen. On platforms that don't allow getting `argv`/`argc` except via `main`, this is currently mitigated by std keeping them to itself (i.e. it takes ownership).
If exposing these publicly, we would at a minimum want to strongly warn against mutating globally shared resources.
> If exposing these publicly, we would at a minimum want to strongly warn against mutating globally shared resources.
Typing `argv` as `*const *const c_char` IMHO already suggests these aren't mutable references. A docs note is still useful, but only so people know that `argc` may not equal `(argv..).take_while(|p| !p.is_null()).count()`, as per the docs comment mentioned above.
We discussed this in last week's libs-api meeting, but we didn't reach a consensus.
The main argument against adding these is the unclear ownership of the data these pointers point at.
Should `argv()` return `*mut *mut` (rather than `*const *const`) to match the type in C, since one of the possible use cases is overwriting the data (as mentioned by @programmerjake)? In that case, how could we document the safety requirements? Would we guarantee it's fine unless `std::env::args[_os]` is used? Is that future proof?
It might seem like this can all be avoided by making `argv()` return `*const *const` (as proposed in this ACP), to make it clear these are not mutable (as also suggested by @dead-claudia). However, that would prevent us from ever adding something like `std::env::set_process_name()` (or `std::os::linux::set_process_name()` or whatever), since that could race with any use of those `*const` argv pointers.
One way of looking at it: `std` has basically "taken ownership" of argc+argv. Perhaps it'd be cleaner to have a way to release ownership or to intercept them before it takes ownership. At least then the ownership story is clearer.
In the meeting we were wondering if your problem could be solved using a (future) language feature that allows writing your own (C-style) entry point that takes the original `argc` and `argv` from libc (the entry point that is normally provided by `std` that then calls your `main`). Then you'd be able to do with those argc+argv whatever you want and pass them on to something like `std::initialize_runtime` after you're done with them, passing on ownership to `std`.
I personally think that having `std::os::unix::env::{argc, argv}` as proposed is fine, as long as we find a way to clearly document when these can be used safely. I guess they can never be used safely within a (safe) library, since it cannot know what other threads are doing. I'd be curious to see what the safety documentation on `argv()` would look like.
> In the meeting we were wondering if your problem could be solved using a (future) language feature that allows writing your own (C-style) entry point that takes the original `argc` and `argv` from libc (the entry point that is normally provided by `std` that then calls your `main`). Then you'd be able to do with those argc+argv whatever you want and pass them on to something like `std::initialize_runtime` after you're done with them, passing on ownership to `std`.
For my use case, that would work great! I kind of assumed something like that would be an unreasonably large change to propose. 😄
If I understand correctly, that design would also work with `no_std`, yeah? In that you'd just write `main` that way and then decline to run `std::initialize_runtime` (since it wouldn't be available).
You can already write your own C `main` function on stable with the `#![no_main]` attribute:
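#![no_main]

use std::ffi::{c_char, c_int};

// C-style entry point: the C runtime calls this directly with argc/argv.
#[no_mangle]
extern "C" fn main(argc: c_int, argv: *mut *mut c_char) -> c_int {
    0
}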
The only downside is that you skip some of the initialization code normally run by the standard library, but this initialization code is optional (Rust shared libraries work fine without this code).
> The only downside is that you skip some of the initialization code normally run by the standard library, but this initialization code is optional (Rust shared libraries work fine without this code).
Yeah, unfortunately executables do need it (at least as far as I know!)
Executables do not need to use the standard library entry point. See https://github.com/rust-lang/rust/blob/d31b6fb8c06b43536ac5be38462d2a55784e2199/library/std/src/sys/pal/unix/mod.rs#L43 if you're interested in what it does on *nix platforms.
Minor Windows note: I think the type would be `fn args_widechar() -> *const u16`, because Windows command-line args are a single string. Splitting them into C-style argc/argv is performed by `CommandLineToArgvW()` if desired.