youki
youki copied to clipboard
Fix --preserve-fds, eliminate stray FD being passed into container
https://github.com/containers/youki/pull/2663 seems to have broken preserve-fds by unconditionally closing everything aside from stdio in the container main process.
I've fixed this by effectively reverting part of that PR. Typically I'd expect it to be a bad idea to revert security fixes, but unfortunately I don't really understand much of the diagnosis at https://github.com/containers/youki/pull/2663 and can't get it to stack up against my observations of previous and current behavior of youki.
Specifically:
youki does leak a directory file descriptor from the host mount namespace, but it just so happens that the directory is the rootfs of the container
I checked out 04f8f2d79d4355af30294b5c581fe040ed96aa42 (the commit before the fix PR was merged) and couldn't reproduce this. With a default config generated by youki spec and a command of "ls", "-al", "/proc/self/fd" I get the following output:
dr-x------ 2 root root 0 Aug 24 20:17 .
dr-xr-xr-x 9 root root 0 Aug 24 20:17 ..
lrwx------ 1 root root 64 Aug 24 20:17 0 -> /dev/pts/9
lrwx------ 1 root root 64 Aug 24 20:17 1 -> /dev/pts/9
lrwx------ 1 root root 64 Aug 24 20:17 2 -> /dev/pts/9
ls: /proc/self/fd/3: cannot read link: No such file or directory
lr-x------ 1 root root 64 Aug 24 20:17 3
This is what I'd expect - preserve fds (https://github.com/containers/youki/pull/177) already closes all the FDs in the init process before execution. FD 3 here is ls opening the directory.
I can reproduce a leak by passing --preserve-fds 10 (I get something at FD 7, which seems to be the rootfs being referenced) - but the fix solves this by just making the preserve-fds flag useless. I fix this by actually closing the culprit directory, and then additionally verifying by hand that passing --preserve-fds 100 does not list any other stray FDs that are being passed down.
In addition, no care is taken to make sure all non-stdio files are O_CLOEXEC
Again I don't understand this - the preserve-fds PR does exactly that by marking everything as O_CLOEXEC before process execution? It is not ideal that passing --preserve-fds leaks something, but I've fixed that in this PR in a more precise way.
In addition, any file descriptors passed to youki are not closed until the container process is executed, meaning that easily-overlooked programming errors by users of youki can lead to these attacks becoming exploitable.
I also don't understand this - the preserve fds PR sets the O_CLOEXEC flag in the init process, before execing the user process. It seems better to do this as late as possible, so fewer file descriptors can slip in? The vulnerability fix puts it in the main process, giving ample opportunity for other FDs to be opened and leaked.
Perhaps there's more detail that would help my confusion in the review in the private repository that would help? Per:
This PR has already been approved by jprendes rumpl yihuaf in private repository.
Edit: oops, sorry for the mentions, those were unintentional
@aidanhs , failing oci CI -
Running default/default.t
failed to create the container
ERROR libcontainer::process::container_init_process: failed to cleanup extra fds err=Nix(EPERM)
ERROR libcontainer::process::container_intermediate_process: failed to initialize container process: failed syscall
ERROR libcontainer::process::container_main_process: failed to wait for init ready: exec process failed with error error in executing process : failed syscall
ERROR libcontainer::container::builder_impl: failed to run container process exec process failed with error error in executing process : failed syscall
ERROR youki: error in executing command: exec process failed with error error in executing process : failed syscall
Stack backtrace:
0: <unknown>
1: <unknown>
2: <unknown>
3: <unknown>
4: <unknown>
error in executing command: exec process failed with error error in executing process : failed syscall
Stack backtrace:
0: <unknown>
1: <unknown>
2: <unknown>
3: <unknown>
4: <unknown>
Error: exec process failed with error error in executing process : failed syscall
Stack backtrace:
0: <unknown>
1: <unknown>
2: <unknown>
3: <unknown>
4: <unknown>
exit status 1
error: Recipe `test-oci` failed on line 54 with exit code 1
The failure was a problem with some changes that I made to try and make closing FDs happen later - unfortunately seccomp got in the way. I've now backed out those changes as strictly speaking they were unnecessary.
Have run the tests locally and they seem to pass.
There are now three integration tests:
- that youki itself is not leaking FDs (i.e. guarding against a hypothetical future vulnerability)
- that preserve-fds actually works
- that open FDs are not passed down without preserve-fds
This was harder than I expected because:
- I needed to extend the test harness to permit customisation of arguments to container create (I did a small amount of tidying here where another test wanted to do this too, but I didn't want to get too distracted)
- because this testing is sensitive to open FDs, it cannot be run in parallel with other tests - I have had to extend the test harness to support non-parallel execution
- the
--preserve-fdsflag does not get passed via the spec so I needed to communicate state via the filesystem
@aidanhs May I ask you to resolve the conflicts? Others look good to me.
Hey, I'll need some time to check the changes done in test functions. probably till this weekend.
Also a recent merged PR extends the test_inside_container function to allow passing extra args, I think @aidanhs you should rebase/merge main resolve conflicts and use the updated function to pass the extra args.
Have rebased and now use the new CreateOptions struct.
Hey just a couple of comments here :
- @aidanhs , thanks for following up on this! There was a minor conflict, which I have fixed and pushed, please check once that my last two commits are fine.
- @utam0k I feel this is ok to merge, but given that this is related to security code we did a while back (even though I agree with the argument in the description) if you want to do a more careful scrutiny of this, we can merge this after we release 0.5.0. Right now I have done a basic check for the logic change, and checked the tests.
- Finally for myself - we really need to refactor the test running logic here. That is definitely out of scope for this PR, but we need to clean things up, probably remove the crossbeam completely now that rust natively supports scoped threads. Also there are a lot of extra println which clutter the logs, a couple of tests should be conditional test and are not, and a couple of tests are "leaking" resources on host system which causes issues when re-running the test. I need to do a better job at reviewing and fix these issues up.
The two commits look fine.
Personally I'd have left in the TODO at https://github.com/youki-dev/youki/pull/2893/commits/f302ea0caa8d59f7518fe0173e824114297287d4#diff-d45900789a82e7875774e0c4bdae819850026ea66be9d25fa96a4b5d736ecdf7L21, modifying it to say "the CreateOptions argument could be used to pass the pidfile" - I added that comment because I saw the test could be cleaned up to use existing test functions, but I didn't want to spend effort on it for this PR so a note for future visitors.
But I don't mind either way!
I feel this is ok to merge, but given that this is related to security code we did a while back (even though I agree with the argument in the description) if you want to do a more careful scrutiny of this, we can merge this after we release 0.5.0. Right now I have done a basic check for the logic change, and checked the tests.
@yihuaf Could you mind double-checking this PR for sure?
I also don't understand this - the preserve fds PR sets the O_CLOEXEC flag in the init process, before execing the user process. It seems better to do this as late as possible, so fewer file descriptors can slip in?
You are correct. The preserve-fds logic should be execute as late as possible, because we want to cover all the cases where fds can be opened during all the previous setup action.
Further, your analysis is correct. The preserve-fds logic works correctly before the fix. Leaking the newroot fds when pivot_root is valid and this PR fixed it. If a caller pass in preserve-fds that allows the problematic fds to be leaked, I don't think we need to guard against the situation (and am not sure if it is possible).
Thanks all!