runc mount: add enhanced mount functionality to support running containers in a userns with host network

In public cloud service products, the serverless container runtime environment has some special characteristics.

The container often runs on a separate kernel,
so runc works in host network mode.
The container is unprivileged and does not have cap_sys_admin.
Many data acceleration projects use mount.fuse to provide a mount point through which the app accesses its data.

The goal is to run the container in a new userns in host network mode. Before runc switches to the new userns, the main process uses the open_tree syscall to get fds for the sys, proc, and mqueue mount points. After runc switches to the new userns, it uses move_mount to mount sys, proc, and mqueue.
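The two-step flow described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the helper names `openTree` and `moveMount` are hypothetical, the syscall numbers are hardcoded on the assumption of x86_64 or arm64, and the userns switch itself is omitted.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// Syscall numbers for the new mount API (Linux 5.2+).
// Assumption: x86_64 or arm64, where these numbers are identical.
const (
	sysOpenTree  = 428
	sysMoveMount = 429
)

// Flag values as defined in the Linux UAPI headers.
const (
	openTreeClone       = 0x1    // OPEN_TREE_CLONE: return a detached clone of the mount
	atRecursive         = 0x8000 // AT_RECURSIVE: clone the whole subtree
	moveMountFEmptyPath = 0x4    // MOVE_MOUNT_F_EMPTY_PATH: source is the fd itself
	atFdcwd             = -100   // AT_FDCWD
)

// openTree (hypothetical helper) wraps open_tree(2) and returns a mount fd.
func openTree(dirfd int, path string, flags int) (int, error) {
	p, err := syscall.BytePtrFromString(path)
	if err != nil {
		return -1, err
	}
	fd, _, errno := syscall.Syscall(sysOpenTree,
		uintptr(dirfd), uintptr(unsafe.Pointer(p)), uintptr(flags))
	if errno != 0 {
		return -1, errno
	}
	return int(fd), nil
}

// moveMount (hypothetical helper) attaches the mount behind fromFd at toPath
// using move_mount(2).
func moveMount(fromFd int, toPath string) error {
	empty, _ := syscall.BytePtrFromString("")
	to, err := syscall.BytePtrFromString(toPath)
	if err != nil {
		return err
	}
	toDfd := atFdcwd
	_, _, errno := syscall.Syscall6(sysMoveMount,
		uintptr(fromFd), uintptr(unsafe.Pointer(empty)),
		uintptr(toDfd), uintptr(unsafe.Pointer(to)),
		moveMountFEmptyPath, 0)
	if errno != 0 {
		return errno
	}
	return nil
}

func main() {
	// Step 1: while still privileged, clone the mount into an fd.
	fd, err := openTree(atFdcwd, "/proc", openTreeClone|atRecursive|syscall.O_CLOEXEC)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open_tree:", err) // e.g. EPERM without CAP_SYS_ADMIN
		return
	}
	defer syscall.Close(fd)

	// (The switch to the new userns would happen here; omitted in this sketch.)

	// Step 2: attach the cloned mount at the target given on the command line.
	if len(os.Args) > 1 {
		if err := moveMount(fd, os.Args[1]); err != nil {
			fmt.Fprintln(os.Stderr, "move_mount:", err)
		}
	}
}
```

Note that `OPEN_TREE_CLONE` yields a clone of an existing mount, so the fd refers to the host's procfs instance rather than a freshly created one.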

Sep 27 '22 06:09 shidao1

We need test cases for this.

Sep 28 '22 18:09 kolyshkin

Also, I'm afraid you'll have to redo this once #3599 is merged, which refactors some C code in nsenter.

Sep 28 '22 22:09 kolyshkin

> Also, I'm afraid you'll have to redo this once #3599 is merged, which refactors some C code in nsenter.

Thanks for the reminder. We also noticed PR #3599, so we opened this PR :) We will wait for #3599 to settle down first.

Sep 29 '22 02:09 jiangliu

This PR seems to have been inactive for a long time. May I take over the remaining work to get it ready to merge? cc @AkihiroSuda

Nov 12 '23 05:11 Zheaoli

@Zheaoli You'll need to base it on top of #3985, which reworks all of the mountfd logic. I'm not sure how easy it'll be to use the new Go-based setup to implement this though. I suspect you can do it by creating a locked goroutine that joins the container's non-userns namespaces, but the slight issue is that we cannot create a procfs mount that uses the container's pidns because procfs uses the active pidns, not the for_children one (in fact, I'm not sure this PR handles procfs correctly).
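For illustration, the "locked goroutine" pattern might look roughly like the sketch below. Both `runInLockedThread` and `joinContainerNamespaces` are hypothetical names, and the namespace join is only a placeholder; this is not how #3985 implements anything.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

// runInLockedThread (hypothetical helper) runs fn on a goroutine that is
// pinned to its own OS thread. The goroutine never calls
// runtime.UnlockOSThread, so the Go runtime terminates the thread when the
// goroutine exits instead of handing a namespace-modified thread to other
// goroutines.
func runInLockedThread(fn func() error) error {
	errCh := make(chan error, 1)
	go func() {
		runtime.LockOSThread()
		errCh <- fn()
	}()
	return <-errCh
}

// joinContainerNamespaces is a placeholder. A real version would call
// setns(2) on fds for the container's non-userns namespaces.
func joinContainerNamespaces() error {
	return errors.New("setns step not implemented in this sketch")
}

func main() {
	err := runInLockedThread(func() error {
		if err := joinContainerNamespaces(); err != nil {
			return fmt.Errorf("join namespaces: %w", err)
		}
		// Mount setup would run here, still on the same OS thread.
		return nil
	})
	fmt.Println("result:", err)
}
```

The point of the design is that any per-thread state changed by setns(2) stays confined to one thread, which is discarded afterwards.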

Also, you don't want to use open_tree(2) like this -- a much better way is to use fsopen and fsconfig to configure the mount without touching the filesystem, and thus having an anonymous mountfd that you can then provide to the container. (To be fair, I'm not sure if the permissions work out okay with user namespaces in that case.)
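As a rough sketch of that approach: fsopen(2) creates a filesystem context, fsconfig(2) configures and creates the superblock, and fsmount(2) then turns the context into a detached mount fd. The helper name `newMountFd` is hypothetical, the syscall numbers assume x86_64 or arm64, and whether this works from inside a user namespace is exactly the open question above.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// Syscall numbers for the new mount API (Linux 5.2+).
// Assumption: x86_64 or arm64, where these numbers are identical.
const (
	sysFsopen   = 430
	sysFsconfig = 431
	sysFsmount  = 432
)

// Flag and command values as defined in <linux/mount.h>.
const (
	fsopenCloexec     = 0x1 // FSOPEN_CLOEXEC
	fsconfigCmdCreate = 6   // FSCONFIG_CMD_CREATE
	fsmountCloexec    = 0x1 // FSMOUNT_CLOEXEC
)

// newMountFd (hypothetical helper) returns a detached mount fd for a fresh
// instance of fstype, without attaching it anywhere in the filesystem tree.
func newMountFd(fstype string) (int, error) {
	name, err := syscall.BytePtrFromString(fstype)
	if err != nil {
		return -1, err
	}

	// fsopen(2): create a filesystem configuration context.
	fsfd, _, errno := syscall.Syscall(sysFsopen,
		uintptr(unsafe.Pointer(name)), fsopenCloexec, 0)
	if errno != 0 {
		return -1, fmt.Errorf("fsopen %q: %w", fstype, errno)
	}
	defer syscall.Close(int(fsfd))

	// fsconfig(2): no options are set here; just create the superblock.
	_, _, errno = syscall.Syscall6(sysFsconfig,
		fsfd, fsconfigCmdCreate, 0, 0, 0, 0)
	if errno != 0 {
		return -1, fmt.Errorf("fsconfig create: %w", errno)
	}

	// fsmount(2): obtain the anonymous (detached) mount fd.
	mfd, _, errno := syscall.Syscall(sysFsmount, fsfd, fsmountCloexec, 0)
	if errno != 0 {
		return -1, fmt.Errorf("fsmount: %w", errno)
	}
	return int(mfd), nil
}

func main() {
	mfd, err := newMountFd("proc")
	if err != nil {
		fmt.Fprintln(os.Stderr, err) // e.g. EPERM without CAP_SYS_ADMIN
		return
	}
	defer syscall.Close(mfd)
	fmt.Println("got detached mount fd:", mfd)
}
```

The resulting fd could later be attached with move_mount(2) using MOVE_MOUNT_F_EMPTY_PATH.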

Nov 16 '23 00:11 cyphar