Runtime fails to mount /sys when --tpuproxy is provided
Description
I'm testing out TPU support with the runsc Docker shim. When I use runsc normally, everything works fine, but with --tpuproxy it fails to mount /sys. This is surprising to me because the mount definitely exists on the host.
cc @thundergolfer
Steps to reproduce
I've configured docker to use my custom runsc script:
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ cat /etc/docker/daemon.json
{
"bip": "169.254.123.1/24",
"runtimes": {
"runsc": {
"path": "/home/peyton/tputesting/runsc-wrapper.sh"
}
}
}
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ cat runsc-wrapper.sh
#!/bin/bash
exec /usr/local/bin/runsc --tpuproxy "$@"
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ docker run --rm --runtime=runsc busybox echo "Hello from busybox"
docker: Error response from daemon: OCI runtime start failed: starting container: starting root container: starting sandbox: failed to setupFS: mounting submounts: mount submount "/sys": failed to mount "/sys" (type: sysfs): no such file or directory, opts: &{{true false false false} true {true 0xc000871710} false}: unknown.
This happens despite the mount existing:
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ mount | grep sys
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755,inode64)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=28,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=24590)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,nosuid,nodev,noexec,relatime)
I'm using a v5lite TPU:
TPU type v5litepod-1
TPU software version tpu-vm-base
runsc version
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ runsc -version
runsc version release-20240807.0
spec: 1.1.0-rc.1
docker version (if using docker)
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ docker version
Client: Docker Engine - Community
Version: 20.10.16
API version: 1.41
Go version: go1.17.10
Git commit: aa7e414
Built: Thu May 12 09:17:23 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.16
API version: 1.41 (minimum version 1.12)
Go version: go1.17.10
Git commit: f756502
Built: Thu May 12 09:15:28 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.4
GitCommit: 212e8b6fa2f44b9c21b2798135fc6fb7c53efc16
runc:
Version: 1.1.1
GitCommit: v1.1.1-0-g52de29d
docker-init:
Version: 0.19.0
GitCommit: de40ad0
uname
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ uname -a
Linux t1v-n-901fc2b8-w-0 5.13.0-1027-gcp #32~20.04.1-Ubuntu SMP Thu May 26 10:53:08 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
runsc debug logs (if available)
I0819 18:02:32.801671 23344 main.go:201] **************** gVisor ****************
D0819 18:02:32.801687 23344 state_file.go:77] Load container, rootDir: "/var/run/docker/runtime-runc/moby", id: {SandboxID: ContainerID:a7ffa53188c9b460eb73a41146e8654bb60532ec9c88cbb8769055edf2003ed0}, opts: {Exact:false SkipCheck:false TryLock:false RootContainer:false}
D0819 18:02:32.802692 23344 sandbox.go:1891] ContainerRuntimeState, sandbox: "a7ffa53188c9b460eb73a41146e8654bb60532ec9c88cbb8769055edf2003ed0", cid: "a7ffa53188c9b460eb73a41146e8654bb60532ec9c88cbb8769055edf2003ed0"
D0819 18:02:32.802708 23344 sandbox.go:688] Connecting to sandbox "a7ffa53188c9b460eb73a41146e8654bb60532ec9c88cbb8769055edf2003ed0"
D0819 18:02:32.802767 23344 urpc.go:571] urpc: successfully marshalled 124 bytes.
D0819 18:02:32.803072 23344 urpc.go:614] urpc: unmarshal success.
D0819 18:02:32.803098 23344 sandbox.go:1896] ContainerRuntimeState, sandbox: "a7ffa53188c9b460eb73a41146e8654bb60532ec9c88cbb8769055edf2003ed0", cid: "a7ffa53188c9b460eb73a41146e8654bb60532ec9c88cbb8769055edf2003ed0", state: 1
W0819 18:02:32.803327 23344 specutils.go:123] AppArmor profile "docker-default" is being ignored
W0819 18:02:32.803340 23344 specutils.go:129] noNewPrivileges ignored. PR_SET_NO_NEW_PRIVS is assumed to always be set.
D0819 18:02:32.803350 23344 container.go:427] Start container, cid: a7ffa53188c9b460eb73a41146e8654bb60532ec9c88cbb8769055edf2003ed0
D0819 18:02:32.803364 23344 sandbox.go:394] Start root sandbox "a7ffa53188c9b460eb73a41146e8654bb60532ec9c88cbb8769055edf2003ed0", PID: 23277
D0819 18:02:32.803370 23344 sandbox.go:688] Connecting to sandbox "a7ffa53188c9b460eb73a41146e8654bb60532ec9c88cbb8769055edf2003ed0"
I0819 18:02:32.803393 23344 network.go:55] Setting up network
I0819 18:02:32.803452 23344 namespace.go:108] Applying namespace network at path "/proc/23277/ns/net"
D0819 18:02:32.803721 23344 network.go:300] Setting up network channels
D0819 18:02:32.803737 23344 network.go:303] Creating Channel 0
D0819 18:02:32.803755 23344 network.go:334] Setting up network, config: {FilePayload:{Files:[0xc000350e00]} LoopbackLinks:[{Name:lo Addresses:[127.0.0.1/8] Routes:[{Destination:{IP:127.0.0.0 Mask:ff000000} Gateway:<nil>}] GVisorGRO:false}] FDBasedLinks:[{Name:eth0 InterfaceIndex:0 MTU:1500 Addresses:[169.254.123.3/24] Routes:[{Destination:{IP:169.254.123.0 Mask:ffffff00} Gateway:<nil>}] GSOMaxSize:65536 GVisorGSOEnabled:false GVisorGRO:false TXChecksumOffload:false RXChecksumOffload:true LinkAddress:02:42:a9:fe:7b:03 QDisc:fifo Neighbors:[] NumChannels:1 ProcessorsPerChannel:0}] XDPLinks:[] Defaultv4Gateway:{Route:{Destination:{IP:0.0.0.0 Mask:00000000000000000000ffff00000000} Gateway:169.254.123.1} Name:eth0} Defaultv6Gateway:{Route:{Destination:{IP:<nil> Mask:<nil>} Gateway:<nil>} Name:} PCAP:false LogPackets:false NATBlob:false DisconnectOk:false}
D0819 18:02:32.803914 23344 urpc.go:571] urpc: successfully marshalled 946 bytes.
D0819 18:02:32.805693 23344 urpc.go:614] urpc: unmarshal success.
I0819 18:02:32.805706 23344 namespace.go:129] Restoring namespace network
D0819 18:02:32.805727 23344 urpc.go:571] urpc: successfully marshalled 112 bytes.
D0819 18:02:32.818442 23344 urpc.go:614] urpc: unmarshal success.
W0819 18:02:32.818469 23344 util.go:64] FATAL ERROR: starting container: starting root container: starting sandbox: failed to setupFS: mounting submounts: mount submount "/sys": failed to mount "/sys" (type: sysfs): no such file or directory, opts: &{{true false false false} true {true 0xc00071bad0} false}
Hi Peyton, thanks for reporting this bug. The error you're seeing is likely happening while building the internal gVisor sysfs, not because /sys doesn't exist on the host. When --tpuproxy is enabled, the sandbox builds a mirror of the host PCI directories located in sysfs. The userspace TPU driver relies on the presence of these files to get information about the TPU hardware (version, topology, etc.) running on the host. Can you show me what you get when you run ls -l /sys/bus/pci/devices in your VM?
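A quick way to sanity-check this on the host (a hedged sketch; the exact set of sysfs paths gVisor mirrors is an assumption based on the description above, not taken from its source):

```shell
# check_path is a helper for this sketch. The paths probed below are ones a
# PCI sysfs mirror would plausibly need to walk (an assumption, not gVisor's
# actual list).
check_path() {
  if [ -e "$1" ]; then
    echo "present: $1"
  else
    echo "missing: $1"
  fi
}
check_path /sys/devices
check_path /sys/bus/pci/devices
check_path /sys/class/vfio-dev
```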
iirc, you can't run tpuproxy via exec /usr/local/bin/runsc --tpuproxy "$@".
The analogous setup works for nvproxy because nvidia-container-runtime is directly compatible with the --gpus flag implemented by the Docker CLI.
Nothing equivalent has been implemented for tpuproxy, so TPU devices are not accessible in your Docker container.
@manninglucas sure thing here it is:
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ ls -l /sys/bus/pci/devices
total 0
lrwxrwxrwx 1 root root 0 Aug 19 16:27 0000:00:00.0 -> ../../../devices/pci0000:00/0000:00:00.0
lrwxrwxrwx 1 root root 0 Aug 19 16:27 0000:00:01.0 -> ../../../devices/pci0000:00/0000:00:01.0
lrwxrwxrwx 1 root root 0 Aug 19 16:27 0000:00:01.3 -> ../../../devices/pci0000:00/0000:00:01.3
lrwxrwxrwx 1 root root 0 Aug 19 16:27 0000:00:03.0 -> ../../../devices/pci0000:00/0000:00:03.0
lrwxrwxrwx 1 root root 0 Aug 19 16:27 0000:00:04.0 -> ../../../devices/pci0000:00/0000:00:04.0
lrwxrwxrwx 1 root root 0 Aug 19 16:27 0000:00:05.0 -> ../../../devices/pci0000:00/0000:00:05.0
lrwxrwxrwx 1 root root 0 Aug 19 16:27 0000:00:06.0 -> ../../../devices/pci0000:00/0000:00:06.0
lrwxrwxrwx 1 root root 0 Aug 19 16:27 0000:00:07.0 -> ../../../devices/pci0000:00/0000:00:07.0
@milantracy Does this mean we can't use --tpuproxy at all with the Docker shim, or is there just some other way I need to invoke it? And if it's not possible, I assume it should work OK if I invoke runsc directly?
afaik, --tpuproxy doesn't work with the docker shim. cc: @manninglucas
I tried the raw runsc in the TPU v5e vm, which worked fine for me, let me know how it goes for you.
@milantracy would you mind sharing the command you're using to start runsc? I'm still having no luck using runsc do:
#/bin/bash
sudo runsc --tpuproxy --root=/home/peyton/tputesting/runroot do --root=/home/peyton/tputesting/jax-rootfs -- env -u LD_PRELOAD /bin/bash
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ ./start.sh
starting container: starting root container: starting sandbox: failed to setupFS: mounting submounts: mount submount "/sys": failed to mount "/sys" (type: sysfs): no such file or directory, opts: &{{false false false false} false {true 0xc000620990} false}
EDIT: I'm also getting the same behavior with runsc run:
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ cat start.sh
#/bin/bash
sudo runsc --root=/home/peyton/tputesting/runroot --tpuproxy run --bundle=/home/peyton/tputesting my-jax-container
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ ./start.sh
running container: starting container: starting root container: starting sandbox: failed to setupFS: mounting submounts: mount submount "/sys": failed to mount "/sys" (type: sysfs): no such file or directory, opts: &{{true false false false} true {true 0xc0005a2420} false}
And my config.json:
{
"ociVersion": "1.0.2",
"process": {
"terminal": true,
"user": {
"uid": 0,
"gid": 0
},
"args": [
"/bin/sh"
],
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"LANG=C.UTF-8",
"PYTHONUNBUFFERED=1"
],
"cwd": "/"
},
"root": {
"path": "jax-rootfs",
"readonly": false
},
"hostname": "jax-container",
"mounts": [
{
"destination": "/proc",
"type": "proc",
"source": "proc"
},
{
"destination": "/dev",
"type": "tmpfs",
"source": "tmpfs",
"options": [
"nosuid",
"strictatime",
"mode=755",
"size=65536k"
]
}
],
"linux": {
"namespaces": [
{
"type": "pid"
},
{
"type": "network"
},
{
"type": "ipc"
},
{
"type": "uts"
},
{
"type": "mount"
}
]
}
}
I've managed to track down the error to this function: https://github.com/google/gvisor/blob/e0643b8ed582cc549272e7788860a5dd4636c06d/pkg/sentry/fsimpl/sys/pci.go#L223
Specifically, this function fails with ENOENT when passed /sys/devices, despite the directory definitely existing:
peyton@t1v-n-901fc2b8-w-0:~/gvisor$ ls -alh /sys
total 4.0K
dr-xr-xr-x 13 root root 0 Aug 20 16:23 .
drwxr-xr-x 19 root root 4.0K Aug 20 16:24 ..
drwxr-xr-x 2 root root 0 Aug 20 16:23 block
drwxr-xr-x 40 root root 0 Aug 20 16:23 bus
drwxr-xr-x 68 root root 0 Aug 20 16:23 class
drwxr-xr-x 4 root root 0 Aug 20 16:23 dev
drwxr-xr-x 15 root root 0 Aug 20 16:23 devices
drwxr-xr-x 6 root root 0 Aug 20 16:23 firmware
drwxr-xr-x 9 root root 0 Aug 20 16:23 fs
drwxr-xr-x 2 root root 0 Aug 20 19:22 hypervisor
drwxr-xr-x 16 root root 0 Aug 20 16:23 kernel
drwxr-xr-x 152 root root 0 Aug 20 16:23 module
drwxr-xr-x 3 root root 0 Aug 20 19:22 power
I'll continue to investigate why this directory can't be opened.
It seems to me that there's something wrong with the mount. When I look inside the sandbox's namespace, /sys does not exist, but I expect it to:
peyton@t1v-n-901fc2b8-w-0:~/gvisor$ sudo ls /proc/75785/root/
etc proc
I'm not sure where to go from here - any pointers on what this should look like would be appreciated.
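For inspecting the sandbox's mount namespace from the host, something like the following may help (a sketch; inspect_ns is a hypothetical helper, the PID is taken from the output above, and it requires root plus the nsenter utility from util-linux):

```shell
# inspect_ns is a hypothetical helper for this investigation. /proc/<pid>/root
# shows what that process sees as its root after chroot/namespace setup;
# nsenter -t <pid> -m runs a command inside its mount namespace.
inspect_ns() {
  pid="$1"
  sudo ls "/proc/$pid/root/"                     # files visible at the sandbox's /
  sudo nsenter -t "$pid" -m -- cat /proc/mounts  # mounts in its namespace
}
# Usage (PID of the runsc-sandbox process, e.g. from: pgrep -f runsc-sandbox):
# inspect_ns 75785
```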
it has been a while since I did it last time; I will share the runsc command with you later.
also, can you share what the /sys directory looks like inside the container?
@milantracy When I don't pass --tpuproxy, this is what it looks like:
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ ./start.sh
Child PID: 82314
Press Enter to continue...
# ls -alh /sys
total 0
drwxr-xr-x 12 root root 0 Aug 20 21:47 .
drwxrwxr-x 2 2004 2004 60 Aug 20 21:47 ..
drwxr-xr-x 2 root root 0 Aug 20 21:47 block
drwxr-xr-x 2 root root 0 Aug 20 21:47 bus
drwxr-xr-x 4 root root 0 Aug 20 21:47 class
drwxr-xr-x 2 root root 0 Aug 20 21:47 dev
drwxr-xr-x 4 root root 0 Aug 20 21:47 devices
drwxr-xr-x 2 root root 0 Aug 20 21:47 firmware
drwxr-xr-x 3 root root 0 Aug 20 21:47 fs
drwxr-xr-x 2 root root 0 Aug 20 21:47 kernel
drwxr-xr-x 2 root root 0 Aug 20 21:47 module
drwxr-xr-x 2 root root 0 Aug 20 21:47 power
When I do pass --tpuproxy, then /sys never gets mounted, so it doesn't exist.
When I spin up a cluster in GKE and run with tpuproxy, this is the sandbox spec that gets used. I would try copying this spec, specifically the mounts and devices sections, and see how that works. I see you don't have a /sys mount in your config.json; you may need to add one explicitly to the spec to get it working properly.
{
"ociVersion": "1.1.0",
"process": {
"user": {
"uid": 0,
"gid": 0,
"additionalGids": [
0
]
},
"args": [
"bash",
"-c",
"python -c 'import jax; print(\"TPU cores:\", jax.device_count())'"
],
"env": [
"PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"HOSTNAME=tpu-gvisor",
"LANG=C.UTF-8",
"GPG_KEY=A035C8C19219BA821ECEA86B64E628F8D684696D",
"PYTHON_VERSION=3.10.14",
"PYTHON_PIP_VERSION=23.0.1",
"PYTHON_SETUPTOOLS_VERSION=65.5.1",
"PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/e03e1607ad60522cf34a92e834138eb89f57667c/public/get-pip.py",
"PYTHON_GET_PIP_SHA256=ee09098395e42eb1f82ef4acb231a767a6ae85504a9cf9983223df0a7cbd35d7",
"TPU_SKIP_MDS_QUERY=true",
"TPU_TOPOLOGY=2x2x1",
"ALT=false",
"TPU_HOST_BOUNDS=1,1,1",
"HOST_BOUNDS=1,1,1",
"TPU_RUNTIME_METRICS_PORTS=8431,8432,8433,8434",
"CHIPS_PER_HOST_BOUNDS=2,2,1",
"TPU_CHIPS_PER_HOST_BOUNDS=2,2,1",
"TPU_WORKER_ID=0",
"TPU_WORKER_HOSTNAMES=localhost",
"TPU_ACCELERATOR_TYPE=v5p-8",
"WRAP=false,false,false",
"TPU_TOPOLOGY_WRAP=false,false,false",
"TPU_TOPOLOGY_ALT=false",
"KUBERNETES_PORT_443_TCP_ADDR=34.118.224.1",
"KUBERNETES_SERVICE_HOST=34.118.224.1",
"KUBERNETES_SERVICE_PORT=443",
"KUBERNETES_SERVICE_PORT_HTTPS=443",
"KUBERNETES_PORT=tcp://34.118.224.1:443",
"KUBERNETES_PORT_443_TCP=tcp://34.118.224.1:443",
"KUBERNETES_PORT_443_TCP_PROTO=tcp",
"KUBERNETES_PORT_443_TCP_PORT=443"
],
"cwd": "/",
"apparmorProfile": "cri-containerd.apparmor.d",
"oomScoreAdj": 1000
},
"root": {
"path": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/27ee155751166dcc9569871355a0f90babbd37f94b11a8879f83757597e7da00/rootfs"
},
"mounts": [
{
"destination": "/proc",
"type": "proc",
"source": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/27ee155751166dcc9569871355a0f90babbd37f94b11a8879f83757597e7da00/proc",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/dev",
"type": "tmpfs",
"source": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/27ee155751166dcc9569871355a0f90babbd37f94b11a8879f83757597e7da00/tmpfs",
"options": [
"nosuid",
"strictatime",
"mode=755",
"size=65536k"
]
},
{
"destination": "/dev/pts",
"type": "devpts",
"source": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/27ee155751166dcc9569871355a0f90babbd37f94b11a8879f83757597e7da00/devpts",
"options": [
"nosuid",
"noexec",
"newinstance",
"ptmxmode=0666",
"mode=0620",
"gid=5"
]
},
{
"destination": "/dev/mqueue",
"type": "mqueue",
"source": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/27ee155751166dcc9569871355a0f90babbd37f94b11a8879f83757597e7da00/mqueue",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/sys",
"type": "sysfs",
"source": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/27ee155751166dcc9569871355a0f90babbd37f94b11a8879f83757597e7da00/sysfs",
"options": [
"nosuid",
"noexec",
"nodev",
"ro"
]
},
{
"destination": "/sys/fs/cgroup",
"type": "cgroup",
"source": "/run/containerd/io.containerd.runtime.v2.task/k8s.io/27ee155751166dcc9569871355a0f90babbd37f94b11a8879f83757597e7da00/cgroup",
"options": [
"nosuid",
"noexec",
"nodev",
"relatime",
"ro"
]
},
{
"destination": "/etc/hosts",
"type": "bind",
"source": "/var/lib/kubelet/pods/0c23b742-a930-45e1-80d3-2b358141671e/etc-hosts",
"options": [
"rbind",
"rprivate",
"rw"
]
},
{
"destination": "/etc/hostname",
"type": "bind",
"source": "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/d711f9789021bb54a955a5c4155b796a79f508db3d9376fafc814ee91a0ce560/hostname",
"options": [
"rbind",
"rprivate",
"rw"
]
},
{
"destination": "/etc/resolv.conf",
"type": "bind",
"source": "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/d711f9789021bb54a955a5c4155b796a79f508db3d9376fafc814ee91a0ce560/resolv.conf",
"options": [
"rbind",
"rprivate",
"rw"
]
},
{
"destination": "/dev/shm",
"type": "tmpfs",
"source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/d711f9789021bb54a955a5c4155b796a79f508db3d9376fafc814ee91a0ce560/shm",
"options": [
"rprivate",
"rw"
]
},
{
"destination": "/run/secrets/kubernetes.io/serviceaccount",
"type": "bind",
"source": "/var/lib/kubelet/pods/0c23b742-a930-45e1-80d3-2b358141671e/volumes/kubernetes.io~projected/kube-api-access-9fzvk",
"options": [
"rbind",
"rprivate",
"ro"
]
}
],
"annotations": {
"dev.gvisor.flag.debug": "true",
"dev.gvisor.flag.debug-log": "/tmp/runsc/",
"dev.gvisor.flag.panic-log": "/tmp/runsc/panic.log",
"dev.gvisor.flag.strace": "true",
"dev.gvisor.internal.tpuproxy": "true",
"io.kubernetes.cri.container-name": "tpu-gvisor",
"io.kubernetes.cri.container-type": "container",
"io.kubernetes.cri.image-name": "gcr.io/gvisor-presubmit/tpu/jax_x86_64:latest",
"io.kubernetes.cri.sandbox-id": "d711f9789021bb54a955a5c4155b796a79f508db3d9376fafc814ee91a0ce560",
"io.kubernetes.cri.sandbox-name": "tpu-gvisor",
"io.kubernetes.cri.sandbox-namespace": "default",
"io.kubernetes.cri.sandbox-uid": "0c23b742-a930-45e1-80d3-2b358141671e"
},
"linux": {
"uidMappings": [
{
"containerID": 0,
"hostID": 0,
"size": 4294967295
}
],
"gidMappings": [
{
"containerID": 0,
"hostID": 0,
"size": 4294967295
}
],
"resources": {
"memory": {},
"cpu": {
"shares": 2,
"period": 100000
},
"unified": {
"memory.oom.group": "1",
"memory.swap.max": "0"
}
},
"cgroupsPath": "kubepods-besteffort-pod0c23b742_a930_45e1_80d3_2b358141671e.slice:cri-containerd:27ee155751166dcc9569871355a0f90babbd37f94b11a8879f83757597e7da00",
"namespaces": [
{
"type": "pid"
},
{
"type": "ipc",
"path": "/proc/8016/ns/ipc"
},
{
"type": "uts",
"path": "/proc/8016/ns/uts"
},
{
"type": "mount"
},
{
"type": "network",
"path": "/proc/8016/ns/net"
},
{
"type": "cgroup"
},
{
"type": "user"
}
],
"devices": [
{
"path": "/dev/vfio/2",
"type": "c",
"major": 245,
"minor": 1,
"fileMode": 438,
"uid": 0,
"gid": 0
},
{
"path": "/dev/vfio/3",
"type": "c",
"major": 245,
"minor": 0,
"fileMode": 438,
"uid": 0,
"gid": 0
},
{
"path": "/dev/vfio/0",
"type": "c",
"major": 245,
"minor": 3,
"fileMode": 438,
"uid": 0,
"gid": 0
},
{
"path": "/dev/vfio/1",
"type": "c",
"major": 245,
"minor": 2,
"fileMode": 438,
"uid": 0,
"gid": 0
},
{
"path": "/dev/vfio/vfio",
"type": "c",
"major": 10,
"minor": 196,
"fileMode": 438,
"uid": 0,
"gid": 0
}
]
}
}
@manninglucas thanks for the config! I've tried this with a /sys mount, and I'm still getting the same error:
{
"ociVersion": "1.0.2",
"process": {
"terminal": true,
"user": {
"uid": 0,
"gid": 0
},
"args": [
"/bin/sh"
],
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"LANG=C.UTF-8",
"PYTHONUNBUFFERED=1"
],
"cwd": "/"
},
"root": {
"path": "jax-rootfs",
"readonly": false
},
"hostname": "jax-container",
"mounts": [
{
"destination": "/proc",
"type": "proc",
"source": "proc"
},
{
"destination": "/dev",
"type": "tmpfs",
"source": "tmpfs",
"options": [
"nosuid",
"strictatime",
"mode=755",
"size=65536k"
]
},
{
"destination": "/sys",
"type": "sysfs",
"source": "/sys",
"options": [
"nosuid",
"noexec",
"nodev",
"ro"
]
}
],
"linux": {
"namespaces": [
{
"type": "pid"
},
{
"type": "network"
},
{
"type": "ipc"
},
{
"type": "uts"
},
{
"type": "mount"
}
]
}
}
and the run:
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ ./start.sh
Child PID: 235453
Press Enter to continue...
running container: starting container: starting root container: starting sandbox: failed to setupFS: mounting submounts: mount submount "/sys": failed to mount "/sys" (type: sysfs): no such file or directory, opts: &{{true false false false} true {true 0xc00059e8a0} false}
For what it's worth, this error message is not coming from the inner container - it's coming from the runsc-sandbox process, which is in its own mount namespace that appears to contain only /etc and /proc. This leads me to believe the sandbox process needs to have /sys mirrored in, but I'm not sure how to do that.
From my testing, I'm actually not sure how the GKE sandbox example works. It looks to me like /sys isn't mounted in the sandbox's namespace, so I'm surprised that the hostDirEntries call doesn't fail. Are you avoiding putting the sandbox process in its own namespace or something?
I see the issue - the code is looking for TPU devices at specific paths to decide whether it should bind them into the container. The issue is that in GCE VMs, those paths don't exist! I'm not sure how they work in the first place then, though :)
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ sudo find /dev | grep vfio
/dev/vfio
/dev/vfio/vfio
peyton@t1v-n-901fc2b8-w-0:~/tputesting$ sudo find /dev | grep accel
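The two find commands above can be rolled into one probe covering both device layouts seen in this thread (which layout gVisor actually requires is inferred from the discussion, not from its source):

```shell
# probe_tpu_nodes reports per-device nodes under /dev/vfio (vfio layout) and
# /dev/accel* (the older accel layout). It always prints at least one line,
# falling back to a "none found" message.
probe_tpu_nodes() {
  found=0
  for p in /dev/vfio/[0-9]* /dev/accel*; do
    [ -e "$p" ] && { echo "device: $p"; found=1; }
  done
  [ "$found" -eq 1 ] || echo "no per-device nodes found"
}
probe_tpu_nodes
```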
That behavior is very strange to me. FWIW, here's what I see in my VM when I run find /dev/ | grep vfio:
/dev/vfio
/dev/vfio/0
/dev/vfio/1
/dev/vfio/2
/dev/vfio/3
/dev/vfio/vfio
What do you get when you run an unsandboxed TPU workload? How did you create your TPU VM?
@manninglucas It turns out that at least part of this issue was the TPU VM image I was using. I was using tpu-vm-base, which apparently is quite out of date:
https://github.com/google/jax/issues/13260
I've now switched to tpu-ubuntu2204-base:
➜ ~ gcloud compute tpus tpu-vm create another-peyton-tpu \
--zone=us-central1-a \
--accelerator-type=v5litepod-1 \
--version=tpu-ubuntu2204-base \
--project=<redacted>
While this is getting further, it's now failing at a later step during setup:
W0822 17:59:26.377299 62469 util.go:64] FATAL ERROR: error setting up chroot: error configuring chroot for TPU devices: extracting TPU device minor: open /sys/class/vfio-dev/vfio0/device/vendor: no such file or directory
error setting up chroot: error configuring chroot for TPU devices: extracting TPU device minor: open /sys/class/vfio-dev/vfio0/device/vendor: no such file or directory
Do you know what image the GKE VMs are using?
I may need to use v2-alpha-tpuv5-lite. I'll try that and get back to you. The fact that device mounting is different depending on the image used is really surprising to me, and it's even more surprising that you're allowed to attach TPUs to incompatible images at all.
https://cloud.google.com/tpu/docs/runtimes#pytorch_and_jax
@manninglucas I've tried the new image with no luck. It looks like the device layout still doesn't match what gVisor expects. If you know which VM image GKE uses, that would be helpful. Here is some output:
peyton@t1v-n-9becfdd7-w-0:~/tputesting$ python3 -c "import jax; print(jax.device_count()); print(repr(jax.numpy.add(1, 1)))"
1
Array(2, dtype=int32, weak_type=True)
peyton@t1v-n-9becfdd7-w-0:~/tputesting$ sudo find /sys/class/vfio/
/sys/class/vfio/
/sys/class/vfio/0
peyton@t1v-n-9becfdd7-w-0:~/tputesting$ sudo find /dev/vfio/
/dev/vfio/
/dev/vfio/0
/dev/vfio/vfio
peyton@t1v-n-9becfdd7-w-0:~/tputesting$ ./start.sh
running container: creating container: cannot create sandbox: cannot read client sync file: waiting for sandbox to start: EOF
And here are the relevant logs again:
I0822 20:44:05.045982 81014 main.go:201] **************** gVisor ****************
I0822 20:44:05.046877 81014 boot.go:264] Setting product_name: "Google Compute Engine"
I0822 20:44:05.046939 81014 boot.go:274] Setting host-shmem-huge: "never"
W0822 20:44:05.047571 81014 specutils.go:129] noNewPrivileges ignored. PR_SET_NO_NEW_PRIVS is assumed to always be set.
I0822 20:44:05.047595 81014 chroot.go:91] Setting up sandbox chroot in "/tmp"
I0822 20:44:05.047707 81014 chroot.go:36] Mounting "/proc" at "/tmp/proc"
W0822 20:44:05.047808 81014 util.go:64] FATAL ERROR: error setting up chroot: error configuring chroot for TPU devices: extracting TPU device minor: open /sys/class/vfio-dev/vfio0/device/device: no such file or directory
error setting up chroot: error configuring chroot for TPU devices: extracting TPU device minor: open /sys/class/vfio-dev/vfio0/device/device: no such file or directory
I believe the image is based on COS; it should be something like "tpu-vm-cos-109".
@manninglucas Nice those paths exist on that image:
peyton@t1v-n-00ca9571-w-0 ~ $ sudo find /dev/vfio/
/dev/vfio/
/dev/vfio/0
/dev/vfio/vfio
peyton@t1v-n-00ca9571-w-0 ~ $ sudo find /sys/class/vfio-dev/
/sys/class/vfio-dev/
/sys/class/vfio-dev/vfio0
This image is painful to work with because of the read-only filesystem, though. I may have to bite the bullet and figure out how to do the device mapping on v2-alpha-tpuv5-lite.
I will have a patch up soon that will hopefully fix the issue for the ubuntu image you're using. Seems like /sys/class/vfio-dev/vfio0 just corresponds to /sys/class/vfio/0.
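The correspondence, as I understand it (assumption: a simple numeric mapping between the two class-entry names, per the sentence above):

```shell
# vfio_dev_to_class maps a /sys/class/vfio-dev entry name ("vfio0") to the
# matching /sys/class/vfio entry name ("0") by stripping the "vfio" prefix.
# This mapping is assumed from the comment above, not taken from the patch.
vfio_dev_to_class() {
  echo "${1#vfio}"
}
vfio_dev_to_class vfio0
```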
This image is painful to work with because of the read-only filesystem, though. I may have to bite the bullet and figure out how to do the device mapping on
v2-alpha-tpuv5-lite.
You can remount the filesystem with mount -o remount,rw as root.
btw, COS has a tool called cos-toolbox that works around this issue and makes the image easier to work with in general. It should be available by default.
Hey @pawalt were you able to get this working for your needs?
hey @manninglucas we deprioritized getting this working. I think Peyton was maybe going to restart the effort when the patch landed: https://github.com/google/gvisor/issues/10795#issuecomment-2305739322.
Gotcha. The patch has finally landed (290789b), let me know when you're able to test this out again!
@manninglucas thanks! The container is now starting up. I'm seeing a different issue when trying to use jax in python. Not sure if you want to make that part of this issue or another one:
peyton@t1v-n-1f714773-w-0:~/tputesting$ ./start.sh
# python3
Python 3.11.9 (main, Aug 13 2024, 02:18:20) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
>>> jax.device_count()
Failed to get TPU metadata (tpu-env) from instance metadata for variable CHIPS_PER_HOST_BOUNDS: INTERNAL: Couldn't connect to server
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/gcp_metadata_utils.cc:99
learning/45eac/tfrc/runtime/env_var_utils.cc:50
Failed to get TPU metadata (tpu-env) from instance metadata for variable HOST_BOUNDS: INTERNAL: Couldn't connect to server
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/gcp_metadata_utils.cc:99
learning/45eac/tfrc/runtime/env_var_utils.cc:50
^C^C^C^CFailed to get TPU metadata (tpu-env) from instance metadata for variable ALT: INTERNAL: Couldn't connect to server
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/gcp_metadata_utils.cc:99
learning/45eac/tfrc/runtime/env_var_utils.cc:50
The command hangs on the jax.device_count() call, and it loops, spitting out a lot of these logs:
I0917 13:42:09.296208 1 strace.go:576] [ 2: 26] python3 E futex(0x7ef41c4bce54, FUTEX_WAIT_BITSET|FUTEX_PRIVATE_FLAG, 0x0, 0x7ef3cf9feb60 {sec=182 nsec=996258753}, 0x0, 0xffffffff)
I0917 13:42:09.301492 1 strace.go:614] [ 2: 26] python3 X futex(0x7ef41c4bce54, FUTEX_WAIT_BITSET|FUTEX_PRIVATE_FLAG, 0x0, 0x7ef3cf9feb60 {sec=182 nsec=996258753}, 0x0, 0xffffffff) = 0 (0x0) errno=110 (connection timed out) (5.274369ms)
I0917 13:42:09.301517 1 strace.go:576] [ 2: 26] python3 E futex(0x7ef41c4bce58, FUTEX_WAKE|FUTEX_PRIVATE_FLAG, 0x1, null, 0x3b66c325, 0x16e)
I0917 13:42:09.301526 1 strace.go:614] [ 2: 26] python3 X futex(0x7ef41c4bce58, FUTEX_WAKE|FUTEX_PRIVATE_FLAG, 0x1, null, 0x3b66c325, 0x16e) = 0 (0x0) (840ns)
I0917 13:42:09.301539 1 strace.go:576] [ 2: 26] python3 E futex(0x7ef41c4bce54, FUTEX_WAIT_BITSET|FUTEX_PRIVATE_FLAG, 0x0, 0x7ef3cf9feb60 {sec=183 nsec=1590373}, 0x0, 0xffffffff)
I0917 13:42:09.306891 1 strace.go:614] [ 2: 26] python3 X futex(0x7ef41c4bce54, FUTEX_WAIT_BITSET|FUTEX_PRIVATE_FLAG, 0x0, 0x7ef3cf9feb60 {sec=183 nsec=1590373}, 0x0, 0xffffffff) = 0 (0x0) errno=110 (connection timed out) (5.341239ms)
I0917 13:42:09.306924 1 strace.go:576] [ 2: 26] python3 E futex(0x7ef41c4bce58, FUTEX_WAKE|FUTEX_PRIVATE_FLAG, 0x1, null, 0x1e73de, 0x170)
I0917 13:42:09.306938 1 strace.go:614] [ 2: 26] python3 X futex(0x7ef41c4bce58, FUTEX_WAKE|FUTEX_PRIVATE_FLAG, 0x1, null, 0x1e73de, 0x170) = 0 (0x0) (1.32µs)
I0917 13:42:09.306953 1 strace.go:576] [ 2: 26] python3 E futex(0x7ef41c4bce54, FUTEX_WAIT_BITSET|FUTEX_PRIVATE_FLAG, 0x0, 0x7ef3cf9feb60 {sec=183 nsec=6995742}, 0x0, 0xffffffff)
My startup script:
sudo runsc --debug \
--debug-log=/home/peyton/tputesting/logs/ \
--strace \
--root=/home/peyton/tputesting/runroot \
--tpuproxy \
run \
--bundle=/home/peyton/tputesting \
my-jax-container
I'm using a jax image exported from the build below:
FROM python:3.11
RUN pip install jax[tpu] -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
@pawalt let's follow up with a new issue. Looks like libtpu is looking for some metadata that might be stored in an environment variable. Can you run env on the host?
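For reference, the variable names in those errors match entries in the GKE spec quoted earlier in this thread, so one thing worth trying is setting them in the container environment so libtpu doesn't have to reach the metadata server. This is a hedged guess: the values below are placeholders extrapolated for a v5litepod-1 (the GKE spec above is for a v5p-8) and may well be wrong.

```shell
# Hypothetical workaround: variable names come from the GKE spec quoted
# earlier in this issue; values for v5litepod-1 are unverified guesses.
export TPU_SKIP_MDS_QUERY=true
export CHIPS_PER_HOST_BOUNDS=1,1,1
export HOST_BOUNDS=1,1,1
export ALT=false
```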
@manninglucas I've opened #10923