nvproxy: unknown control command 0x3d05
Description
We're doing multi-GPU training on A100s and seeing that it gets stuck under gVisor. I tried the program below on the following GPUs within Modal:
- A100 40 GiB (Oracle Cloud) ❌
- H100 (a3-highgpu-8g) ❌
- A10G ✔️
- T4 ✔️
Both the H100 and A100 run into these unknown control commands:
W0509 01:16:28.218428 1772489 frontend.go:521] [ 6: 20] nvproxy: unknown control command 0x3d05 (paramsSize=24)
W0509 01:16:28.218780 1772489 frontend.go:521] [ 5: 22] nvproxy: unknown control command 0x3d05 (paramsSize=24)
Control command 0x3d05 is NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD -> https://github.com/NVIDIA/open-gpu-kernel-modules/blob/083cd9cf17ab95cd6f9fb50a5349c21eaa2f7d4b/src/common/sdk/nvidia/inc/ctrl/ctrl0000/ctrl0000unix.h#L146-L147
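For reference, the parameter layout behind that command (transcribed from the linked ctrl0000unix.h header) totals 24 bytes, which matches the paramsSize=24 in the warnings above. Here is a rough Go mirror of the header structs, purely as an illustration; these are not gVisor's nvgpu definitions:

package main

import (
    "fmt"
    "unsafe"
)

// NV0000_CTRL_OS_UNIX_EXPORT_OBJECT identifies the RM object to export.
// In the header, "data" is a union whose only member is rmObject.
type NV0000_CTRL_OS_UNIX_EXPORT_OBJECT struct {
    Type    uint32 // NV0000_CTRL_OS_UNIX_EXPORT_OBJECT_TYPE
    HDevice uint32 // data.rmObject.hDevice
    HParent uint32 // data.rmObject.hParent
    HObject uint32 // data.rmObject.hObject
}

// NV0000_CTRL_OS_UNIX_EXPORT_OBJECT_TO_FD_PARAMS is the ioctl payload:
// the object to export plus the /dev/nvidiactl fd it gets exported to.
type NV0000_CTRL_OS_UNIX_EXPORT_OBJECT_TO_FD_PARAMS struct {
    Object NV0000_CTRL_OS_UNIX_EXPORT_OBJECT // IN
    FD     int32                             // IN
    Flags  uint32                            // IN
}

func main() {
    // Prints 24, consistent with the paramsSize reported by nvproxy.
    fmt.Println(unsafe.Sizeof(NV0000_CTRL_OS_UNIX_EXPORT_OBJECT_TO_FD_PARAMS{}))
}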
Steps to reproduce
FROM nvidia/cuda:12.2.0-devel-ubuntu20.04
RUN apt-get update && apt-get install --yes python3 python3-distutils clang wget vim
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python3 get-pip.py
RUN python3 -m pip install clang~=10.0.1 # must match version of `clang` installed above.
RUN python3 -m pip install --ignore-installed torch torchvision lightning numpy memory_profiler
COPY <<EOF repro.py
print("Hello from inside container.")
import psutil
current_process = psutil.Process()
parent_process = current_process.parent()
print(f"Processes: {current_process=} {parent_process=}")
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L
from memory_profiler import profile
from torchvision.datasets import CIFAR100
from torchvision import transforms
from torchvision import models
from torch.utils.data import DataLoader
class MagixNet(L.LightningModule):
    def __init__(self, nbr_cat):
        super().__init__()
        module = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        module.fc = nn.Linear(2048, nbr_cat)
        self.module = module

    def forward(self, x):
        return self.module(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

def prepare_data():
    pipeline = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    train_ds = CIFAR100('data', train=True, download=True, transform=pipeline)
    train_dl = DataLoader(train_ds, batch_size=128, num_workers=4)
    val_ds = CIFAR100('data', train=False, download=True, transform=pipeline)
    val_dl = DataLoader(val_ds, batch_size=128, num_workers=4)
    return train_dl, val_dl
torch.set_float32_matmul_precision('medium')
train_dl, val_dl = prepare_data()
model = MagixNet(100)
trainer = L.Trainer(max_epochs=1, strategy="ddp_notebook")
start = time.time()
trainer.fit(model, train_dl, val_dl)
print(f"Training duration (seconds): {time.time() - start:.2f}")
EOF
ENTRYPOINT ["python3", "repro.py"]
runsc version
runsc version 6e61813c1b37
spec: 1.1.0-rc.1
docker version (if using docker)
N/A
uname
No response
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
- https://modal-public-assets.s3.amazonaws.com/runsc.log.20240509-011604.390806.boot.txt.zip
- https://modal-public-assets.s3.amazonaws.com/runsc.log.20240509-005323.597890.boot.txt.zip
The reproduction program is almost identical to the one in https://github.com/google/gvisor/issues/9827, which is why I revisited that issue's test.
This seems to be running fine for me on an A100-40GB machine in GCE on driver version 535.104.05:
(base) ayushranjan_google_com@a100:~/issue10413$ docker run --runtime=runsc --shm-size=128g --gpus=all --rm issue10413:latest
Hello from inside container.
Processes: current_process=psutil.Process(pid=1, name='python3', status='running', started='15:24:33') parent_process=None
Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to data/cifar-100-python.tar.gz
100%|██████████| 169001437/169001437 [00:18<00:00, 9193099.59it/s]
Extracting data/cifar-100-python.tar.gz to data
Files already downloaded and verified
Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:00<00:00, 156MB/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
Missing logger folder: /lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
----------------------------------
0 | module | ResNet | 23.7 M
----------------------------------
23.7 M Trainable params
0 Non-trainable params
23.7 M Total params
94.852 Total estimated model params size (MB)
Epoch 0: 100%|██████████| 391/391 [01:08<00:00, 5.68it/s, v_num=0]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████| 391/391 [01:09<00:00, 5.62it/s, v_num=0]
-------------------------------------------------------------------------------
repro.py 63 <module>
print(f"Training duration (seconds): {time.time() - start:2.f}")
ValueError:
Format specifier missing precision
(base) ayushranjan_google_com@a100:~/issue10413$ nvidia-smi
Thu May 9 15:27:46 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB Off | 00000000:00:04.0 Off | 0 |
| N/A 35C P0 49W / 400W | 4MiB / 40960MiB | 27% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Please note:
- There seems to be an issue with the last print statement in repro.py (the `:2.f` format spec is invalid; it should be `:.2f`). Other than that, the application seems to work fine.
- I am using `--shm-size=128g` as per https://github.com/google/gvisor/issues/9827#issuecomment-1877649009.
- The debug logs don't have any `nvproxy: unknown` lines.
So maybe you are using a different driver version? Or maybe something to do with the Oracle Cloud environment?
- Oh yep, fixed that in the original description.
- Our `--shm-size` is also set very large. On Oracle workers it's around 1657 GB.
We have Driver Version: 535.129.03, CUDA Version: 12.2. Sorry, I should have included that in the issue originally!
On H100 worker:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:04:00.0 Off | 0 |
| N/A 36C P0 113W / 700W | 72459MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:05:00.0 Off | 0 |
| N/A 34C P0 117W / 700W | 72507MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:0A:00.0 Off | 0 |
| N/A 35C P0 114W / 700W | 72507MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:0B:00.0 Off | 0 |
| N/A 33C P0 111W / 700W | 72587MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:84:00.0 Off | 0 |
| N/A 60C P0 578W / 700W | 71533MiB / 81559MiB | 95% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:85:00.0 Off | 0 |
| N/A 34C P0 112W / 700W | 841MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:8A:00.0 Off | 0 |
| N/A 34C P0 114W / 700W | 16463MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:8B:00.0 Off | 0 |
| N/A 34C P0 111W / 700W | 2405MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 759790 C /opt/conda/bin/python3.10 72446MiB |
We use the same driver version across all GPU workers.
Updated the driver version and still cannot repro the failure on my GCE VM:
(base) ayushranjan_google_com@a100:~/issue10413$ docker run --runtime=runsc --shm-size=128g --gpus=all --rm issue10413:latest
Hello from inside container.
Processes: current_process=psutil.Process(pid=1, name='python3', status='running', started='16:01:41') parent_process=None
Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to data/cifar-100-python.tar.gz
100%|██████████| 169001437/169001437 [00:18<00:00, 9140159.09it/s]
Extracting data/cifar-100-python.tar.gz to data
Files already downloaded and verified
Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:01<00:00, 74.1MB/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
Missing logger folder: /lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
----------------------------------
0 | module | ResNet | 23.7 M
----------------------------------
23.7 M Trainable params
0 Non-trainable params
23.7 M Total params
94.852 Total estimated model params size (MB)
Epoch 0: 100%|██████████| 391/391 [01:08<00:00, 5.68it/s, v_num=0]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████| 391/391 [01:09<00:00, 5.62it/s, v_num=0]
Training duration (seconds): 72.35
Surprisingly, this workload gets stuck without gVisor. I will add NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD to nvproxy though, hopefully it resolves whatever failure you are seeing.
Surprisingly, this workload gets stuck without gVisor.
Interesting. This may be the same problem as in https://github.com/google/gvisor/issues/9827 where the test got stuck on runc.
The program doesn't get stuck on runc in Modal. It completes in around 60s. A 72.35 second completion for gVisor lines up with that.
I will add NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD to nvproxy though, hopefully it resolves whatever failure you are seeing.
🙏
@thundergolfer Let me know if https://github.com/google/gvisor/commit/e9b3218681cdfac0989e95b27642e4aec67d0ea6 fixes the issue. If so, please close this.
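(For anyone curious why this particular command needs explicit nvproxy support rather than a blind passthrough: the params embed a file descriptor, and an fd number inside the sandbox generally differs from the host fd backing it, so a proxy has to rewrite it before forwarding the ioctl and restore it afterwards. The sketch below is a hypothetical illustration of that idea; none of the names are gVisor's, and it is not the code in that commit.)

package main

import "fmt"

// exportToFDParams is a compact stand-in for the 24-byte
// NV0000_CTRL_OS_UNIX_EXPORT_OBJECT_TO_FD_PARAMS layout shown earlier.
type exportToFDParams struct {
    ObjectType uint32
    HDevice    uint32
    HParent    uint32
    HObject    uint32
    FD         int32 // application-side fd for /dev/nvidiactl
    Flags      uint32
}

// translateFD stands in for the sandbox's fd-table lookup: it maps an
// application fd number to the host fd that backs it.
func translateFD(appFD int32, fdTable map[int32]int32) (int32, error) {
    hostFD, ok := fdTable[appFD]
    if !ok {
        return -1, fmt.Errorf("fd %d is not a proxied nvidia frontend fd", appFD)
    }
    return hostFD, nil
}

// proxyExportObjectToFD rewrites the embedded fd to the host fd, forwards the
// control call, and then restores the application's fd number so the
// copied-back params never expose host fd numbers to the application.
func proxyExportObjectToFD(params *exportToFDParams, fdTable map[int32]int32, forward func(*exportToFDParams) error) error {
    appFD := params.FD
    hostFD, err := translateFD(appFD, fdTable)
    if err != nil {
        return err
    }
    params.FD = hostFD
    err = forward(params) // the real NV_ESC_RM_CONTROL ioctl would happen here
    params.FD = appFD
    return err
}

func main() {
    fdTable := map[int32]int32{5: 37} // pretend app fd 5 is backed by host fd 37
    params := &exportToFDParams{FD: 5}
    err := proxyExportObjectToFD(params, fdTable, func(p *exportToFDParams) error {
        fmt.Printf("forwarding control call with host fd %d\n", p.FD)
        return nil
    })
    fmt.Println("err:", err, "application still sees fd:", params.FD)
}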
Are you still hitting this issue?
No we're not, happy to have it closed 👍