aws-virtual-gpu-device-plugin
GPU memory errors lead to a hanging GPU
Hello, when using this plugin I was able to run PyTorch models on a shared GPU and everything worked smoothly, but in some cases, when one pod starts using a lot of memory, instead of the classic CUDA memory error the GPU starts spitting:

`CUDA error: an illegal memory access was encountered`

Once it hits that point, it's impossible to get access to the GPU through PyTorch anymore. The problem could not be reproduced with the standard nvidia-plugin, hence this issue.
- Using a T4 on `g4dn` instances.
- `nvidia-smi` runs correctly (shows the GPU with near 0 memory usage, 0% utilization, and compute mode `E. Process`).
- `python -c "import torch; torch.cuda.is_available()"` hangs instead, and it's impossible to send data to the GPU either (`torch.zeros((2, 2)).cuda()` hangs too). Even Ctrl+C cannot kill the process at that point.
- The plugin daemonset logs:
[2021-01-22 16:48:44.478 Other 70] Receive command failed, assuming client exit
[2021-01-22 16:48:44.478 Other 70] Volta MPS: Client disconnected
[2021-01-22 16:48:44.493 Other 70] Receive command failed, assuming client exit
[2021-01-22 16:48:44.493 Other 70] Volta MPS: Client disconnected
[2021-01-22 16:48:47.756 Control 1] Accepting connection...
[2021-01-22 16:48:47.756 Control 1] User did not send valid credentials
[2021-01-22 16:48:47.756 Control 1] Accepting connection...
[2021-01-22 16:48:47.756 Control 1] NEW CLIENT 0 from user 0: Server already exists
[2021-01-22 16:48:47.756 Other 70] Volta MPS Server: Received new client request
[2021-01-22 16:48:47.757 Other 70] MPS Server: worker created
[2021-01-22 16:48:47.757 Other 70] Volta MPS: Creating worker thread
[2021-01-22 16:49:00.586 Control 1] Accepting connection...
[2021-01-22 16:49:00.586 Control 1] NEW CLIENT 0 from user 0: Server already exists
[2021-01-22 16:49:00.586 Other 70] Volta MPS Server: Received new client request
[2021-01-22 16:49:00.586 Other 70] MPS Server: worker created
[2021-01-22 16:49:00.586 Other 70] Volta MPS: Creating worker thread
[2021-01-22 16:49:00.586 Other 70] Volta MPS: Device Tesla T4 (uuid 0x77b4f43a-0x2b8e3746-0x77cf831c-0x9b534399) is associated
[2021-01-22 16:49:07.339 Control 1] Accepting connection...
[2021-01-22 16:49:07.339 Control 1] NEW CLIENT 0 from user 0: Server already exists
[2021-01-22 16:49:07.339 Other 70] Volta MPS Server: Received new client request
[2021-01-22 16:49:07.339 Other 70] MPS Server: worker created
[2021-01-22 16:49:07.339 Other 70] Volta MPS: Creating worker thread
[2021-01-22 16:49:07.339 Other 70] Volta MPS: Device Tesla T4 (uuid 0x77b4f43a-0x2b8e3746-0x77cf831c-0x9b534399) is associated
PyTorch does not yet support memory limitation (https://github.com/pytorch/pytorch/blob/5f07b53ec2074fc9fd6b4fe72d6cee4d484b917a/torch/cuda/memory.py#L75).
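(For what it's worth, later PyTorch releases, 1.8 and up, added `torch.cuda.set_per_process_memory_fraction`. A minimal sketch, assuming such a version is available; the 0.5 fraction is just an example value:)

```python
# Sketch: cap this process's CUDA allocations so that going over the cap raises
# a normal, catchable out-of-memory error instead of growing past the shared-GPU slice.
# Requires PyTorch >= 1.8.
import torch

if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.5, device=0)  # 0.5 is an example value
    try:
        big = torch.empty(10_000_000_000, device="cuda")  # deliberately too large
    except RuntimeError as err:
        print("Allocation rejected by the allocator:", err)
```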
Is there any way, in the meantime, to understand/debug the problem and to clear the GPU/the pod/the daemonset so that they start working again? (And what are those illegal accesses?) Do you need any more info?
Nuking both the daemonset and the guilty pod seems to work, but that's a bit heavy-handed at the moment.
@Narsil @marbo001 I'm seeing the same issue on some nodes in my cluster. There is no pattern to it. It happens occasionally and randomly. Killing both the aws virtual gpu pod and the app pod consistently fixes the issue but I'm looking for a more permanent solution that also doesn't require manual resolution. Have you found a solution for this?
@amybachir Enforce GPU RAM limitations on your various pods; this is the only way we were able to work around the issue. You can still get random errors, though.
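For example (a sketch, assuming PyTorch >= 1.8; `GPU_MEM_FRACTION` is a made-up env var you would set in each pod spec):

```python
# Sketch: each pod's process caps its own CUDA memory at start-up.
# GPU_MEM_FRACTION is a hypothetical env var set in the pod spec,
# e.g. "0.25" to allow a quarter of the card.
import os
import torch

def apply_gpu_memory_cap(default_fraction: float = 0.25) -> None:
    fraction = float(os.environ.get("GPU_MEM_FRACTION", default_fraction))
    if torch.cuda.is_available():
        torch.cuda.set_per_process_memory_fraction(fraction, torch.cuda.current_device())

apply_gpu_memory_cap()
```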
`pytorch` uses a ring arena for its allocations, so be sure to empty the cache regularly (`torch.cuda.empty_cache`, iirc) when using multiple pods; otherwise the ring will almost always outgrow what you expect it should use.
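Something like this in the serving loop (`model` and `batches` are placeholders, and the every-10-batches interval is just illustrative):

```python
# Sketch: periodically hand cached-but-unused blocks back to the driver so other
# pods sharing the GPU can use them. empty_cache() does not free memory that live
# tensors still occupy; it only releases PyTorch's idle cached blocks.
import torch

@torch.no_grad()
def run_inference(model, batches, flush_every: int = 10):
    model = model.cuda().eval()
    outputs = []
    for i, batch in enumerate(batches):
        outputs.append(model(batch.cuda()).cpu())
        if (i + 1) % flush_every == 0:
            torch.cuda.empty_cache()  # release idle cached memory back to CUDA
    return outputs
```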