
GPU memory errors lead to a hanging GPU

Open · Narsil opened this issue 4 years ago · 2 comments

Hello, when using this plugin I was able to run PyTorch models on a shared GPU and everything worked smoothly. But in some cases, when one pod starts using a lot of memory, instead of the classic CUDA out-of-memory error the GPU starts spitting out: CUDA error: an illegal memory access was encountered.

Once it hits that point it's impossible to access the GPU through PyTorch anymore. The problem could not be reproduced with the standard NVIDIA device plugin, hence this issue.

  • Using T4 GPUs on g4dn instances.
  • nvidia-smi runs correctly (it shows the GPU with near-zero memory usage, 0% utilization, and the E. Process entry).
  • python -c "import torch; torch.cuda.is_available()" hangs instead, and it's impossible to send data to the GPU either (torch.zeros((2, 2)).cuda() hangs too); a probe sketch that times out instead of hanging is shown after the log excerpt below. Even Ctrl+C cannot kill the process at that point.
  • the plugin DaemonSet logs:
[2021-01-22 16:48:44.478 Other    70] Receive command failed, assuming client exit
[2021-01-22 16:48:44.478 Other    70] Volta MPS: Client disconnected
[2021-01-22 16:48:44.493 Other    70] Receive command failed, assuming client exit
[2021-01-22 16:48:44.493 Other    70] Volta MPS: Client disconnected
[2021-01-22 16:48:47.756 Control     1] Accepting connection...
[2021-01-22 16:48:47.756 Control     1] User did not send valid credentials
[2021-01-22 16:48:47.756 Control     1] Accepting connection...
[2021-01-22 16:48:47.756 Control     1] NEW CLIENT 0 from user 0: Server already exists
[2021-01-22 16:48:47.756 Other    70] Volta MPS Server: Received new client request
[2021-01-22 16:48:47.757 Other    70] MPS Server: worker created
[2021-01-22 16:48:47.757 Other    70] Volta MPS: Creating worker thread
[2021-01-22 16:49:00.586 Control     1] Accepting connection...
[2021-01-22 16:49:00.586 Control     1] NEW CLIENT 0 from user 0: Server already exists
[2021-01-22 16:49:00.586 Other    70] Volta MPS Server: Received new client request
[2021-01-22 16:49:00.586 Other    70] MPS Server: worker created
[2021-01-22 16:49:00.586 Other    70] Volta MPS: Creating worker thread
[2021-01-22 16:49:00.586 Other    70] Volta MPS: Device Tesla T4 (uuid 0x77b4f43a-0x2b8e3746-0x77cf831c-0x9b534399) is associated
[2021-01-22 16:49:07.339 Control     1] Accepting connection...
[2021-01-22 16:49:07.339 Control     1] NEW CLIENT 0 from user 0: Server already exists
[2021-01-22 16:49:07.339 Other    70] Volta MPS Server: Received new client request
[2021-01-22 16:49:07.339 Other    70] MPS Server: worker created
[2021-01-22 16:49:07.339 Other    70] Volta MPS: Creating worker thread
[2021-01-22 16:49:07.339 Other    70] Volta MPS: Device Tesla T4 (uuid 0x77b4f43a-0x2b8e3746-0x77cf831c-0x9b534399) is associated
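
To make the hang visible rather than blocking forever, here is a minimal probe sketch (an illustration, not part of the plugin): it runs the allocation in a child process with a timeout, and it assumes the hung child can still be killed, which may not hold if it is stuck in an uninterruptible driver call:

# Probe the GPU from a child process with a timeout, so a wedged MPS/driver
# state is reported instead of hanging the caller forever.
import multiprocessing as mp

def _probe(q):
    import torch  # imported in the child so any hang stays in the child
    try:
        torch.zeros((2, 2)).cuda()  # force a real allocation on the device
        q.put("ok")
    except RuntimeError as exc:
        q.put(f"cuda error: {exc}")

def gpu_healthy(timeout_s=10.0):
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=_probe, args=(q,))
    p.start()
    p.join(timeout_s)
    if p.is_alive():
        # The allocation hung. kill() may itself fail if the child is stuck
        # in an uninterruptible driver call, matching the Ctrl+C symptom above.
        p.kill()
        return False
    return (not q.empty()) and q.get() == "ok"

if __name__ == "__main__":
    print("GPU healthy:", gpu_healthy())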

PyTorch does not yet support memory limitation (https://github.com/pytorch/pytorch/blob/5f07b53ec2074fc9fd6b4fe72d6cee4d484b917a/torch/cuda/memory.py#L75)
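
For PyTorch releases that ship torch.cuda.set_per_process_memory_fraction (added after this issue was opened), a minimal sketch of capping a pod's share from the application side; the 0.25 fraction and the allocation size are only illustrative:

# Cap this process's share of device 0 so a runaway allocation fails with a
# clean out-of-memory error instead of corrupting the shared MPS context.
# Requires a PyTorch release that ships set_per_process_memory_fraction.
import torch

if torch.cuda.is_available():
    # Illustrative: ~25% of the card, e.g. four pods sharing one T4.
    torch.cuda.set_per_process_memory_fraction(0.25, device=0)
    try:
        # Deliberately oversized allocation (~8 GB of float32) to show the cap.
        x = torch.empty(2_000_000_000, device="cuda")
    except RuntimeError as exc:  # OOM is raised as RuntimeError on older releases
        print("allocation rejected by the per-process cap:", exc)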

In the meantime, is there any way to understand/debug the problem and to clear the GPU/the pod/the DaemonSet so that they start working again? (And what are those illegal accesses?) Do you need any more info?

Nuking both the DaemonSet and the guilty pod seems to work, but that's a bit heavy-handed at the moment.

Narsil · Jan 22 '21 16:01

@Narsil @marbo001 I'm seeing the same issue on some nodes in my cluster. There is no pattern to it; it happens occasionally and randomly. Killing both the AWS virtual GPU pod and the app pod consistently fixes the issue, but I'm looking for a permanent solution that doesn't require manual intervention. Have you found a solution for this?

amybachir · Jan 24 '22 15:01

@amybachir Enforce GPU RAM limitations on your various pods; that is the only way we were able to work around the issue. You can still get random errors, though.

PyTorch uses a ring arena for its allocations, so be sure to empty the cache regularly (torch.cuda.empty_cache, iirc) when using multiple pods, otherwise the arena will almost always outgrow what you expect it should use.
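
A minimal sketch of that pattern, assuming model and batches are whatever your pods already run (names here are placeholders):

import torch

def run_inference(model, batches, empty_every=10):
    """Run batches on the GPU, periodically releasing cached memory."""
    results = []
    with torch.no_grad():
        for i, batch in enumerate(batches):
            out = model(batch.cuda())
            results.append(out.cpu())  # move results off the GPU promptly
            del out
            if (i + 1) % empty_every == 0:
                # Hand unused cached blocks back to the driver so the other
                # pods sharing the card can allocate them.
                torch.cuda.empty_cache()
    return results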

Narsil · Jan 24 '22 16:01