UniAD
GPU memory is not released after training process is stopped
Hi, thank you for sharing your work ^^ I have the following problem:
The GPU memory is not released if I forcibly stop the training process (with Ctrl-C in the terminal).
Config:
- Ubuntu: 20.04 LTS
- GPU: GeForce RTX 3090 (24GB)
- CUDA: 11.1
- Nvidia driver: 470.239.06
Actions:
Step 1. Train stage 2 on the nuScenes v1.0-mini dataset:
./tools/uniad_dist_train.sh ./projects/configs/stage2_e2e/base_e2e.py 1
=> The training process runs normally
Step 2. Stop the training after a few iterations of the 1st epoch by using Ctrl-C in the terminal
Step 3. Re-run the training in step 1 => Out of memory ERROR!!!
I checked the GPU state with nvidia-smi a few minutes after the training had stopped (see the screenshot below): the GPU memory used by the training process in Step 1 had not been released (17323MiB / 24259MiB).
This issue can be easily solved by releasing the GPU memory manually, but I wonder whether it happens to everyone or only in my case (I couldn't find a similar issue reported in this repo). Also, why is the GPU memory not released even though the training has stopped? I would appreciate it if you could help me clarify this.
Best regards.
Hi @thunguyenth. To terminate the process in Linux, you can use the command pkill -9 python.
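For a more targeted cleanup than killing every Python process, here is a minimal sketch. GPU memory is owned by a process and is returned by the driver once that process exits, so memory that stays allocated after Ctrl-C usually means some process from the run is still alive. Which process survives in this particular setup is not confirmed in this thread. The helper names (parse_compute_apps, list_gpu_processes) are made up for illustration. The nvidia-smi query flags are standard, but the memory column can read [N/A] in some environments, which the parser maps to None.

```python
import subprocess

# Ask nvidia-smi for one CSV row per process that currently holds GPU memory.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse nvidia-smi CSV rows into (pid, process_name, used_mib) tuples."""
    apps = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        # Expect at least pid, name, memory; skip anything malformed.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        pid = int(fields[0])
        # Re-join the middle fields in case the process name contains a comma.
        name = ",".join(fields[1:-1])
        mem = fields[-1]
        used_mib = int(mem) if mem.isdigit() else None  # e.g. "[N/A]"
        apps.append((pid, name, used_mib))
    return apps


def list_gpu_processes():
    """Run nvidia-smi and return the processes still holding GPU memory."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

Calling list_gpu_processes() on the training machine returns the PIDs that still hold GPU memory. Each one can then be stopped individually with kill <pid> (or kill -9 <pid> if it ignores SIGTERM) instead of pkill -9 python, which also kills unrelated Python processes.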
Thank you, @YTEP-ZHI, for your reply ^^
I would appreciate it if someone could explain why the GPU memory is not released after the training is stopped forcibly.