GPU memory is not released after training process is stopped

Open thunguyenth opened this issue 10 months ago • 1 comments

Hi, thank you for sharing your work ^^ I have run into the following problem:

The GPU memory is not released if I forcibly stop the training process (by pressing Ctrl-C in the terminal).

Config:
- Ubuntu: 20.04 LTS
- GPU: GeForce RTX 3090 (24GB)
- CUDA: 11.1
- Nvidia driver: 470.239.06

Actions:

Step 1. Train stage 2 on the nuScenes dataset (v1.0-mini version)

./tools/uniad_dist_train.sh ./projects/configs/stage2_e2e/base_e2e.py 1

=> The training process is working normally

Step 2. Stop the training after a few iterations of the 1st epoch by using Ctrl-C in the terminal

Step 3. Re-run the training in step 1 => Out of memory ERROR!!!

I checked the GPU state with nvidia-smi a few minutes after the training had stopped (see the screenshot below). The GPU memory used by the training process from Step 1 had not been released (17323MiB / 24259MiB).

This can easily be worked around by releasing the GPU memory manually, but I have two questions. Does this happen to everyone, or only in my case? (I couldn't find a similar issue reported in this repo.) And why is the GPU memory not released even though the training has stopped? I would appreciate it if you could help me clarify this.

Best regards.

[Screenshot: nvidia-smi output]

thunguyenth, Apr 10 '24 07:04

Hi @thunguyenth. To terminate the process in Linux, you can use the command pkill -9 python.
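
Note that pkill -9 python sends SIGKILL to every process named python that you are allowed to signal, not only the training run. A narrower alternative is sketched below, assuming nvidia-smi is on the PATH and lists the leftover processes. The function names (parse_compute_apps, list_gpu_processes, terminate) are hypothetical helpers written for this illustration and are not part of UniAD.

```python
import os
import signal
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' CSV rows into a list of (pid, MiB) tuples.

    The memory field becomes None when nvidia-smi does not report a number.
    """
    apps = []
    for line in csv_text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 2 or not fields[0].isdigit():
            continue  # skip blank or unexpected lines
        pid = int(fields[0])
        mib = int(fields[1]) if fields[1].isdigit() else None
        apps.append((pid, mib))
    return apps


def list_gpu_processes():
    """Ask nvidia-smi which processes currently hold GPU memory."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,used_memory",
            "--format=csv,noheader,nounits",
        ],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_compute_apps(out)


def terminate(pids, sig=signal.SIGTERM):
    """Send sig to each PID, ignoring processes that have already exited."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass
```

A possible usage is to print list_gpu_processes(), check that the PIDs really belong to the stopped training run, and then call terminate() on them, falling back to signal.SIGKILL only if SIGTERM has no effect.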

YTEP-ZHI, Apr 28 '24 08:04

Thank you, @YTEP-ZHI, for your reply ^^

I would appreciate it if someone could explain why the GPU memory is not released after the training is stopped forcibly.
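
One possible explanation, which is an assumption and has not been confirmed in this thread, is that Ctrl-C stops the launcher while some worker or data-loading subprocesses keep running. GPU memory belongs to the process that allocated it and is freed only when that process exits, so any survivor keeps its allocation. The sketch below shows one way to make sure a whole process tree goes down together on POSIX systems. The helpers run_in_own_group and stop_group are hypothetical names and are not part of UniAD or PyTorch.

```python
import os
import signal
import subprocess


def run_in_own_group(cmd):
    """Start cmd as the leader of a new session and process group (POSIX only)."""
    return subprocess.Popen(cmd, start_new_session=True)


def stop_group(proc, grace_seconds=10.0):
    """Terminate proc and every process in its group; return proc's exit code."""
    pgid = proc.pid  # start_new_session=True makes the child its own group leader
    try:
        os.killpg(pgid, signal.SIGTERM)  # polite request to the whole group
    except ProcessLookupError:
        return proc.wait()  # group is already gone
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        pass
    try:
        os.killpg(pgid, signal.SIGKILL)  # force-kill anything still alive
    except ProcessLookupError:
        pass
    return proc.wait()


def main(cmd):
    """Run cmd and, on Ctrl-C, take down its entire process group."""
    proc = run_in_own_group(cmd)
    try:
        return proc.wait()
    except KeyboardInterrupt:
        return stop_group(proc)
```

For example, main(["./tools/uniad_dist_train.sh", "./projects/configs/stage2_e2e/base_e2e.py", "1"]) would wrap the training command from Step 1. Because the child runs in its own session, Ctrl-C reaches only the wrapper, which then signals the whole group.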

thunguyenth, May 13 '24 11:05