
Experiencing issues while running training tasks on dual 4090 GPUs.

Open ktyang512 opened this issue 8 months ago • 11 comments

I have two RTX 4090 48GB GPUs. After setting queue_length=4 and following the instructions, all training processes are placed on the first GPU while the second GPU remains unused, so training fails with memory-allocation errors. How can I diagnose and resolve this issue?
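A quick way to check whether both cards are even visible to the training process (a minimal diagnostic sketch, assuming a standard PyTorch environment; not specific to UniAD):

import torch

# Both 4090s should be listed here before launching distributed training; if only
# one appears, check CUDA_VISIBLE_DEVICES and the driver installation.
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))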

ktyang512 avatar Apr 07 '25 12:04 ktyang512

First, add --launcher pytorch in uniad_dist_train.sh. Then, in python3.8/site-packages/nuscenes/eval/detection/data_classes.py, change self.class_names = self.class_range.keys() to self.class_names = list(self.class_range.keys()).
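For reference, a sketch of the second change; in the nuScenes devkit this line sits, as far as I can tell, inside DetectionConfig.__init__ (verify against your installed devkit version):

# nuscenes/eval/detection/data_classes.py
# Before: self.class_names = self.class_range.keys()
#   (a dict_keys view; dict_keys objects cannot be pickled, which can break
#    spawned dataloader/eval worker processes)
# After: materialize it as a plain list.
self.class_names = list(self.class_range.keys())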

XYunaaa avatar May 04 '25 06:05 XYunaaa

Are you able to run eval on dual 4090 GPUs? When I run eval with both GPUs specified, it freezes after the detection task finishes. If you are able to run eval to completion on dual GPUs, could you please share your environment file?

[>>>>>>>>>>>>>>>>>>>>>>>>>>] 6020/6019, 3.1 task/s, elapsed: 1919s, ETA: 0s

[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808032 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1264395 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 1264394) of binary: /home/jagdish/miniconda3/envs/uniad2.0/bin/python
Traceback (most recent call last):
  File "/home/jagdish/miniconda3/envs/uniad2.0/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jagdish/miniconda3/envs/uniad2.0/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/jagdish/miniconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/home/jagdish/miniconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/jagdish/miniconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/jagdish/miniconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/jagdish/miniconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jagdish/miniconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

jagdishbhanushali avatar May 07 '25 19:05 jagdishbhanushali

In train.py, add a timeout so the process waits longer: init_dist(args.launcher, timeout=timedelta(seconds=36000), **cfg.dist_params). @jagdishbhanushali
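For completeness, a minimal sketch of that change in tools/train.py, assuming init_dist comes from mmcv.runner as in mmdet-style repos (extra kwargs are forwarded to torch.distributed.init_process_group):

from datetime import timedelta

from mmcv.runner import init_dist  # import used by mmdet-style training scripts

# args and cfg come from the surrounding script. 36000 s = 10 h, well above the
# default 30-minute collective timeout enforced by the NCCL watchdog.
init_dist(args.launcher, timeout=timedelta(seconds=36000), **cfg.dist_params)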

Tonny24Wang avatar May 09 '25 03:05 Tonny24Wang

What about this (when running eval)?

[>>>>>>>>>>>>>>>>>>>>>>>>>>] 6020/6019, 5.3 task/s, elapsed: 1136s, ETA: 0s
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 94832 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 94831) of binary: /home/ubuntu/anaconda3/envs/uniad2.0/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/uniad2.0/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ubuntu/anaconda3/envs/uniad2.0/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/home/ubuntu/anaconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ubuntu/anaconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ubuntu/anaconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/anaconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./tools/test.py FAILED

Failures: <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time       : 2025-05-09_16:10:24
  host       : ubun
  rank       : 0 (local_rank: 0)
  exitcode   : -11 (pid: 94831)
  error_file : <N/A>
  traceback  : Signal 11 (SIGSEGV) received by PID 94831

SpeMercurial avatar May 09 '25 08:05 SpeMercurial

Are you able to run eval on dual 4090 GPUs? When I run eval with both GPUs specified, it freezes after the detection task finishes. If you are able to run eval to completion on dual GPUs, could you please share your environment file?


have you solved your issue?

SpeMercurial avatar May 10 '25 10:05 SpeMercurial

In train.py, add a timeout so the process waits longer: init_dist(args.launcher, timeout=timedelta(seconds=36000), **cfg.dist_params). @jagdishbhanushali

Thanks @Tonny24Wang, it worked. Hi @SpeMercurial, I replaced line 183 in tools/test.py with the suggested block:

# tools/test.py, around line 183: initialize the process group with a longer timeout
import datetime
init_dist(args.launcher, timeout=datetime.timedelta(seconds=36000), **cfg.dist_params)

jagdishbhanushali avatar May 10 '25 20:05 jagdishbhanushali

Hi, I tried your solutions. My RAM and swap memory are fully occupied, but the training process has stalled. Do you have any solutions?

(two screenshots attached)

deepmeng avatar Jul 21 '25 10:07 deepmeng

Hi @deepmeng, I don't see your GPU being used. Maybe your script is not using the GPU, or CUDA is not installed properly.

jagdishbhanushali avatar Jul 21 '25 17:07 jagdishbhanushali

Hi @deepmeng, I don't see your GPU being used. Maybe your script is not using the GPU, or CUDA is not installed properly.

In fact, the training was stuck and only started after a very long time. I suspect that after making the above two changes, dataset loading became extremely slow and memory usage very high. Do you have any solutions?

Additionally, when I directly evaluate the model with the unmodified source code from the repository, both GPUs work fine.

deepmeng avatar Jul 22 '25 03:07 deepmeng

I did not adopt the modification to nuscenes/eval/detection/data_classes.py because it resulted in extremely high memory consumption and made dataset loading take an excessively long time. I referred to this method instead.

Ultimately I made only two modifications, and then I was able to use both GPUs for training simultaneously:

(1) Add --launcher pytorch in uniad_dist_train.sh.
(2) Add torch.multiprocessing.set_start_method('fork') in train.py (see the sketch below).
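A minimal sketch of the second change (the placement in tools/train.py is my assumption; it just needs to run early, before any dataloader workers are spawned):

# tools/train.py, near the top of main() (placement is an assumption)
import torch.multiprocessing

# With 'fork', dataloader worker processes inherit the parent's memory via
# copy-on-write instead of re-importing and re-loading data under 'spawn'.
# Note: raises RuntimeError if a start method has already been set.
torch.multiprocessing.set_start_method('fork')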

deepmeng avatar Jul 23 '25 14:07 deepmeng

@deepmeng How long does the full training (20 epochs: 6 + 14) take on two 48 GB GPUs?

ssuralcmu avatar Sep 02 '25 06:09 ssuralcmu

Hi @deepmeng, I think I experienced quite a similar problem. I trained the Stage 2 (e2e) model with 3 GPUs (4090s), 64 GB RAM, and 64 GB swap. At the beginning it used about 87 GB of memory (RAM plus swap) and kept increasing gradually. After one epoch (nuScenes), memory filled up and this caused an error.

I found out that reducing workers_per_gpu (how many subprocesses each GPU uses for data loading) can help. I reduced it from 8 to 2, and surprisingly the training speed stayed the same while memory usage was significantly lower than before. I think one reason is that the GPUs process data much more slowly than the data loader produces it, so batches waiting to be processed pile up in a queue and gradually eat up all the memory.
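For reference, workers_per_gpu lives in the mmcv/mmdet-style data config that UniAD uses; a sketch of the change (the surrounding values are illustrative, not copied from the repo):

# In the UniAD experiment config (mmcv-style); only workers_per_gpu is the point here.
data = dict(
    samples_per_gpu=1,   # illustrative batch size per GPU
    workers_per_gpu=2,   # reduced from 8: fewer prefetch subprocesses, far less host RAM held in the queue
)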

thunguyenth avatar Nov 18 '25 11:11 thunguyenth