Experiencing issues while running training tasks on dual 4090 GPUs.
I have two RTX 4090 48GB GPUs. After setting queue_length=4 and following the instructions, all training processes are forced onto the first GPU while the second GPU remains unused. This leads to training failures due to memory-allocation (out-of-memory) errors. How can I diagnose and resolve this issue?
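One thing that can be checked is whether each launched worker actually binds to its own device. A minimal diagnostic sketch, using only standard PyTorch calls and assuming the distributed launcher sets the LOCAL_RANK environment variable (nothing UniAD-specific):

import os
import torch

# Run this under the distributed launcher; each worker should report a different device.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
print(f"local_rank={local_rank}, visible GPUs={torch.cuda.device_count()}, "
      f"current device={torch.cuda.current_device()}")

If every worker prints device 0, the launcher flag or rank-to-device binding is the problem rather than the model itself.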
First, add --launcher pytorch in uniad_dist_train.sh. Then, in 'python3.8/site-packages/nuscenes/eval/detection/data_classes.py', change self.class_names = self.class_range.keys() to self.class_names = list(self.class_range.keys()).
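For reference, a minimal sketch of the second change, assuming the assignment lives in DetectionConfig.__init__ in that nuscenes-devkit file (only the affected line is shown):

# nuscenes/eval/detection/data_classes.py, inside DetectionConfig.__init__
# Before: a dict_keys view, which cannot be pickled when handed to worker processes
# self.class_names = self.class_range.keys()
# After: materialize the keys as a plain, picklable list
self.class_names = list(self.class_range.keys())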
Are you able to run eval on dual 4090 GPUs? When I run eval with two GPUs specified, it freezes after the detection task finishes. If you are able to get eval through on dual GPUs, could you please share your environment file?
[>>>>>>>>>>>>>>>>>>>>>>>>>>] 6020/6019, 3.1 task/s, elapsed: 1919s, ETA: 0s
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808032 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1264395 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 1264394) of binary: /home/jagdish/miniconda3/envs/uniad2.0/bin/python
Traceback (most recent call last):
File "/home/jagdish/miniconda3/envs/uniad2.0/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/jagdish/miniconda3/envs/uniad2.0/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/jagdish/miniconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/run.py", line 798, in
In train.py, add a longer timeout so the processes wait longer: 'init_dist(args.launcher, timeout=timedelta(seconds=36000), **cfg.dist_params)'. @jagdishbhanushali
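A sketch of that change in train.py, assuming mmcv's init_dist forwards extra keyword arguments to torch.distributed.init_process_group (which accepts a timeout); args and cfg come from the script's existing argument and config parsing:

from datetime import timedelta
from mmcv.runner import init_dist

# Raise the collective-operation timeout from the 30-minute default
# (which the NCCL watchdog above hit) to 10 hours.
init_dist(args.launcher, timeout=timedelta(seconds=36000), **cfg.dist_params)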
what about this (when running eval)?
[>>>>>>>>>>>>>>>>>>>>>>>>>>] 6020/6019, 5.3 task/s, elapsed: 1136s, ETA: 0s
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 94832 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 94831) of binary: /home/ubuntu/anaconda3/envs/uniad2.0/bin/python
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/uniad2.0/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/ubuntu/anaconda3/envs/uniad2.0/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/ubuntu/anaconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/run.py", line 798, in
main()
File "/home/ubuntu/anaconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/ubuntu/anaconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/ubuntu/anaconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/anaconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./tools/test.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2025-05-09_16:10:24
  host      : ubun
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 94831)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 94831
@jagdishbhanushali, have you solved your issue?
Thanks @Tonny24Wang, it worked. Hi @SpeMercurial, I replaced line 183 in test.py with the suggested block:
import datetime
init_dist(args.launcher, timeout=datetime.timedelta(seconds=36000), **cfg.dist_params)
Hi, I tried your solutions. My RAM and swap memory are fully occupied, but the training process has stalled. Do you have any solutions?
Hi @deepmeng, I don't see your GPU being used at all. Maybe your script is not using the GPU, or CUDA is not installed properly.
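A quick sanity check to rule out a broken CUDA install (plain PyTorch, nothing UniAD-specific):

import torch

# If these report False / 0, the training script will not see any GPU at all.
print("CUDA available:", torch.cuda.is_available())
print("GPU count     :", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device 0      :", torch.cuda.get_device_name(0))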
In fact, the training got stuck and only started after a very long wait. I suspect that after making the above two changes, the dataset loading time became extremely long and the memory usage very high. Do you have any solutions?
Additionally, if I directly evaluate the model performance with the unmodified source code from the repository, the two GPUs work well.
I did not adopt the method of modifying "nuscenes/eval/detection/data_classes.py", because it resulted in extremely high memory consumption and the dataset loading took an excessively long time. I used a different method instead.
Ultimately, I only made two modifications, and was then able to use both GPUs for training simultaneously:
(1) Add "--launcher pytorch" in "uniad_dist_train.sh". (2) Add "torch.multiprocessing.set_start_method('fork')" in "train.py" (see the sketch below).
@deepmeng How long does the full training (20 epochs - 6 + 14) take on two 48 GB GPUs?
Hi @deepmeng, I think I experienced quite a similar problem. I trained the Stage 2 (e2e) model with 3 GPUs (4090), 64 GB RAM, and 64 GB swap. At the beginning it took about 87 GB of memory (RAM plus swap) and kept increasing gradually. After one epoch (nuScenes), the memory filled up and caused an error.
I found out that reducing workers_per_gpu (the number of subprocesses used for data loading on each GPU) can help; a config sketch follows below.
I reduced it from 8 to 2, and surprisingly the training speed remained the same, while the memory usage was significantly lower than before.
I think one of the reasons is that the GPU processes batches more slowly than the data-loading workers produce them, so loaded batches waiting to be processed pile up in a queue, and this gradually eats up all the memory.
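For reference, a sketch of that change, assuming the mmcv-style data dict used in the UniAD experiment configs (the exact file and original values may differ in your copy):

# In the experiment config (file path varies; adjust to your setup):
data = dict(
    samples_per_gpu=1,   # example value; keep whatever your config already uses
    workers_per_gpu=2,   # reduced from 8: fewer prefetch workers, smaller backlog of pending batches
    # train=..., val=..., test=...  (keep the original dataset entries)
)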