
Problem training on the large dataset using the DDPPlugin after caching

Open polcomicepute opened this issue 2 years ago • 1 comment

Hi, I've encountered a failure while attempting to train on the large dataset (19M) using the DDPPlugin after caching locally. Loading the complete dataset from the cache files takes more than 30 minutes, resulting in an error originating from torch/distributed/distributed_c10d.py:460:

INFO {/opt/conda/envs/nuplan/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:460} Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)

Timed out initializing process group in store-based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)

I measured the time spent on data loading (approximately 19 million cached samples) in the different sections of the extract_scenarios_from_cache function in scenario_builder.py, with the following results:

[timing screenshots omitted]
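A minimal sketch of how such per-section timings can be collected (the timed helper and the directory-listing line are illustrative placeholders, not the devkit's actual code):

```python
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def timed(label: str):
    # Print the wall-clock time spent inside one section of
    # extract_scenarios_from_cache.
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.1f} s")


# Hypothetical usage: timing the scan that builds candidate_scenario_dirs.
cache_path = Path("/path/to/cache")  # placeholder cache location
with timed("list candidate_scenario_dirs"):
    candidate_scenario_dirs = [p for p in cache_path.iterdir() if p.is_dir()]
```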

While the barrier is held and each rank waits for processing on all the other GPUs to finish, most of the time appears to be spent reading the candidate_scenario_dirs paths. This prolonged step exceeds the timeout and causes the preprocessing stage of training to fail.

I suspect that the sheer size of the dataset is the underlying cause. The problem could likely be resolved through the timeout argument introduced in DDPStrategy in PyTorch Lightning 2.0.7; however, the version currently used by nuPlan is PyTorch Lightning 1.4.9, which employs the DDPPlugin.
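If an upgrade were possible, I imagine the fix would look roughly like this (a sketch assuming PyTorch Lightning >= 2.0; the two-hour value is arbitrary):

```python
from datetime import timedelta

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Raise the process-group timeout above the 30-minute default so a slow
# cache scan does not trip the store-based barrier.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy=DDPStrategy(timeout=timedelta(hours=2)),
)
```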

Could you give me some advice? Thanks for your assistance, and for the excellent dataset!

polcomicepute avatar Aug 23 '23 06:08 polcomicepute

I worked around this by modifying the initialization code in the PyTorch Lightning library; my pytorch-lightning version is 1.3.8.

/usr/local/lib/python3.9/dist-packages/pytorch_lightning/plugins/training_type/ddp.py

  1. Add from datetime import timedelta

  2. Add timeout=timedelta(seconds=1800) to the torch_distrib.init_process_group call in the DDPPlugin class.

torch_distrib.init_process_group(..., world_size=world_size, timeout=timedelta(seconds=1800))
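For reference, the patched call site in init_ddp_connection would look roughly like this (a sketch only; the surrounding code differs slightly between 1.3.x and 1.4.x, and note that 1800 seconds matches the default 30-minute timeout, so an even larger value may be needed if the cache scan takes longer):

```python
from datetime import timedelta

import torch.distributed as torch_distrib


# Inside pytorch_lightning/plugins/training_type/ddp.py, DDPPlugin.init_ddp_connection:
def init_ddp_connection(self, global_rank, world_size):
    # MASTER_ADDR / MASTER_PORT environment setup omitted here.
    if not torch_distrib.is_initialized():
        torch_distrib.init_process_group(
            self.torch_distributed_backend,
            rank=global_rank,
            world_size=world_size,
            timeout=timedelta(seconds=1800),  # raise further if caching exceeds 30 minutes
        )
```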

han1222 avatar Feb 14 '25 01:02 han1222