mmdetection3d
Dataloader blocked when using distributed training
Hi, I ran into an issue when training in distributed mode.
I added a custom pipeline module that loads all the maps from disk (stored as images) at once, so the module consumes a lot of memory. After launching distributed training, I found that the training process blocked at dataloader iteration until the program crashed.
If I set workers_per_gpu=1 (the default is 6), training launches correctly, although it runs slowly.
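Concretely, this is the only change in my data config that makes training start (the other fields and values are omitted here):

```python
data = dict(
    # ... other fields unchanged ...
    workers_per_gpu=1,  # training hangs with the default of 6
)
```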
I would like to understand the cause of the problem, and how to fix it other than setting workers_per_gpu=1.
It's hard to say exactly what the problem is from my side. It may be related to the I/O or CPU settings of your machine; for example, the machine may not be able to handle such heavy data loading with the current configuration.
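One thing worth noting: each dataloader worker is a separate process holding its own copy of the dataset and its pipeline, so a transform that eagerly loads every map pays that memory cost once per worker. With workers_per_gpu=6 the footprint is roughly six times what you measured for a single process, which can exhaust RAM and stall or kill the workers. If the maps can be read one file at a time, a lazier transform along these lines may help; this is only a sketch, and the class name, the `map_path` key, and the cache size are illustrative, not part of mmdetection3d:

```python
from functools import lru_cache

import mmcv


@lru_cache(maxsize=32)  # keep only a handful of maps in memory per worker
def _load_map(map_path):
    return mmcv.imread(map_path)


class LoadMapOnDemand:
    """Illustrative pipeline transform that loads the current sample's map
    from disk when it is needed, instead of holding every map in memory."""

    def __call__(self, results):
        # `map_path` is an assumed key filled in by an earlier pipeline step.
        results['map'] = _load_map(results['map_path'])
        return results
```

With something like this, memory no longer scales with the full map set times workers_per_gpu, so you may be able to keep a larger workers_per_gpu for throughput.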