mmdetection3d

Dataloader blocks during distributed training

Open • xiaoxiao42 opened this issue 2 years ago • 1 comment

Hi, I ran into an issue with distributed training.

I added a custom pipeline module that loads all maps from disk (stored as images) at once, so it uses a lot of memory. After launching distributed training, the process hangs while iterating over the dataloader until the program eventually crashes.
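To make the memory pressure concrete, here is a rough worst-case estimate. It assumes that every dataloader worker process ends up holding its own copy of the maps, which may not hold exactly on every setup. The GPU count and map size are hypothetical numbers, not values from this report.

```python
def estimate_worker_memory_gb(num_gpus, workers_per_gpu, maps_gb):
    """Worst-case host memory if each dataloader worker keeps its own copy of the maps."""
    # Distributed training runs one process per GPU, and each of those
    # processes starts `workers_per_gpu` dataloader workers.
    num_workers = num_gpus * workers_per_gpu
    return num_workers * maps_gb


# Hypothetical setup: 8 GPUs and 2 GB of decoded map images.
print(estimate_worker_memory_gb(8, 6, 2.0))  # 96.0
print(estimate_worker_memory_gb(8, 1, 2.0))  # 16.0
```

Under these assumptions, the footprint grows linearly with the number of workers, so a module that is affordable with one worker per GPU can exhaust host memory with six.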

If I set workers_per_gpu=1 (it is 6 by default), training starts correctly, although it runs slowly.
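For reference, this setting usually lives in the `data` dict of an MMDetection3D-style config. The fragment below is a minimal sketch; the `samples_per_gpu` value is an assumption and does not come from the original config.

```python
# Fragment of an MMDetection3D-style config file (Python syntax).
data = dict(
    samples_per_gpu=2,  # assumed batch size per GPU, not taken from this issue
    workers_per_gpu=1,  # dataloader worker processes started per GPU
)
```

Values between 1 and 6 could also be tried to see at which point the dataloader starts to stall.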

I would like to know what causes the problem and how to fix it other than by setting workers_per_gpu=1.

xiaoxiao42 • Jul 26 '22 02:07

It's hard to say exactly what the problem is from my side. It may be related to the IO/CPU configuration of your machine. For example, your machine may not be able to handle such heavy data loading with the current configuration.
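If memory is indeed the bottleneck, one possible way to lighten the load is to read maps on demand and keep only a few in memory per worker, instead of loading all of them up front. The sketch below is an assumption about how such a step could look, not the module from this issue: `LazyMapLoader`, `load_fn`, and the `map_path` key are hypothetical names, and registration with the pipeline registry is omitted.

```python
from collections import OrderedDict


class LazyMapLoader:
    """Hypothetical pipeline step that loads map images on demand."""

    def __init__(self, load_fn, max_cached=4):
        self.load_fn = load_fn        # callable that reads one map from disk
        self.max_cached = max_cached  # upper bound on maps kept per worker
        self._cache = OrderedDict()   # path -> map, ordered by recency of use

    def _get_map(self, path):
        if path in self._cache:
            self._cache.move_to_end(path)  # mark as most recently used
            return self._cache[path]
        map_image = self.load_fn(path)
        self._cache[path] = map_image
        if len(self._cache) > self.max_cached:
            self._cache.popitem(last=False)  # evict the least recently used map
        return map_image

    def __call__(self, results):
        # `map_path` is an assumed key in the results dict.
        results["map"] = self._get_map(results["map_path"])
        return results


# Usage with a stand-in loader that records which paths were read.
loads = []


def fake_load(path):
    loads.append(path)
    return f"pixels of {path}"


step = LazyMapLoader(fake_load, max_cached=2)
for p in ["a.png", "b.png", "a.png", "c.png", "b.png"]:
    step({"map_path": p})
```

The cache size trades disk reads for memory: a smaller `max_cached` keeps each worker light but re-reads maps more often.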

Tai-Wang • Aug 03 '22 06:08