Problems with multi-GPU distributed training
Distributed training blocks (hangs) during initialization when using multiple GPUs.
Steps to Reproduce
1. Minimal code block:

```python
from otx.engine import Engine

engine = Engine(model="yolox_s", data_root="pwd")
engine.train(num_nodes=2)
```
2. To rule out my environment, I also tried a plain PyTorch Lightning example:
```python
import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel

ngpus = 2
model = BoringModel()
trainer = L.Trainer(max_epochs=10, devices=ngpus)
trainer.fit(model)
```
Log:

```
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
```
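For reference, the same sanity check can be run as a standalone script with the accelerator and strategy made explicit. These are standard Lightning `Trainer` options (`accelerator="gpu"`, `strategy="ddp"`), not something from the original report; the `__main__` guard matters because Lightning's DDP strategy launches additional processes that re-import the script.

```python
# Standalone version of the sanity check above. Lightning's DDP strategy
# launches extra worker processes that re-import this module, so keeping
# the entry point under a __main__ guard avoids re-running training on import.
import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel


def main() -> None:
    ngpus = 2
    model = BoringModel()
    trainer = L.Trainer(
        max_epochs=10,
        accelerator="gpu",  # be explicit about the accelerator
        devices=ngpus,
        strategy="ddp",     # the multi-GPU strategy Lightning selects by default
    )
    trainer.fit(model)


if __name__ == "__main__":
    main()
```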
Environment:
- OS:
- Framework version:
- Python version: 3.10
- OpenVINO version:
- CUDA/cuDNN version: 12.2
- GPU model and memory: 2× NVIDIA GeForce RTX 4090 (24 GB each)
When I run it, the process hangs at: `Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2`
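While waiting for a fix, one way to get more signal on where the hang occurs is to enable NCCL and torch.distributed debug logging. The sketch below is a diagnostic aid, not an OTX-endorsed fix: `NCCL_DEBUG`, `TORCH_DISTRIBUTED_DEBUG`, and `NCCL_P2P_DISABLE` are real NCCL/PyTorch environment variables, and disabling P2P is a commonly tried workaround on consumer GPUs such as the RTX 4090, which generally do not support peer-to-peer transfers.

```python
# Diagnostic sketch (assumption: the hang is in NCCL rendezvous/collectives).
# Set these before any CUDA/distributed initialization to get verbose logs
# and to try a common workaround for consumer GPUs without P2P support.
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")                 # verbose NCCL logs
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra c10d checks
# Dual RTX 4090 setups sometimes hang on P2P transfers; disabling P2P is a
# workaround to test, not a guaranteed fix.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")

import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel

if __name__ == "__main__":
    trainer = L.Trainer(max_epochs=1, accelerator="gpu", devices=2)
    trainer.fit(BoringModel())
```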
@eunwoosh could you take a look at this issue?
I found some open issues related to this and commented on them.
- https://github.com/Lightning-AI/torchmetrics/issues/2477
- https://github.com/Lightning-AI/pytorch-lightning/issues/18803
Hi @nowbug, thanks for reporting the issue. First of all, OTX 2.0 does not currently validate distributed training, so it can be a little unstable. Nevertheless, OTX is built on PyTorch Lightning, so distributed training should work in most cases. OTX plans to support distributed training in the near future, so it should become stable soon. I tested your second code snippet and found a bug, as @harimkang said, so I opened a PR to fix it. I also found that distributed training gets stuck in some cases, and I suspect the number of dataset samples is the cause. I'll fix that bug after investigating further.
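For context on why the dataset size could matter, here is a minimal illustrative sketch; it is not OTX code, and the helper name is hypothetical. The general DDP failure mode is that if ranks end up with different numbers of batches, the rank that finishes early stops issuing collectives while the other blocks forever on a gradient all-reduce; `DistributedSampler` with `drop_last=True` is the usual way to keep per-rank batch counts equal.

```python
# Illustrative only (assumption): if the dataset splits unevenly across ranks,
# one rank runs fewer batches and the other blocks forever on an all-reduce.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def build_loader(rank: int, world_size: int, batch_size: int = 4) -> DataLoader:
    # Tiny toy dataset whose length (9) is not divisible by world_size (2).
    dataset = TensorDataset(torch.randn(9, 3), torch.zeros(9, dtype=torch.long))
    # drop_last=True makes every rank see the same number of samples/batches,
    # which avoids the mismatched-collective deadlock described above.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=True, drop_last=True
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```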
@eunwoosh Thank you for your response. I'm looking forward to the upcoming versions of OTX.