training_extensions Problems in multi-card distributed training

Distributed training hangs

Steps to Reproduce

1. Minimal code block

from otx.engine import Engine

engine = Engine(model="yolox_s", data_root="pwd")
engine.train(num_nodes=2)

2. I tried other code to troubleshoot my environment.

import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel

ngpus = 2
model = BoringModel()
trainer = L.Trainer(max_epochs=10, devices=ngpus)

trainer.fit(model)

log:
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2

distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

Environment:

OS:
Framework version:
Python version: 3.10
OpenVINO version:
CUDA/cuDNN version: 12.2
GPU model and memory: 24G(4090)*2

Jun 18 '24 07:06 nowbug

When I run it, it gets stuck at this line:
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
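For anyone trying to narrow down a hang at this stage, one option is to rerun the script with verbose distributed logging and a timeout. The following is only a minimal sketch, not something taken from OTX or from this thread: the helper names build_debug_env and run_with_debug and the file name train_script.py are hypothetical, and it assumes the documented environment variables NCCL_DEBUG, NCCL_P2P_DISABLE and TORCH_DISTRIBUTED_DEBUG apply to the installed NCCL and PyTorch versions. Whether disabling peer-to-peer transfers helps on a given machine is also an assumption to test, not a confirmed fix.

```python
import os
import subprocess
import sys


def build_debug_env(base_env=None, disable_p2p=False):
    """Return a copy of the environment with verbose distributed logging enabled.

    The function name is hypothetical; the variables are the documented
    NCCL / PyTorch debugging switches.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"                # NCCL prints init and transport details
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra checks in torch.distributed
    if disable_p2p:
        # Assumption to test: forces NCCL to avoid GPU peer-to-peer transfers.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


def run_with_debug(script_path, disable_p2p=False, timeout_s=600):
    """Run a training script in a child process so a hang becomes a timeout.

    subprocess.run raises subprocess.TimeoutExpired if the script does not
    finish in time. Worker processes started by the script itself may
    outlive it and need to be cleaned up manually.
    """
    return subprocess.run(
        [sys.executable, script_path],
        env=build_debug_env(disable_p2p=disable_p2p),
        timeout=timeout_s,
        check=False,
    )
```

A possible use is run_with_debug("train_script.py") first, then run_with_debug("train_script.py", disable_p2p=True), comparing where the NCCL log output stops in each run.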

Jun 18 '24 07:06 nowbug

@eunwoosh could you take a look at this issue?

Jun 19 '24 00:06 harimkang

I found some open issues related to this and commented on them.

https://github.com/Lightning-AI/torchmetrics/issues/2477
https://github.com/Lightning-AI/pytorch-lightning/issues/18803

Jun 20 '24 05:06 harimkang

Hi @nowbug, thanks for reporting the issue. First of all, distributed training is not currently validated in OTX 2.0, so it may be somewhat unstable. Nevertheless, OTX is based on PyTorch Lightning, so I think distributed training should work in most cases. OTX plans to support distributed training in the near future, so it should become stable soon.

I tested your second code snippet and found a bug, as @harimkang said, so I opened a PR to fix it. I also found that distributed training gets stuck in some cases, and I suspect the dataset size is the cause. I'll fix that bug after investigating further.

Jun 20 '24 06:06 eunwoosh

@eunwoosh Thank you for your response. I'm looking forward to the upcoming versions of OTX.

Jun 20 '24 06:06 nowbug