Problems in multi-card distributed training

Open nowbug opened this issue 1 year ago • 5 comments

Distributed training hangs (blocks).

Steps to Reproduce

1. Minimal code example

from otx.engine import Engine

engine = Engine(model="yolox_s", data_root="pwd")
engine.train(num_nodes=2)

2. I also tried other code to rule out a problem with my environment.

import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel

ngpus = 2
model = BoringModel()
trainer = L.Trainer(max_epochs=10, devices=ngpus)

trainer.fit(model)

log:
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
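
Since the log shows the NCCL backend, one way to get more detail out of a run that hangs is to raise the logging level through environment variables before launching. The sketch below is only a suggestion, not something tried in this thread: `build_debug_env` is a hypothetical helper, while `NCCL_DEBUG`, `TORCH_DISTRIBUTED_DEBUG` and `NCCL_P2P_DISABLE` are existing NCCL/PyTorch settings. Disabling peer-to-peer transfers is just an experiment some people try on multi-GPU desktops; whether it matters here is an assumption.

```python
import os


def build_debug_env(base=None, disable_p2p=False):
    """Return a copy of the environment with verbose distributed logging.

    Hypothetical helper for illustration; it only sets environment
    variables and does not launch anything itself.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"                 # verbose NCCL logging
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra torch.distributed checks
    if disable_p2p:
        # Experiment only: turn off direct GPU-to-GPU (P2P) transfers.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


# Build the environment from an empty base so the result is easy to inspect.
env = build_debug_env(base={}, disable_p2p=True)
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when starting the training script, so that each rank prints its NCCL setup before the point where it blocks.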

Environment:

  • OS:
  • Framework version:
  • Python version: 3.10
  • OpenVINO version:
  • CUDA/cuDNN version: 12.2
  • GPU model and memory: 2 x 4090 (24 GB each)

nowbug avatar Jun 18 '24 07:06 nowbug

When I run it, it gets stuck at: Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
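
One thing that may be worth checking (an assumption, not confirmed in this thread): in Lightning, `num_nodes` counts machines while `devices` counts GPUs per machine, and the number of processes that must register is their product. The sketch below uses hypothetical helper names to show how a launch on a single machine with `num_nodes=2` and one device per node would leave rank 0 waiting for a second member that is never started, which would look like the "MEMBER: 1/2" line above.

```python
def expected_world_size(num_nodes: int, devices_per_node: int) -> int:
    """Total number of processes that must join before training starts."""
    return num_nodes * devices_per_node


def missing_ranks(world_size: int, started_ranks: list) -> list:
    """Global ranks that never registered, so the others keep waiting."""
    return sorted(set(range(world_size)) - set(started_ranks))


# Two "nodes" requested, one device each, but only one machine launched:
world = expected_world_size(num_nodes=2, devices_per_node=1)
waiting_for = missing_ranks(world, started_ranks=[0])

# One machine with two GPUs (num_nodes=1, devices=2): both ranks start locally.
local_world = expected_world_size(num_nodes=1, devices_per_node=2)
local_missing = missing_ranks(local_world, started_ranks=[0, 1])
```

In the first case rank 1 is missing and rank 0 blocks; in the second case nothing is missing, which matches the second snippet registering both processes with `devices=2`.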

nowbug avatar Jun 18 '24 07:06 nowbug

@eunwoosh could you take a look at this issue?

harimkang avatar Jun 19 '24 00:06 harimkang

I found some open issues related to this and commented on them.

  • https://github.com/Lightning-AI/torchmetrics/issues/2477
  • https://github.com/Lightning-AI/pytorch-lightning/issues/18803

harimkang avatar Jun 20 '24 05:06 harimkang

Hi @nowbug, thanks for reporting the issue. First of all, OTX 2.0 does not currently validate distributed training, so it can be somewhat unstable. Nevertheless, OTX is based on PyTorch Lightning, so I think distributed training works in most cases. OTX plans to support distributed training in the near future, so it should become stable soon. I tested your second code snippet and found a bug, as @harimkang said, so I opened a PR to fix it. I also found that distributed training gets stuck in some cases, and I suspect the dataset size is the cause. I'll fix that bug once I have investigated further.
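
To illustrate how dataset size could stall data-parallel training (an illustrative assumption, not the confirmed cause in OTX): if ranks end up with different numbers of batches, the rank with more batches waits in a gradient synchronization that the other ranks never join. The sketch below uses hypothetical helper names and a naive strided split without padding; PyTorch's `DistributedSampler` pads or drops samples to keep the per-rank counts equal.

```python
import math


def batches_per_rank(num_samples: int, world_size: int, batch_size: int) -> list:
    """Batches each rank would see under a naive strided split (no padding)."""
    counts = []
    for rank in range(world_size):
        # Rank r takes samples r, r + world_size, r + 2 * world_size, ...
        samples = len(range(rank, num_samples, world_size))
        counts.append(math.ceil(samples / batch_size))
    return counts


def is_balanced(counts: list) -> bool:
    """True when every rank runs the same number of steps."""
    return len(set(counts)) == 1


uneven = batches_per_rank(num_samples=9, world_size=2, batch_size=4)
even = batches_per_rank(num_samples=8, world_size=2, batch_size=4)
```

With 9 samples, rank 0 gets 5 samples (2 batches) and rank 1 gets 4 samples (1 batch), so rank 0 would attempt a second step alone; with 8 samples both ranks run a single step.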

eunwoosh avatar Jun 20 '24 06:06 eunwoosh

@eunwoosh Thank you for your response. I'm looking forward to the upcoming versions of OTX.

nowbug avatar Jun 20 '24 06:06 nowbug