T2I-Adapter

torch.nn.parallel.DistributedDataParallel hangs

Open Crd1140234468 opened this issue 1 year ago • 5 comments

I encountered a "torch.nn.parallel.DistributedDataParallel hangs" problem when I run train_depth.py. I found that the program never gets past the statement "dist._verify_model_across_ranks" (see the attached screenshot). How can I solve this problem?

Crd1140234468 avatar Sep 21 '23 08:09 Crd1140234468
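For context, here is a minimal sketch of the DDP setup that reaches the verification step mentioned above, assuming a launch that exports LOCAL_RANK (torchrun, or torch.distributed.launch with --use_env); ToyModel is a placeholder, not the T2I-Adapter model. The key point is that constructing DistributedDataParallel is a collective operation: if any rank never reaches it, or NCCL communication between the GPUs is broken, the other ranks block inside the internal cross-rank verification/broadcast step, which is exactly the kind of hang described here.

```python
# Minimal DDP sketch (illustrative only). Assumes LOCAL_RANK is set by the
# launcher; ToyModel stands in for the real model.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)

    def forward(self, x):
        return self.linear(x)


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Reads MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE from the
    # environment variables set by the launcher.
    dist.init_process_group(backend="nccl")

    model = ToyModel().cuda(local_rank)
    # Constructing DDP is a collective: every rank must reach this line,
    # otherwise the internal cross-rank verification/broadcast blocks forever.
    ddp_model = DDP(model, device_ids=[local_rank])

    x = torch.randn(4, 8, device=f"cuda:{local_rank}")
    ddp_model(x).sum().backward()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```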

This (dist._verify_model_across_ranks) is a function inside torch.

Crd1140234468 avatar Sep 21 '23 08:09 Crd1140234468

Also, here's the problem I'm having with multiple GPUs (see screenshot).

Crd1140234468 avatar Sep 21 '23 08:09 Crd1140234468

What's the command you run?

MC-E avatar Sep 22 '23 14:09 MC-E

What's the command you run?

CUDA_VISIBLE_DEVICES=1,3 python -m torch.distributed.launch --nproc_per_node=2 --master_port 8888 test11.py --bsize=8

Crd1140234468 avatar Sep 25 '23 05:09 Crd1140234468
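A quick way to narrow this down is to run a bare collective sanity check with the same launch settings. The sketch below is illustrative (ddp_check.py is a hypothetical file name): if this also hangs on GPUs 1 and 3, the problem is in the distributed environment (GPU topology, NCCL, master port) rather than in the T2I-Adapter training code. Setting NCCL_DEBUG=INFO when launching usually gives more detail about where NCCL gets stuck.

```python
# ddp_check.py -- hypothetical NCCL sanity check, assuming LOCAL_RANK is
# exported by the launcher (torchrun, or torch.distributed.launch --use_env).
import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)  # collective: hangs if the ranks cannot reach each other
    print(f"rank {dist.get_rank()}: all_reduce result = {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

It can be launched the same way as the training script, e.g. CUDA_VISIBLE_DEVICES=1,3 python -m torch.distributed.launch --use_env --nproc_per_node=2 --master_port 8888 ddp_check.py (--use_env makes the launcher export LOCAL_RANK instead of passing --local_rank as a script argument).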

What's the command you run?

Currently, model_ad can be wrapped in torch.nn.parallel.DistributedDataParallel, but the model loaded from sd-v1-4.ckpt cannot: wrapping it in torch.nn.parallel.DistributedDataParallel gets stuck.

Crd1140234468 avatar Sep 25 '23 05:09 Crd1140234468
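One possible workaround sketch for the asymmetry described above, assuming (as in the usual T2I-Adapter training setup) that the Stable Diffusion weights from sd-v1-4.ckpt stay frozen and only the adapter is trained: keep the frozen SD model outside DDP and wrap only the trainable adapter, so the large frozen model never has to go through DDP's construction-time broadcast. build_sd_model and build_adapter are placeholder names, not the repo's actual functions.

```python
# Hypothetical sketch: wrap only the trainable adapter in DDP and keep the
# frozen SD model as a plain module on each rank.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_models(local_rank, build_sd_model, build_adapter):
    # Frozen SD model: no gradients to synchronize, so it does not need DDP.
    sd_model = build_sd_model("models/sd-v1-4.ckpt").cuda(local_rank)
    sd_model.eval()
    for p in sd_model.parameters():
        p.requires_grad_(False)

    # Trainable adapter: the only part whose gradients must be synchronized.
    adapter = build_adapter().cuda(local_rank)
    adapter = DDP(adapter, device_ids=[local_rank])
    return sd_model, adapter
```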