[BUG]: DDP training in diffusion

Open · zhangvia opened this issue 1 year ago

🐛 Describe the bug

How can I use DDP training in diffusion? I saw train_ddp.yaml, but there is nothing different from train_colossalai.yaml. How do I set the number of GPUs and nodes, or the port of each node? Do you have any docs about this?

Environment

No response

zhangvia avatar Apr 19 '23 03:04 zhangvia

The two configurations are actually different. You may change settings from this line onwards. To run the code, you may execute python main.py --logdir /tmp/ --train --base configs/train_colossalai.yaml --ckpt 512-base-ema.ckpt as per our guide.
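
For reference, here is the same command as a shell snippet. If you want the plain DDP setup specifically, presumably you would point --base at the train_ddp.yaml config instead; the exact path below is an unverified guess:

```bash
# Colossal-AI strategy, as in the guide:
python main.py --logdir /tmp/ --train \
  --base configs/train_colossalai.yaml \
  --ckpt 512-base-ema.ckpt

# Presumed DDP variant: same command, different config file (path unverified):
python main.py --logdir /tmp/ --train \
  --base configs/train_ddp.yaml \
  --ckpt 512-base-ema.ckpt
```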

JThh avatar Apr 20 '23 10:04 JThh

Thanks for your question. You need to first refer to README.md to change the configurations. For instance, the number of devices in the YAML file represents the number of GPUs. For the training strategy, you may need to verify the strategy in main.py and then use the command above to run the whole thing.
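
As a rough sketch (not copied from the repo, and the exact key names may differ), the relevant section of the YAML usually mirrors PyTorch Lightning's Trainer arguments:

```yaml
# Hypothetical excerpt of train_colossalai.yaml; assumes the common
# `lightning.trainer` layout used by Stable Diffusion configs.
lightning:
  trainer:
    accelerator: gpu
    devices: 4      # number of GPUs to use on this node
    precision: 16
```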

NatalieC323 avatar Apr 26 '23 02:04 NatalieC323

I know that, but the parameters in the YAML only let me train the model on multiple GPUs on a single machine. What if I want to train it across different machines? How do I set the number of nodes, the number of devices per node, and the IP address and port of each node? I'd appreciate any advice.

zhangvia avatar Apr 26 '23 02:04 zhangvia

The Stable Diffusion example is built on PyTorch Lightning. For detailed usage, please refer to: https://lightning.ai/docs/pytorch/latest/api/lightning.pytorch.trainer.trainer.Trainer.html#lightning.pytorch.trainer.trainer.Trainer
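
To make the single-node vs. multi-node difference concrete, here is a hedged sketch: devices and num_nodes are standard Lightning Trainer arguments, but you should verify that the repo's config and its pinned Lightning version accept them exactly like this:

```yaml
# Hypothetical multi-node trainer section (verify against the repo's config).
lightning:
  trainer:
    accelerator: gpu
    devices: 8      # GPUs per node
    num_nodes: 2    # number of machines participating in training
    strategy: ddp   # or the colossalai strategy, depending on the config
```

Note that the node IP address and port are not Trainer arguments; Lightning normally reads them from the environment (or from your cluster launcher) at launch time.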

NatalieC323 avatar Apr 26 '23 05:04 NatalieC323

Thanks for your reply. So is ColossalAI a strategy for reducing training GPU memory on a single machine? Does it help with distributed training across multiple nodes?

zhangvia avatar Apr 26 '23 06:04 zhangvia

Hi @zhangvia Colossal-AI is designed for distributed training on multiple nodes, but some of our features are also applicable to single GPUs or single nodes.
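
As a sketch of what a manual two-node launch could look like (IP, port and paths are placeholders; Lightning's default cluster environment reads MASTER_ADDR, MASTER_PORT and NODE_RANK from the environment, but double-check this against the Lightning version used here):

```bash
# Node 0 (its IP serves as the rendezvous address):
MASTER_ADDR=10.0.0.1 MASTER_PORT=29500 NODE_RANK=0 \
  python main.py --logdir /tmp/ --train \
    --base configs/train_colossalai.yaml --ckpt 512-base-ema.ckpt

# Node 1 (same command, same master address/port, different NODE_RANK):
MASTER_ADDR=10.0.0.1 MASTER_PORT=29500 NODE_RANK=1 \
  python main.py --logdir /tmp/ --train \
    --base configs/train_colossalai.yaml --ckpt 512-base-ema.ckpt
```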

binmakeswell avatar Apr 27 '23 07:04 binmakeswell

So, when I use the Colossal-AI strategy in PyTorch Lightning, can I get all the features of Colossal-AI?

zhangvia avatar May 05 '23 07:05 zhangvia