AdvancedLiterateMachinery

How do I run distributed training with Omni?

Open zlf0307 opened this issue 7 months ago • 0 comments

I launched multi-GPU distributed training with the following command:

```
CUDA_VISIBLE_DEVICES=5,6 python -m torch.distributed.run \
    main.py \
    --data_root ./text_spotting_datasets/ \
    --output_folder ./output/pretrain/stage1/ \
    --train_dataset totaltext_train mlt_train ic13_train ic15_train syntext1_train syntext2_train \
    --lr 0.0005 \
    --max_steps 400000 \
    --warmup_steps 5000 \
    --checkpoint_freq 10000 \
    --batch_size 6 \
    --tfm_pre_norm \
    --train_max_size 768 \
    --rec_loss_weight 2 \
    --use_fpn \
    --use_char_window_prompt
```

However, only GPU 5 is actually training; GPU 6 shows no memory usage.
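A likely cause, assuming the repo's `main.py` follows standard `torchrun` conventions: `torch.distributed.run` spawns only one worker process per node by default (`--nproc_per_node` defaults to 1), so a single process grabs the first visible GPU and the second sits idle. A sketch of the corrected launch:

```shell
# Spawn one worker process per GPU (2 here) so both visible devices are used.
CUDA_VISIBLE_DEVICES=5,6 python -m torch.distributed.run \
    --nproc_per_node=2 \
    main.py \
    --data_root ./text_spotting_datasets/ \
    # ... remaining arguments unchanged from the command above ...
```

This only helps if `main.py` itself initializes the process group and pins each worker to its `LOCAL_RANK` device (e.g. wraps the model in `DistributedDataParallel`); whether it does is an assumption to verify in the repo's training code.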

zlf0307 · Apr 08 '25 06:04