
How to run full training of the 14B T2V model on an 80GB H100 GPU?

huangjch526 opened this issue 8 months ago · 6 comments

I hit an OOM problem with this command:

python examples/wanvideo/train_wan_t2v.py \
  --task train \
  --train_architecture full \
  --dataset_path xxx \
  --output_path ./models_results \
  --dit_path "models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00001-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00002-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00003-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00004-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00005-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00006-of-00006.safetensors" \
  --steps_per_epoch 500 \
  --max_epochs 10 \
  --learning_rate 1e-4 \
  --accumulate_grad_batches 1 \
  --use_gradient_checkpointing
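(For context, rough memory arithmetic explains the OOM. Assuming bf16 weights and standard Adam: 14B params × 2 bytes ≈ 28 GB for the weights, roughly the same again for gradients, and Adam's fp32 master weights plus two moment buffers add about 12 bytes/param ≈ 168 GB. That is well over 200 GB before any activations, against a single 80 GB H100, so some combination of optimizer-state sharding and offloading is required; that is exactly what the flags in the workaround below provide.)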

huangjch526 · May 01 '25 06:05

Can anyone help me, please?

huangjch526 · May 01 '25 06:05

I use --use_gradient_checkpointing, --use_gradient_checkpointing_offload, and --training_strategy "deepspeed_stage_2", and can now run full fine-tuning.
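(For reference, the complete working command would presumably be the one from the top of the thread plus the two extra flags and the strategy option. A sketch, not verified here:)

```sh
python examples/wanvideo/train_wan_t2v.py \
  --task train \
  --train_architecture full \
  --dataset_path xxx \
  --output_path ./models_results \
  --dit_path "<comma-separated list of the six safetensors shards as above>" \
  --steps_per_epoch 500 \
  --max_epochs 10 \
  --learning_rate 1e-4 \
  --accumulate_grad_batches 1 \
  --use_gradient_checkpointing \
  --use_gradient_checkpointing_offload \
  --training_strategy "deepspeed_stage_2"
```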

huangjch526 · May 01 '25 09:05

How do I run inference?

For example, how do I use lightning_logs/version_27/checkpoints/epoch=0-step=22.ckpt/zero_to_fp32.py, and how do I load the converted checkpoint?
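(For reference: DeepSpeed copies zero_to_fp32.py into every saved checkpoint directory, and the standard workflow is to run it there to consolidate the ZeRO shards into a single fp32 state dict. Hedged sketch below; note that older DeepSpeed releases take an output *file* as the second argument while newer ones expect an output *directory*, and the "pipe.dit." key prefix is an assumption that should be checked against the actual keys.)

```sh
# zero_to_fp32.py lives inside the checkpoint directory itself.
cd lightning_logs/version_27/checkpoints/epoch=0-step=22.ckpt
# Consolidate the ZeRO shards into one fp32 checkpoint; on newer
# DeepSpeed versions pass an output directory instead of a file name.
python zero_to_fp32.py . consolidated.bin
```

```python
import torch

# Load the consolidated checkpoint on CPU and strip the training-time key
# prefix so the names match the DiT's parameters. "pipe.dit." is an
# assumption; run print(list(state_dict)[:5]) to find the real prefix.
state_dict = torch.load("consolidated.bin", map_location="cpu")
prefix = "pipe.dit."
state_dict = {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}
torch.save(state_dict, "dit_fp32.bin")
```

The resulting file can then presumably be loaded in place of the original DiT weights in the repo's usual Wan inference example.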

huangjch526 · May 01 '25 09:05

> How do I run inference? For example, how do I use lightning_logs/version_27/checkpoints/epoch=0-step=22.ckpt/zero_to_fp32.py, and how do I load the converted checkpoint?

I tried this, but I got `RuntimeError: PytorchStreamReader failed reading zip archive: not a ZIP archive`. Did you meet this issue and solve it?
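(That error usually means torch.load was pointed at something that is not a consolidated torch checkpoint, e.g. a raw shard or the plain-text `latest` tag file inside the DeepSpeed checkpoint directory, or a truncated output from zero_to_fp32.py. Since torch.save writes a zip archive by default, a quick check, with the file name being whatever zero_to_fp32.py produced, is:)

```python
import zipfile

path = "consolidated.bin"  # hypothetical output of zero_to_fp32.py
# torch.save's default format is a zip archive; False here means the file
# is truncated or was never a consolidated torch checkpoint.
print(zipfile.is_zipfile(path))
```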

TumCCC · May 11 '25 10:05

> I use --use_gradient_checkpointing, --use_gradient_checkpointing_offload, and --training_strategy "deepspeed_stage_2", and can now run full fine-tuning.

Hello, I have also implemented LoRA and full training based on deepspeed_stage_2, but I found that batch_size can only be set to 1. Have you encountered similar problems, and did you solve them? Also, if possible, could we keep in touch about Wan fine-tuning?
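(If the per-GPU batch size is pinned at 1 by activation memory, one common workaround, using the --accumulate_grad_batches flag already present in the training command above, is to raise the effective batch size through gradient accumulation. A sketch:)

```sh
# Effective batch size = num_gpus x per-GPU batch x accumulation steps,
# e.g. 8 x 1 x 4 = 32, with no increase in peak activation memory.
python examples/wanvideo/train_wan_t2v.py \
  --accumulate_grad_batches 4 \
  --use_gradient_checkpointing \
  --use_gradient_checkpointing_offload \
  --training_strategy "deepspeed_stage_2"
  # ...plus the dataset/model/output flags from the original command
```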

zfw-cv · May 16 '25 03:05

> I use --use_gradient_checkpointing, --use_gradient_checkpointing_offload, and --training_strategy "deepspeed_stage_2", and can now run full fine-tuning.

For multi-GPU training, do we need to launch with accelerate or the deepspeed launcher?
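(A hedged note: the --training_strategy "deepspeed_stage_2" value matches PyTorch Lightning's strategy naming, and Lightning's DeepSpeed strategy normally spawns one process per visible GPU by itself, so a plain python launch is usually enough; no accelerate or deepspeed launcher should be required. For example:)

```sh
# Lightning's DeepSpeed strategy spawns one rank per visible GPU on its own.
CUDA_VISIBLE_DEVICES=0,1,2,3 python examples/wanvideo/train_wan_t2v.py \
  --training_strategy "deepspeed_stage_2" \
  --use_gradient_checkpointing \
  --use_gradient_checkpointing_offload
  # ...plus the remaining flags from the original command
```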

feiyu12138 · May 21 '25 03:05