How can I do full training of the 14B T2V model on an 80GB H100 GPU? I hit an OOM error with the following command:
```shell
python examples/wanvideo/train_wan_t2v.py \
  --task train \
  --train_architecture full \
  --dataset_path xxx \
  --output_path ./models_results \
  --dit_path "models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00001-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00002-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00003-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00004-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00005-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00006-of-00006.safetensors" \
  --steps_per_epoch 500 \
  --max_epochs 10 \
  --learning_rate 1e-4 \
  --accumulate_grad_batches 1 \
  --use_gradient_checkpointing
```
Can anyone help me, please?
I use `--use_gradient_checkpointing --use_gradient_checkpointing_offload --training_strategy "deepspeed_stage_2"` and can run a full fine-tune.
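Put together, this reply corresponds to running the original command with two extra memory-saving options and the DeepSpeed strategy appended (a sketch assembled from this thread, not independently verified; the flag spellings are taken verbatim from the reply above):

```shell
# Same command as in the question, with activation offloading and
# DeepSpeed ZeRO stage 2 (shards optimizer state and gradients) enabled:
python examples/wanvideo/train_wan_t2v.py \
  --task train \
  --train_architecture full \
  --dataset_path xxx \
  --output_path ./models_results \
  --dit_path "models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00001-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00002-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00003-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00004-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00005-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00006-of-00006.safetensors" \
  --steps_per_epoch 500 \
  --max_epochs 10 \
  --learning_rate 1e-4 \
  --accumulate_grad_batches 1 \
  --use_gradient_checkpointing \
  --use_gradient_checkpointing_offload \
  --training_strategy "deepspeed_stage_2"
```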
How do I run inference afterwards? For example, how do I use lightning_logs/version_27/checkpoints/epoch=0-step=22.ckpt/zero_to_fp32.py, and how do I load the converted checkpoint?
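In case it helps: zero_to_fp32.py is the helper script DeepSpeed writes into every ZeRO checkpoint directory, and its standard usage is `python zero_to_fp32.py <checkpoint_dir> <output_file>`. A minimal sketch (the output filename pytorch_model.bin is my choice, not mandated; newer DeepSpeed versions may differ slightly):

```shell
cd lightning_logs/version_27/checkpoints/epoch=0-step=22.ckpt
# Merge the ZeRO-sharded training states in this directory into a single
# consolidated fp32 state dict:
python zero_to_fp32.py . pytorch_model.bin
```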
I tried this, but I got `RuntimeError: PytorchStreamReader failed reading zip archive: not a ZIP archive`. Did you run into this issue, and did you manage to solve it?
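For context, torch.load raises exactly that "not a ZIP archive" error when it is pointed at the raw ZeRO shards (or at the `.ckpt` directory itself) rather than at a consolidated file, so the zero_to_fp32.py step above has to run first. Below is a hypothetical loading sketch, assuming DiffSynth-Studio's `ModelManager`/`WanVideoPipeline` API and the stock Wan2.1-T2V-14B text-encoder/VAE filenames; the converted-weight path is the one chosen above, and the consolidated state dict may still need its Lightning key prefixes stripped:

```python
import torch
from diffsynth import ModelManager, WanVideoPipeline, save_video

# Load the consolidated file produced by zero_to_fp32.py -- not the .ckpt
# directory, whose sharded contents trigger the "not a ZIP archive" error.
model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cuda")
model_manager.load_models([
    # fine-tuned DiT weights (filename chosen during conversion above)
    "lightning_logs/version_27/checkpoints/epoch=0-step=22.ckpt/pytorch_model.bin",
    # unchanged text encoder and VAE from the base model
    "models/Wan-AI/Wan2.1-T2V-14B/models_t5_umt5-xxl-enc-bf16.pth",
    "models/Wan-AI/Wan2.1-T2V-14B/Wan2.1_VAE.pth",
])
pipe = WanVideoPipeline.from_model_manager(model_manager, device="cuda")

video = pipe(prompt="a cat walking on the grass", num_inference_steps=50, seed=0)
save_video(video, "output.mp4", fps=15)
```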
> I use `--use_gradient_checkpointing --use_gradient_checkpointing_offload --training_strategy "deepspeed_stage_2"` and can run a full fine-tune.
Hello, I also implemented LoRA and full training based on deepspeed_stage_2, but I found that batch_size can only be set to 1. Have you run into a similar problem and solved it? Also, if possible, could we keep in touch regularly about Wanx fine-tuning?
> I use `--use_gradient_checkpointing --use_gradient_checkpointing_offload --training_strategy "deepspeed_stage_2"` and can run a full fine-tune.
For multi-GPU training, do we need to launch with accelerate/deepspeed?