CogVideo
KeyError: 'shadow'
System Info
Linux
Information
- [X] The official example scripts
- [ ] My own modified scripts
Reproduction
Full-parameter fine-tuning on 2 H800 GPUs:
torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM
Expected behavior
Training should run without raising KeyError: 'shadow'.
If you're using the 2B model, you should fine-tune in FP16. Also, I haven't run into this error myself. Does it work on a single GPU?
Have you tried SFT fine-tuning on a single GPU? On my side it runs out of GPU memory.
Hi. I met the same error. Is there any solution?
bit16_partitions[partition_id].data.copy_(state['shadow'].data) KeyError: 'shadow'
Same issue on an A100 80G when fine-tuning with newly added parameters.
We recommend using the fine-tuning code for the diffusers version, which we will release in early October. This issue will be closed since it cannot be reproduced.
I ran into this too. I'd suggest sticking with the recommended bf16 or fp16 settings.
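I solved this. fp16 can only represent values up to 65504 (just under 2^16 = 65536), so during the first dozen or so iterations the loss overflows and the parameter update is skipped. As a result, the 'shadow' key is never written. Just increase save_interval so that the first checkpoint is saved after the overflow iterations are over.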
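The overflow explanation above can be illustrated with a toy simulation. This is a minimal sketch, not the SAT or DeepSpeed code: `fits_fp16`, `ToyEMAOptimizer`, `load_ema`, and `train` are hypothetical names; the loss scaler here only halves on overflow (real dynamic loss scalers also grow the scale again after a run of clean steps); and the assumption that `state['shadow']` is created lazily on the first successful optimizer step is inferred from the traceback and the comment above, not from the library source.

```python
import struct
from typing import Optional


def fits_fp16(x: float) -> bool:
    """True if x packs into a finite IEEE 754 half-precision (fp16) value."""
    try:
        struct.pack("<e", x)  # 'e' is the struct format code for half precision
        return True
    except OverflowError:  # raised for magnitudes that round above 65504
        return False


class ToyEMAOptimizer:
    """Hypothetical optimizer that keeps EMA weights in state['shadow']."""

    def __init__(self, param: float, lr: float = 0.1, decay: float = 0.99):
        self.param = param
        self.lr = lr
        self.decay = decay
        self.state = {}  # assumption: 'shadow' is created by the first real step

    def step(self, grad: float) -> None:
        self.param -= self.lr * grad
        if "shadow" not in self.state:
            self.state["shadow"] = self.param
        else:
            self.state["shadow"] = (
                self.decay * self.state["shadow"] + (1 - self.decay) * self.param
            )

    def load_ema(self) -> float:
        # Like the failing line in the traceback, reads state['shadow'] unconditionally.
        return self.state["shadow"]


def train(num_steps: int, grad: float = 4.0, init_scale: float = 2.0 ** 16,
          save_at: Optional[int] = None):
    """Simulate fp16 dynamic loss scaling; return (optimizer, skipped_steps)."""
    opt = ToyEMAOptimizer(param=1.0)
    scale = init_scale
    skipped = 0
    for step in range(1, num_steps + 1):
        if fits_fp16(grad * scale):
            opt.step(grad)  # scaled gradient is finite: apply the unscaled update
        else:
            scale /= 2  # overflow: skip this update and halve the loss scale
            skipped += 1
        if step == save_at:
            opt.load_ema()  # KeyError: 'shadow' if no step has succeeded yet
    return opt, skipped
```

With these defaults the first three steps overflow (4 × 65536, 4 × 32768, and 4 × 16384 all exceed the fp16 range), so `train(10, save_at=3)` raises `KeyError: 'shadow'` while `train(10, save_at=10)` completes. That mirrors the workaround of raising save_interval until the first save lands after the overflow steps. In a real run, the number of skipped steps depends on the loss-scaler settings and the gradient magnitudes.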