CogVideo KeyError: 'shadow'

System Info

Linux

Information

[X] The official example scripts
[ ] My own modified scripts

Reproduction

Full-parameter fine-tuning on 2 H800 GPUs:

torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM

Expected behavior

Training should run without raising KeyError: 'shadow'.

Aug 27 '24 08:08 SpringtoString

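If you are using the 2B model, you should fine-tune in FP16. Also, I have not run into this error myself. Does it work correctly on a single GPU?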

Aug 27 '24 12:08 zRzRzRzRzRzRzR

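zRzRzRzRzRzRzR wrote: "If you are using the 2B model, you should fine-tune in FP16. Also, I have not run into this error myself. Does it work correctly on a single GPU?"

Have you tried SFT fine-tuning on a single GPU? On my side it runs out of GPU memory.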

Aug 29 '24 02:08 SpringtoString

Hi. I ran into the same error. Is there a solution?

bit16_partitions[partition_id].data.copy_(state['shadow'].data)
KeyError: 'shadow'

Sep 10 '24 19:09 hw-liang

Same issue on an A100 80GB when fine-tuning with newly added parameters.

Sep 12 '24 21:09 TianxingWu

We recommend using the fine-tuning code from the diffusers version, which we will release in early October. This issue will be closed because it cannot be reproduced.

Sep 27 '24 11:09 zRzRzRzRzRzRzR

zRzRzRzRzRzRzR wrote: "We recommend using the fine-tuning code provided by the diffusers version, which we will release in early October. This issue will be closed as it cannot be reproduced"

I ran into this problem too. I would suggest sticking with the recommended bf16 or fp16.

Jan 03 '25 03:01 Xushuolin
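
For anyone weighing bf16 against fp16 here, the practical difference is dynamic range. The snippet below is only an illustrative sketch using the Python standard library (the helper names are made up for this post and are not part of CogVideo or its training stack). It decodes the largest finite bit pattern of each format to show how much headroom each one has.

```python
import struct

def fp16_bits_to_float(bits: int) -> float:
    """Decode a 16-bit pattern as IEEE 754 half precision (fp16)."""
    return struct.unpack(">e", struct.pack(">H", bits))[0]

def bf16_bits_to_float(bits: int) -> float:
    """Decode a 16-bit pattern as bfloat16.

    bfloat16 is the upper 16 bits of a float32, so padding with
    16 zero bits gives the equivalent float32 value.
    """
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]

# Largest finite values: all exponent bits set except the lowest, mantissa all ones.
FP16_MAX = fp16_bits_to_float(0x7BFF)
BF16_MAX = bf16_bits_to_float(0x7F7F)

print(f"fp16 max: {FP16_MAX}")
print(f"bf16 max: {BF16_MAX:.4e}")
```

fp16 tops out at 65504, while bf16 reaches roughly 3.39e38, so scaled gradients that overflow in fp16 usually stay finite in bf16, at the cost of fewer mantissa bits.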

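I solved this problem. fp16 can only represent values up to roughly 65,504, so the first ten or so training iterations all overflow and are skipped without a parameter update, which means the 'shadow' key never gets written. You just need to increase save_interval so that the first save happens after the overflow period.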

Mar 07 '25 08:03 Tacossp
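
To make the explanation above concrete, here is a toy sketch of dynamic loss scaling in plain Python. It is not the actual training code: `to_fp16`, `simulate`, `raw_grad`, and `init_scale` are hypothetical names, and the assumption that the 'shadow' entry is created on the first non-skipped step is taken from the comment above rather than verified against the optimizer source.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round x to half precision; return +/-inf if it does not fit."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def simulate(num_steps: int, raw_grad: float = 4.0, init_scale: float = 2.0**20):
    """Toy loss scaler: on overflow, halve the scale and skip the update."""
    scale = init_scale
    state = {}      # stands in for the per-parameter optimizer state
    skipped = 0
    for _ in range(num_steps):
        scaled_grad = to_fp16(raw_grad * scale)
        if math.isinf(scaled_grad):
            scale /= 2          # back off and try again next iteration
            skipped += 1
            continue            # no update, so no state is written
        # Assumption: the key only appears after the first real update.
        state["shadow"] = scaled_grad / scale
    return state, skipped

# Pretend a checkpoint is written after `save_interval` iterations.
for save_interval in (5, 10):
    state, skipped = simulate(save_interval)
    print(save_interval, skipped, "shadow" in state)
```

With these made-up numbers the first 7 iterations overflow, so a save at iteration 5 would find no 'shadow' entry, while a save at iteration 10 would. That matches the suggested workaround of raising save_interval past the initial overflow steps.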
