CogVideo

KeyError: 'shadow'

Open SpringtoString opened this issue 1 year ago • 4 comments

System Info

linux

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Reproduction

Full-parameter fine-tuning on 2 H800 GPUs

torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM


Expected behavior

Training should run without raising KeyError: 'shadow'.

SpringtoString avatar Aug 27 '24 08:08 SpringtoString

If you are using the 2B model, you should fine-tune with FP16. Also, I have not run into this error myself. Does it work on a single GPU?

zRzRzRzRzRzRzR avatar Aug 27 '24 12:08 zRzRzRzRzRzRzR

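> If you are using the 2B model, you should fine-tune with FP16. Also, I have not run into this error myself. Does it work on a single GPU?

Have you tried SFT fine-tuning on a single GPU? On my side it runs out of GPU memory.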

SpringtoString avatar Aug 29 '24 02:08 SpringtoString

Hi. I met the same error. Is there any solution?

bit16_partitions[partition_id].data.copy_(state['shadow'].data)
KeyError: 'shadow'

hw-liang avatar Sep 10 '24 19:09 hw-liang

Same issue on A100 80G when tuning with new parameters added.

TianxingWu avatar Sep 12 '24 21:09 TianxingWu

We recommend using the fine-tuning code provided by the diffusers version, which we will release in early October. This issue will be closed as it cannot be reproduced.

zRzRzRzRzRzRzR avatar Sep 27 '24 11:09 zRzRzRzRzRzRzR

> We recommend using the fine-tuning code provided by the diffusers version, which we will release in early October. This issue will be closed as it cannot be reproduced.

I ran into this problem as well. I would still suggest using the recommended bf16 or fp16.

Xushuolin avatar Jan 03 '25 03:01 Xushuolin

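I solved this problem. fp16 can only represent values up to about 65504 (anything from 65536 upward overflows), so the first dozen or so training iterations all overflow and their updates are skipped, which means the 'shadow' key is never written. You just need to increase save_interval so that the first save happens after the overflow iterations have passed.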

Tacossp avatar Mar 07 '25 08:03 Tacossp
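
The explanation in the last comment can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the actual DeepSpeed or CogVideo code: `ToyOptimizer`, `fits_in_fp16`, `first_safe_save_step`, the starting loss scale of 2**32, and the "halve the scale on overflow" rule are all hypothetical stand-ins chosen for illustration. It shows why a state key that is only created on the first successful step is still missing if a checkpoint is saved while every step is overflowing.

```python
import struct


def fits_in_fp16(x: float) -> bool:
    """Return True if x can be stored as a finite IEEE half-precision value."""
    try:
        struct.pack("e", x)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


class ToyOptimizer:
    """Hypothetical optimizer whose state is created lazily on the first real step."""

    def __init__(self):
        self.state = {}  # stays empty until a step succeeds

    def step(self, grad: float, loss_scale: float) -> bool:
        if not fits_in_fp16(grad * loss_scale):
            return False  # overflow: the update is skipped, state is untouched
        self.state["shadow"] = grad  # stand-in for the real shadow weights
        return True


def first_safe_save_step(raw_grad: float, loss_scale: float, max_steps: int = 100):
    """Return (iteration, loss_scale) of the first step that writes 'shadow'."""
    opt = ToyOptimizer()
    for it in range(1, max_steps + 1):
        if opt.step(raw_grad, loss_scale):
            return it, loss_scale
        loss_scale /= 2  # assumed dynamic loss scaling: halve after an overflow
    return None, loss_scale


if __name__ == "__main__":
    step, scale = first_safe_save_step(raw_grad=1.0, loss_scale=2.0 ** 32)
    print(f"'shadow' first exists after iteration {step} (loss scale {scale})")
```

In this toy setup, a save triggered before the reported iteration would find no 'shadow' entry, which mirrors the suggestion to pick a save_interval larger than the number of initial overflow iterations seen in your own training log.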