When training with DeepSpeed zero_stage 3, I hit a tensor shape mismatch error. How can I fix this?
The same project trains fine with DeepSpeed zero_stage 2 (no shape mismatch), but then runs out of GPU memory (OOM). My guess is that zero_stage 3's parameter partitioning changes the tensor shapes. Looking for a solution.
class CausalConv3d(nn.Conv3d) raises an error:
@Artiprocher looking forward to your answer!
Same problem here, waiting for a solution.
I ran into the same problem: it happens with DeepSpeed zero3, and with zero2 alone I can only fit 17 frames.
Any solution to this?
Same question here. On a 2x A6000 setup, DeepSpeed zero3 fails at DiffSynth-Studio/diffsynth/models/wan_video_vae.py line 58: x = torch.cat([cache_x, x], dim=2) with RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 640 but got size 160 for tensor number 1 in the list.
I ran into the same problem.
```
[rank0]:   File "/workspace/DiffSynth-Studio/diffsynth/models/wan_video_vae.py", line 48, in forward
[rank0]:     x = torch.cat([cache_x, x], dim=2)
[rank0]: RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 384 but got size 96 for tensor number 1 in the list.
```
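For what it's worth, the message just means the two tensors disagree in a dimension other than the concat dim. Here is a self-contained illustration of the same failure mode, using the 640/160 numbers from the A6000 report above (the full shapes are made up, and the channel-dim explanation is an assumption):

```python
import torch

# torch.cat requires every dimension except the concat dim to match.
# Under zero3 the cached feature and the incoming activation end up with
# different channel dims, which triggers exactly this error (assumed cause).
cache_x = torch.randn(1, 640, 2, 8, 8)  # cached features, full channel dim
x = torch.randn(1, 160, 4, 8, 8)        # incoming chunk, shrunken channel dim
torch.cat([cache_x, x], dim=2)
# RuntimeError: Sizes of tensors must match except in dimension 2.
# Expected size 640 but got size 160 for tensor number 1 in the list.
```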
Same problem here, waiting for a fix.
Following.
I ran into the same problem.
Same problem here: zero2 runs, but zero3 breaks.
The root cause is a conflict between the VAE's feat_cache and zero3. Unfortunately, feat_cache is created in many scattered places and goes through all sorts of shape operations, so patching it directly is very hard...
My workaround (not elegant):
After loading the model, move the VAE out into the enclosing WanTrainingModule and set pipe.vae to None; then pass model.pipe to accelerate.prepare instead of model. You also need to modify every PipeUnit that uses the VAE so the VAE is passed in as an argument, e.g. the sketch below:
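A minimal sketch of the idea (assuming an accelerate-based training loop; WanTrainingModule and PipeUnit come from the description above, while encode_video_unit and the exact signatures are hypothetical):

```python
import torch

class WanTrainingModule(torch.nn.Module):
    def __init__(self, pipe):
        super().__init__()
        self.pipe = pipe
        # Hold the VAE on the outer module so DeepSpeed never wraps it.
        self.vae = pipe.vae
        self.pipe.vae = None  # detach the VAE from the pipe before prepare()

model = WanTrainingModule(pipe)

# Prepare only the pipe (which no longer contains the VAE); zero3 then
# partitions the remaining submodules but leaves the VAE parameters whole.
model.pipe = accelerator.prepare(model.pipe)

# Any PipeUnit that touches the VAE now receives it explicitly, e.g.
# (hypothetical helper and signature):
#   latents = encode_video_unit(model.pipe, video, vae=model.vae)

# As noted below, the VAE must be moved to the right device manually.
model.vae.to(accelerator.device)
```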
With that, the VAE is successfully out of zero3!
Note: remember to move the VAE to the correct device manually.
Hi @MaxwellDing, thank you for sharing your solution. I followed it with zero_3 and use_gradient_checkpointing enabled, and encountered the following error. Do you have any insights to share? Thank you very much!
```
[rank0]:     raise CheckpointError(
[rank0]: torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for the following tensors have different metadata than during the forward pass.
[rank0]: tensor at position 13:
[rank0]: saved metadata: {'shape': torch.Size([1536]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 23:
[rank0]: saved metadata: {'shape': torch.Size([1536]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
... (the same saved-vs-recomputed mismatch repeats for tensors at positions 52, 62, 91, 101, 130, and 140; log truncated)
```
@zoezhou1999 Hi there, I haven't encountered that error. However, if the VAE is set to be trainable, my solution might not work.
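One unverified guess: the recomputed shape of torch.Size([0]) in that log suggests the zero3-partitioned parameters were no longer gathered when the checkpointed block was recomputed. If your setup uses PyTorch's reentrant activation checkpointing, switching to the non-reentrant mode might be worth a try; a sketch, assuming you can reach the checkpoint call:

```python
import torch.utils.checkpoint as cp

# Hypothetical wrapper: wherever the training code invokes checkpointing,
# pass use_reentrant=False. Non-reentrant mode is the one PyTorch
# recommends, and reentrant checkpointing is a common culprit behind
# metadata-mismatch CheckpointErrors under zero3 (assumption, not verified).
def checkpointed_forward(block, x):
    return cp.checkpoint(block, x, use_reentrant=False)
```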
So for full fine-tuning, why does TI2V only train the DiT in the middle? The text encoder and VAE don't seem to be trained at all. I don't quite understand this part.
Generally speaking, SFT trains only the DiT. For the detailed training procedure, take a look at the paper.
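In code, the usual setup looks roughly like this (an illustrative sketch of standard SFT practice, not DiffSynth-Studio's exact training code; pipe.dit, pipe.text_encoder, and pipe.vae are assumed attribute names):

```python
import torch

# Freeze the text encoder and VAE; only the DiT receives gradients.
pipe.text_encoder.requires_grad_(False)
pipe.vae.requires_grad_(False)
pipe.dit.requires_grad_(True)

# The optimizer therefore only ever updates DiT parameters.
optimizer = torch.optim.AdamW(pipe.dit.parameters(), lr=1e-5)
```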