
unet.down_blocks does not seem to be updating

Open czk32611 opened this issue 1 year ago • 4 comments

I ran the training code with two GPUs and got the error message: Parameters which did not receive grad for rank 0: down_blocks.2.attentions.1.transformer_blocks.0.attn_temp.to_out.0.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn_temp.to_v.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn_temp.to_k.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn_temp.to_q.weight, xxx.

I double checked and found that down_blocks.0.attentions.0.transformer_blocks.0.attn_temp.to_v.weight is always zero.

This issue may result from the use of torch.utils.checkpoint.checkpoint in https://github.com/showlab/Tune-A-Video/blob/main/tuneavideo/models/unet_blocks.py#L300.

Reference: https://github.com/huggingface/transformers/issues/21381 "gradient checkpointing disables requires_grad when freezing part of models (fix with use_reentrant=False)"
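For context, here is a minimal standalone reproduction of the behavior described in that issue (not Tune-A-Video code; the Linear layer just stands in for a checkpointed block that contains trainable weights such as attn_temp, and it assumes a PyTorch version that supports the use_reentrant flag, i.e. >= 1.11):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Linear(4, 4)   # trainable module inside the checkpointed segment
x = torch.randn(2, 4)     # input does not require grad (frozen upstream weights)

# Reentrant checkpointing (the historical default, use_reentrant=True): since no
# *input* tensor requires grad, the output carries no grad_fn, so nothing inside
# the checkpointed segment can ever receive gradients, even though
# block.weight.requires_grad is True.
out = checkpoint(block, x, use_reentrant=True)
print(out.requires_grad)              # False -> the autograd graph is cut here

# Non-reentrant checkpointing: the output stays connected to block's parameters.
out = checkpoint(block, x, use_reentrant=False)
print(out.requires_grad)              # True
out.sum().backward()
print(block.weight.grad is not None)  # True -> gradients reach the block
```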

Can someone confirm if this issue exists and provide a brief update?

czk32611 avatar May 15 '23 03:05 czk32611

can you check if this issue still occurs when training on one gpu?

zhangjiewu avatar May 15 '23 05:05 zhangjiewu

> can you check if this issue still occurs when training on one gpu?

It also occurs when training on one GPU, with no warning or error. The trainable modules in down_blocks still have no grad.

Quick check: you can find that unet.down_blocks[2].attentions[1].transformer_blocks[0].attn_temp.to_out[0].weight is always zero and unet.down_blocks[2].attentions[1].transformer_blocks[0].attn_temp.to_out[0].weight.grad is always None.
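For reference, a minimal way to run this check inside the training loop, right after loss.backward() (assuming `unet` is the loaded UNet3DConditionModel):

```python
import torch

# Run after a training step's loss.backward(); `unet` is the Tune-A-Video UNet.
p = unet.down_blocks[2].attentions[1].transformer_blocks[0].attn_temp.to_out[0].weight
print("still all zeros:", torch.all(p == 0).item())  # True -> stuck at its zero init
print("grad:", p.grad)                                # None -> no gradient reached it
```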

czk32611 avatar May 15 '23 07:05 czk32611

I'm facing the same issue: the network does not seem to be updated during training. The gradients of the trainable modules are always zero. Can anyone resolve the problem?

DuanXiaoyue-LittleMoon avatar Jun 27 '23 11:06 DuanXiaoyue-LittleMoon

This happens because 'torch.utils.checkpoint' is enabled to save GPU memory. If you want to update learnable modules inside a checkpointed function, you must ensure the input tensor has requires_grad = True, because Tune-A-Video relies on the default 'use_reentrant=True' of the torch.utils.checkpoint function. Why can the mid and up layers still be updated? The mid layers do not go through torch.utils.checkpoint, so their trainable parameters make the hidden_states tensor require grad, and the up layers can then be updated even though they do use torch.utils.checkpoint (see the sketch below).
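To illustrate that point, a minimal sketch (again assuming PyTorch >= 1.11; three Linear layers stand in for a checkpointed down block, the non-checkpointed mid block, and a checkpointed up block):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

down = nn.Linear(4, 4)  # checkpointed; its input does NOT require grad
mid = nn.Linear(4, 4)   # not checkpointed
up = nn.Linear(4, 4)    # checkpointed; its input DOES require grad (thanks to mid)

x = torch.randn(2, 4)                        # frozen upstream activations
h = checkpoint(down, x, use_reentrant=True)  # graph is cut: h.requires_grad is False
h = mid(h)                                   # mid's trainable weights make h require grad
h = checkpoint(up, h, use_reentrant=True)    # reentrant checkpointing now works
h.sum().backward()

print(down.weight.grad)                      # None -> down blocks never update
print(mid.weight.grad is not None)           # True
print(up.weight.grad is not None)            # True -> up blocks update as described
```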

The simplest way to solve this is to set 'gradient_checkpointing: False', if your GPU has sufficient memory 🙃. Good luck!
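If you go this route, the config flag presumably just controls whether the training script calls enable_gradient_checkpointing() on the UNet; on an already-built model you can switch it off directly, as in this sketch (paths are placeholders from the repo's configs, and disable_gradient_checkpointing() comes from the diffusers ModelMixin the UNet inherits from, if your diffusers version exposes it):

```python
from tuneavideo.models.unet import UNet3DConditionModel

# Illustrative only: load the UNet as in the training setup (adjust the local
# SD-1.4 path to your checkpoint location) and keep gradient checkpointing off.
unet = UNet3DConditionModel.from_pretrained_2d(
    "./checkpoints/stable-diffusion-v1-4", subfolder="unet"
)
unet.disable_gradient_checkpointing()  # no-op if it was never enabled
```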

guomc9 avatar Jun 27 '24 11:06 guomc9