
[BUG]: assert len(grad_args) > 0 in training the animate anyone

Open zhangvia opened this issue 6 months ago • 2 comments

🐛 Describe the bug

I'm using ColossalAI to train Animate Anyone, but an error occurs in the UNet forward pass:

Traceback (most recent call last):
  File "core/train_stage2_colo.py", line 294, in <module>
    main(conf)
  File "core/train_stage2_colo.py", line 236, in main
    pred = unet(noisy_latents, timestep, encoder_hidden_states).sample
  File "/media/74nvme/software/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/74nvme/software/miniconda3/envs/animate/lib/python3.8/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 258, in forward
    outputs = self.module(*args, **kwargs)
  File "/media/74nvme/software/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/74nvme/animate_zl/source/core/model/unet.py", line 378, in forward
    emb = self.time_embedding(t_emb)
  File "/media/74nvme/software/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/74nvme/software/miniconda3/envs/animate/lib/python3.8/site-packages/diffusers/models/embeddings.py", line 192, in forward
    sample = self.linear_1(sample)
  File "/media/74nvme/software/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/74nvme/software/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
  File "/media/74nvme/software/miniconda3/envs/animate/lib/python3.8/site-packages/colossalai/tensor/colo_parameter.py", line 61, in __torch_function__
    new_args = ColoParamOpHookManager.pre_op(params, *args, *kwargs.values())
  File "/media/74nvme/software/miniconda3/envs/animate/lib/python3.8/site-packages/colossalai/tensor/param_op_hook.py", line 88, in pre_op
    grad_args, other_args, grad_flags, spec = _flatten_grad_args(args)
  File "/media/74nvme/software/miniconda3/envs/animate/lib/python3.8/site-packages/colossalai/tensor/param_op_hook.py", line 145, in _flatten_grad_args
    assert len(grad_args) > 0
AssertionError

Part of the UNet's parameters are frozen with requires_grad_(False); I don't know whether that is related to this error.
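For anyone hitting this: below is a minimal pure-Python sketch (no ColossalAI or torch required) of the check in `colossalai/tensor/param_op_hook.py::_flatten_grad_args` that fires in the traceback. The `Param` class and the exact splitting logic here are simplified stand-ins, not the library's real implementation; the point is only to show why a layer whose parameters are all frozen with `requires_grad_(False)` ends up with an empty `grad_args` list.

```python
class Param:
    """Stand-in for a parameter tensor; requires_grad mirrors torch's flag."""

    def __init__(self, name, requires_grad=True):
        self.name = name
        self.requires_grad = requires_grad


def flatten_grad_args(args):
    """Split args into those that participate in autograd and the rest.

    The real hook asserts at least one arg requires grad. When every
    parameter of a layer (e.g. time_embedding.linear_1's weight and bias)
    has been frozen, grad_args is empty and the assertion fails.
    """
    grad_args = [a for a in args if isinstance(a, Param) and a.requires_grad]
    other_args = [a for a in args if not (isinstance(a, Param) and a.requires_grad)]
    assert len(grad_args) > 0  # <-- the AssertionError in the traceback
    return grad_args, other_args


# A fully frozen linear layer reproduces the failure:
weight = Param("time_embedding.linear_1.weight", requires_grad=False)
bias = Param("time_embedding.linear_1.bias", requires_grad=False)

try:
    flatten_grad_args([weight, bias])
except AssertionError:
    print("AssertionError: no argument requires grad (all params frozen)")
```

So the assertion is a direct consequence of an op whose parameter inputs are all frozen, which matches the partially-frozen-UNet setup described above.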

Environment

No response

zhangvia avatar Jan 05 '24 06:01 zhangvia

I got the same error: https://github.com/hpcaitech/ColossalAI/issues/5290. Have you solved it? @zhangvia

ericxsun avatar Jan 20 '24 12:01 ericxsun

> I got the same error: #5290. Have you solved it? @zhangvia

No. I think the plugin cannot support training with part of the layers frozen.

zhangvia avatar Jan 22 '24 03:01 zhangvia