Megatron-LM
AttributeError: 'Parameter' object has no attribute 'main_grad'
When I train the model, if some modules (parameters) are not involved in the current forward propagation, then those parameters receive no gradients during backpropagation. At that point, the error in the title appears. The error occurs in the optimizer.py file.
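For context, a minimal PyTorch sketch (a hypothetical toy model, not Megatron code) of the situation described above: a submodule that is defined but never called in forward() takes no part in the autograd graph and therefore gets no gradient during backpropagation.

```python
# Minimal sketch (hypothetical model, not Megatron code): an unused submodule
# never receives gradients during backpropagation.
import torch
import torch.nn as nn

class TwoBranchModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.used_branch = nn.Linear(8, 8)
        self.unused_branch = nn.Linear(8, 8)  # defined but never called below

    def forward(self, x):
        return self.used_branch(x)  # unused_branch is not in the graph

model = TwoBranchModel()
loss = model(torch.randn(4, 8)).sum()
loss.backward()

# Parameters of the unused branch end up with grad=None; with Megatron's local
# DDP they would likewise never get the extra main_grad attribute the
# optimizer expects, which matches the AttributeError in the title.
for name, p in model.named_parameters():
    print(name, "grad is None:", p.grad is None)
```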
I got this problem too. Did you solve it?
@xyltt @DY-TL I encountered this error too. Below is the trace. Did any of you find out the cause and fix for this? Appreciate any pointers.
File "/workspace/prraman/megatron/optimizer.py", line 384, in step self._copy_model_grads_to_main_grads() File "/workspace/prraman/megatron/optimizer.py", line 311, in _copy_model_grads_to_main_grads main_param.grad = model_param.main_grad.float() AttributeError: 'Parameter' object has no attribute 'main_grad' Traceback (most recent call last):
Essentially, the error occurs when trying to access the main_grad attribute of the model_param object at this line: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/optimizer/optimizer.py#L316
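To make the failing access concrete, here is a hypothetical debugging helper (not the upstream Megatron code) that mirrors the line from the traceback, model_param.main_grad.float(), and guards it for parameters that never received a main_grad:

```python
# Hypothetical debugging helper (not upstream Megatron code) mirroring the
# failing access above: model_param.main_grad.float().
from typing import Optional

import torch


def grad_for_main_param(model_param: torch.nn.Parameter) -> Optional[torch.Tensor]:
    """Return the gradient to copy onto the fp32 main parameter, if any."""
    main_grad = getattr(model_param, "main_grad", None)
    if main_grad is not None:
        return main_grad.float()
    # Parameters that took no part in the forward pass never get main_grad
    # attached by the local DDP wrapper; fall back to the ordinary .grad,
    # which may itself be None for such parameters.
    return model_param.grad.float() if model_param.grad is not None else None
```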
@jaredcasper Just bumping this up for your attention in case you have any recommendations for a fix. Thanks!
Following up with some debugging on this, I found that the params_have_main_grad flag, which causes the above code to look for the main_grad attribute, is set in the optimizer module's __init__.py (https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/optimizer/__init__.py#L71-L73). Basically, the flag is set when args.DDP_impl == 'local'.
However, in megatron/model/distributed.py, where the main_grad attribute is actually initialized, I find that it is created only when the self.use_contiguous_buffers flag is also set (https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/distributed.py#L124). Could that be a possible cause of this error? Do both flags need to be set for main_grad to be allocated and used?
In my case, args.DDP_impl is set to "local" but the args.use_contiguous_buffers_in_ddp flag is "false".
P.S. The above error occurs even when running the code serially (i.e., with tensor_model_parallel_size and pipeline_model_parallel_size set to 1).
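The suspected mismatch can be summarized in a short, hedged sketch; the conditions are paraphrased from the two files linked above and simplified, not copied from the actual Megatron sources:

```python
# Hedged sketch of the suspected flag mismatch (simplified, paraphrased from
# the linked files, not the actual Megatron sources).
class Args:  # hypothetical stand-in for Megatron's parsed arguments
    DDP_impl = "local"
    use_contiguous_buffers_in_ddp = False

args = Args()

# megatron/optimizer/__init__.py (paraphrased): the optimizer will read
# main_grad whenever the local DDP implementation is used.
params_have_main_grad = (args.DDP_impl == "local")

# megatron/model/distributed.py (paraphrased): main_grad is only allocated
# when the contiguous gradient buffer is also enabled.
main_grad_is_allocated = (args.DDP_impl == "local"
                          and args.use_contiguous_buffers_in_ddp)

if params_have_main_grad and not main_grad_is_allocated:
    print("Mismatch: optimizer expects .main_grad but it was never created -> "
          "AttributeError: 'Parameter' object has no attribute 'main_grad'")
```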
I want to train a model with a unified encoder and two separate decoders, and I ran into this problem. @xyltt @DY-TL @ParamsRaman @jaredcasper Did you solve it? I'd appreciate any pointers, thanks.
@xyltt Sorry if this is too late. Add --no-gradient-accumulation-fusion to the arguments, and it will probably solve your problem.
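For context, a hedged sketch of how a --no-...-fusion style option is typically wired up with argparse; only the flag name comes from the comment above, and the dest name gradient_accumulation_fusion and the store_false behaviour are assumptions used for illustration:

```python
# Hedged sketch: a "--no-<feature>" flag is typically an argparse store_false
# option. The dest name gradient_accumulation_fusion is an assumption here;
# only the flag name itself comes from the comment above.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--no-gradient-accumulation-fusion",
                    action="store_false",
                    dest="gradient_accumulation_fusion",
                    default=True,
                    help="Disable fused gradient accumulation (assumed behaviour).")

args = parser.parse_args(["--no-gradient-accumulation-fusion"])
print(args.gradient_accumulation_fusion)  # False: the fused path is skipped
```

In practice the flag is simply appended to the existing set of training arguments passed to the launch script.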
We use --no-gradient-accumulation-fusion, but it does not work. Do you have any idea?
You need to make sure that every network layer you define is actually used in the forward pass; otherwise this error is raised. Check whether any of your defined layers are never used and delete them.
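One way to check for such layers is a quick plain-PyTorch pass outside of Megatron; find_unused_parameters below is a hypothetical helper written for this thread, not a Megatron utility:

```python
# Hedged sketch (plain PyTorch, hypothetical helper) for finding parameters
# that were not touched by backpropagation, i.e. layers defined but never used.
import torch
import torch.nn as nn

def find_unused_parameters(model: nn.Module):
    """Return names of parameters that received no gradient after backward()."""
    return [name for name, p in model.named_parameters()
            if p.requires_grad and p.grad is None]

# Usage: run one forward/backward pass on your model, then inspect the result.
model = nn.Sequential(nn.Linear(8, 8))   # placeholder model
loss = model(torch.randn(2, 8)).sum()
loss.backward()
print(find_unused_parameters(model))     # [] here; non-empty if layers are unused
```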