Megatron-LM
AttributeError: 'Parameter' object has no attribute 'main_grad'
When I train the model, if some modules (parameters) are not involved in the current forward propagation, then those parameters receive no gradients during backpropagation. At that point, the error in the title appears. The error occurs in the optimizer.py file.
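For context, a minimal PyTorch sketch (a hypothetical toy model, not Megatron code) of the situation described above: a submodule that is defined but never called in forward() takes no part in the autograd graph and therefore gets no gradient during backpropagation.

```python
# Minimal sketch (hypothetical model, not Megatron code): an unused submodule
# never receives gradients during backpropagation.
import torch
import torch.nn as nn

class TwoBranchModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.used_branch = nn.Linear(8, 8)
        self.unused_branch = nn.Linear(8, 8)  # defined but never called below

    def forward(self, x):
        return self.used_branch(x)  # unused_branch is not in the graph

model = TwoBranchModel()
loss = model(torch.randn(4, 8)).sum()
loss.backward()

# Parameters of the unused branch end up with grad=None; with Megatron's local
# DDP they would likewise never get the extra main_grad attribute the
# optimizer expects, which matches the AttributeError in the title.
for name, p in model.named_parameters():
    print(name, "grad is None:", p.grad is None)
```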
I got this problem too. Did you solve it?
@xyltt @DY-TL I encountered this error too. Below is the trace. Did any of you find out the cause and fix for this? Appreciate any pointers.
File "/workspace/prraman/megatron/optimizer.py", line 384, in step self._copy_model_grads_to_main_grads() File "/workspace/prraman/megatron/optimizer.py", line 311, in _copy_model_grads_to_main_grads main_param.grad = model_param.main_grad.float() AttributeError: 'Parameter' object has no attribute 'main_grad' Traceback (most recent call last):
Essentially, the error occurs when trying to access the main_grad attribute of the model_param object at this line: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/optimizer/optimizer.py#L316
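To make the failing access concrete, here is a hypothetical debugging helper (not the upstream Megatron code) that mirrors the line from the traceback, model_param.main_grad.float(), and guards it for parameters that never received a main_grad:

```python
# Hypothetical debugging helper (not upstream Megatron code) mirroring the
# failing access above: model_param.main_grad.float().
from typing import Optional

import torch


def grad_for_main_param(model_param: torch.nn.Parameter) -> Optional[torch.Tensor]:
    """Return the gradient to copy onto the fp32 main parameter, if any."""
    main_grad = getattr(model_param, "main_grad", None)
    if main_grad is not None:
        return main_grad.float()
    # Parameters that took no part in the forward pass never get main_grad
    # attached by the local DDP wrapper; fall back to the ordinary .grad,
    # which may itself be None for such parameters.
    return model_param.grad.float() if model_param.grad is not None else None
```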
@jaredcasper Just bumping this up for your attention in case you have any recommendations for a fix. Thanks!
Following up with some debugging on this, I found that the params_have_main_grad flag, which causes the above code to look for the main_grad attribute, is set in the optimizer module's __init__.py (https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/optimizer/__init__.py#L71-L73). Basically, the flag is set when args.DDP_impl == 'local'.
However, in megatron/model/distributed.py, where the main_grad attribute is actually initialized, I find that it is created only when the self.use_contiguous_buffers flag is also set (https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/distributed.py#L124). Could that be a possible cause of this error? Do both flags need to be set for main_grad to be allocated and used?
In my case, args.DDP_impl is set to "local" but the args.use_contiguous_buffers_in_ddp flag is "false".
P.S. The above error occurs even when running the code serially (i.e., with tensor_model_parallel_size and pipeline_model_parallel_size set to 1).
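The suspected mismatch can be summarized in a short, hedged sketch; the conditions are paraphrased from the two files linked above and simplified, not copied from the actual Megatron sources:

```python
# Hedged sketch of the suspected flag mismatch (simplified, paraphrased from
# the linked files, not the actual Megatron sources).
class Args:  # hypothetical stand-in for Megatron's parsed arguments
    DDP_impl = "local"
    use_contiguous_buffers_in_ddp = False

args = Args()

# megatron/optimizer/__init__.py (paraphrased): the optimizer will read
# main_grad whenever the local DDP implementation is used.
params_have_main_grad = (args.DDP_impl == "local")

# megatron/model/distributed.py (paraphrased): main_grad is only allocated
# when the contiguous gradient buffer is also enabled.
main_grad_is_allocated = (args.DDP_impl == "local"
                          and args.use_contiguous_buffers_in_ddp)

if params_have_main_grad and not main_grad_is_allocated:
    print("Mismatch: optimizer expects .main_grad but it was never created -> "
          "AttributeError: 'Parameter' object has no attribute 'main_grad'")
```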
I want to train a model with a unified encoder and two separate decoders, and I ran into this problem. @xyltt @DY-TL @ParamsRaman @jaredcasper Did you solve it? I'd appreciate any pointers, thanks.
@xyltt Sorry if this is too late. Add --no-gradient-accumulation-fusion to the arguments, and it will probably solve your problem.
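For context, a hedged sketch of how a --no-...-fusion style option is typically wired up with argparse; only the flag name comes from the comment above, and the dest name gradient_accumulation_fusion and the store_false behaviour are assumptions used for illustration:

```python
# Hedged sketch: a "--no-<feature>" flag is typically an argparse store_false
# option. The dest name gradient_accumulation_fusion is an assumption here;
# only the flag name itself comes from the comment above.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--no-gradient-accumulation-fusion",
                    action="store_false",
                    dest="gradient_accumulation_fusion",
                    default=True,
                    help="Disable fused gradient accumulation (assumed behaviour).")

args = parser.parse_args(["--no-gradient-accumulation-fusion"])
print(args.gradient_accumulation_fusion)  # False: the fused path is skipped
```

In practice the flag is simply appended to the existing set of training arguments passed to the launch script.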
We use --no-gradient-accumulation-fusion, but it does not work. Do you have any idea?
You need to make sure that every network layer you define is actually used in the forward pass; otherwise this error is raised. Check whether any of your defined layers are never used and delete them.
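One way to check for such layers is a quick plain-PyTorch pass outside of Megatron; find_unused_parameters below is a hypothetical helper written for this thread, not a Megatron utility:

```python
# Hedged sketch (plain PyTorch, hypothetical helper) for finding parameters
# that were not touched by backpropagation, i.e. layers defined but never used.
import torch
import torch.nn as nn

def find_unused_parameters(model: nn.Module):
    """Return names of parameters that received no gradient after backward()."""
    return [name for name, p in model.named_parameters()
            if p.requires_grad and p.grad is None]

# Usage: run one forward/backward pass on your model, then inspect the result.
model = nn.Sequential(nn.Linear(8, 8))   # placeholder model
loss = model(torch.randn(2, 8)).sum()
loss.backward()
print(find_unused_parameters(model))     # [] here; non-empty if layers are unused
```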