
Fixing MuP

Open marcobellagente93 opened this issue 1 year ago • 8 comments

The current MuP implementation in neox is buggy. This PR restores the main functionality without major changes to the code. Current limitations:

  • only supports non-tied models
  • does not immediately support embedding multipliers
  • does not immediately support attention-factor multipliers and 1/d**2 attention scaling

The main issue in the current code is that the model is always initialized with use_mup = False, which is only set to its correct value later. This doesn't work: the wrong attribute is already baked in at the init of all classes, so MuP was effectively never used. The best solution would be to loop through all modules and set the correct attribute on each. The current workaround is a minimal modification whereby the attribute is reset at the re-init of the linear layers, which does the correct thing for everything except the self-attention and embedding.
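For illustration, a minimal sketch of the "loop through all modules" approach described above; the attribute name follows the description here and is an assumption, not the exact NeoX code:

```python
import torch.nn as nn

def set_use_mup(model: nn.Module, use_mup: bool) -> None:
    # Propagate the flag to every submodule after the config is finalized,
    # instead of relying on whatever value was captured at construction time.
    for module in model.modules():
        if hasattr(module, "use_mup"):
            module.use_mup = use_mup
```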

A second issue is that the code as-is expects the mu-optimizer to provide the correct multiplier of target_width/base_width, but this is not provided by the mup library. We should probably just open a PR on mup and get rid of this. As the fastest solution, mup is added to the repo, with the multiplier added to the optimizer dict. I also removed the torch dependencies from mup, since they are unnecessary and can only lead to issues.
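Conceptually, the change amounts to something like the sketch below: record each parameter group's width multiplier on the group dict when the optimizer groups are built, so the LR scheduler can later divide by target_width/base_width. The function and key names are illustrative rather than the actual mup or fork code, and it assumes mup's per-parameter `p.infshape.width_mult()` accessor and that all parameters in a group share the same multiplier:

```python
def attach_width_mult(optimizer):
    # Illustrative only: expose target_width / base_width on each param group
    # so that a downstream LR scheduler can rescale the learning rate.
    for group in optimizer.param_groups:
        p = group["params"][0]
        # After mup.set_base_shapes, parameters carry an infshape whose
        # width_mult() gives the target/base width ratio (assumption: every
        # parameter in this group shares the same multiplier).
        group["width"] = p.infshape.width_mult()
    return optimizer
```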

Plots are also added for testing the implementation:

  • coordinate checker (see the sketch after this list)
  • wider is always better
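To make the first check concrete, here is a generic sketch of what a coordinate check measures; this is not the plotting code added in the PR and it does not set up muP itself, it only shows the statistic being tracked: train models of increasing width for a few steps and record per-layer activation scales, which should stay roughly constant in width under a correct muP setup.

```python
import torch
import torch.nn as nn

def coord_check(widths=(128, 256, 512, 1024), steps=3, batch=32, d_in=16):
    stats = {}
    for width in widths:
        model = nn.Sequential(nn.Linear(d_in, width), nn.ReLU(), nn.Linear(width, 1))
        opt = torch.optim.SGD(model.parameters(), lr=1e-2)
        for _ in range(steps):
            x = torch.randn(batch, d_in)
            hidden = model[1](model[0](x))        # hidden-layer activations
            loss = model[2](hidden).pow(2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Mean absolute activation after a few steps; under muP this stays
        # roughly flat in width, under standard parametrization it drifts.
        stats[width] = hidden.abs().mean().item()
    return stats
```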

marcobellagente93 avatar Oct 19 '23 10:10 marcobellagente93

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Oct 19 '23 10:10 CLAassistant

@nsarka -- FYI

Quentin-Anthony avatar Oct 19 '23 14:10 Quentin-Anthony

Instead of incorporating muP into GPT-NeoX we are going to move these changes to our fork of their repo and install that version until the changes are upstreamed.

StellaAthena avatar Oct 19 '23 17:10 StellaAthena

Instead of incorporating muP into GPT-NeoX we are going to move these changes to our fork of their repo and install that version until the changes are upstreamed.

Not all of his changes are muP-related. I've separated out the muP 1-line change into our fork until https://github.com/microsoft/mup/pull/65 is merged. We can discuss the GPT-NeoX specific changes here and remove the mup subdir.

Quentin-Anthony avatar Oct 19 '23 20:10 Quentin-Anthony

Instead of incorporating muP into GPT-NeoX we are going to move these changes to our fork of their repo and install that version until the changes are upstreamed.

Not all of his changes are muP-related. I've separated out the muP 1-line change into our fork until microsoft/mup#65 is merged. We can discuss the GPT-NeoX specific changes here and remove the mup subdir.

Oh I see, I read the previous discussion backwards (thinking it was a 1-line NeoX edit and a substantial muP edit). I went ahead and removed the muP changes (moving them to the fork) and imported the new muP library. I haven't had a chance to check the correctness of this implementation yet, however.

StellaAthena avatar Oct 20 '23 05:10 StellaAthena

Instead of incorporating muP into GPT-NeoX we are going to move these changes to our fork of their repo and install that version until the changes are upstreamed.

Not all of his changes are muP-related. I've separated out the muP 1-line change into our fork until microsoft/mup#65 is merged. We can discuss the GPT-NeoX specific changes here and remove the mup subdir.

Oh I see, I read the previous discussion backwards (thinking it was a 1-line NeoX edit and a substantial muP edit). I went ahead and removed the muP changes (moving them to the fork) and imported the new muP library. I haven't had a chance to check the correctness of this implementation yet, however.

That's fairly quick to verify. Currently NeoX adjusts the learning rate with the width here, but group['width'] doesn't exist, and since the lookup only happens inside the if block it never raised an error

https://github.com/EleutherAI/gpt-neox/blob/b02d98932f95fe0500c28698b38acb175e92e980/megatron/learning_rates.py#L97C1-L101C37
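For reference, an illustrative reconstruction of the pattern being described; the names are assumptions, not a verbatim copy of megatron/learning_rates.py:

```python
def adjust_lr(param_groups, new_lr, use_mup):
    for group in param_groups:
        if use_mup and "width" in group:
            # The mu-optimizer never actually put a 'width' key into the group,
            # so this branch silently never runs and no error is raised; the LR
            # is simply left unscaled.
            group["lr"] = new_lr / group["width"]
        else:
            group["lr"] = new_lr
```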

marcobellagente93 avatar Oct 20 '23 07:10 marcobellagente93

That's fairly quick to verify. Currently NeoX adjusts the learning rate with the width here, but group['width'] doesn't exist, and since the lookup only happens inside the if block it never raised an error

https://github.com/EleutherAI/gpt-neox/blob/b02d98932f95fe0500c28698b38acb175e92e980/megatron/learning_rates.py#L97C1-L101C37

It looks like this is unchanged from your branch? I thought your branch was working. Or am I missing something?

StellaAthena avatar Oct 21 '23 03:10 StellaAthena

That's fairly quick to verify. Currently NeoX adjusts the learning rate with the width here, but group['width'] doesn't exist, and since the lookup only happens inside the if block it never raised an error: https://github.com/EleutherAI/gpt-neox/blob/b02d98932f95fe0500c28698b38acb175e92e980/megatron/learning_rates.py#L97C1-L101C37

It looks like this is unchanged from your branch? I thought your branch was working. Or am I missing something?

It's fixed if we use the EleutherAI fork of mup, where I added the width to the optimizer dict: https://github.com/EleutherAI/mup/blob/14e436bc013418725976e7cfb1b4e74e8901ab80/mup/optim.py#L75C9-L80C52. Apologies for the confusion.

marcobellagente93 avatar Oct 21 '23 23:10 marcobellagente93