extract and log grad norm for individual layers
the paper NormFormer: Improved Transformer Pretraining with Extra Normalization https://arxiv.org/abs/2110.09456 suggests that under preLN:
gradients at earlier layers tend to be larger than gradients at later layers
so we want to verify that this is so in our case before acting on it and potentially integrating NormFormer.
So we need to expand the tensorboard and logs to log grad norm for individual layers, perhaps as in the paper we can log 5 layers: 0, 1, int(n_layers/2), -2, -1
For reference see p6 in the paper (graphs and discussion).
Currently only L2 average of all layer grad norms is calculated and logged (a single number).
Performance-wise this will require doing some extra calculations, but not many - it's really just a matter of figuring out how to broadcast this info to TB so that multiple points can be logged at once.
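For the TB side, a minimal sketch of what I have in mind (the helper name, the layer_grad_norms dict and the tag scheme are just placeholders, assuming we already have the per-layer norms in hand):

    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter(log_dir="runs/grad-norm-debug")

    def log_layer_grad_norms(writer, layer_grad_norms, iteration):
        # layer_grad_norms: dict of layer index -> grad norm for this step;
        # one scalar tag per layer gives one curve per layer in TB
        for layer_idx, norm in layer_grad_norms.items():
            writer.add_scalar(f"grad_norm/layer_{layer_idx}", norm, iteration)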
Additionally, we could activate this tool on demand - e.g. after encountering a spike we could roll back to the last good checkpoint and run a cycle with this debug feature enabled.
My initial thought was to see where the grad norm we're currently logging is coming from, and whether we can get more granular layer-wise estimates from the internals. I've done a quick scan of our repo and it came down to
grad_norm = model[0].get_global_grad_norm()
And model[0] is an instance of deepspeed.PipelineEngine, which inherits from deepspeed.DeepSpeedEngine. This class gets the get_global_grad_norm from its optimizer.
if hasattr(self.optimizer, '_global_grad_norm'):
self._global_grad_norm = self.optimizer._global_grad_norm
But I got lost after this. I assumed that the optimizer we're using is from apex, but at least based on my cursory search of the apex repo, I wasn't able to verify that apex optimizers have the _global_grad_norm attribute. I was able to see global_grad_norm in some CUDA files.
Is this the right direction of research?
Using a good visual debugger is the best way to do this kind of work. PyCharm is free and excellent. But surely there are others.
The grad norms are gathered here: https://github.com/microsoft/DeepSpeed/blob/0b77d1d98a7b6f96937cd70db2adf6ef19062ba5/deepspeed/runtime/zero/stage2.py#L1638-L1682
self.averaged_gradients (in the code snippet above) contains the grad averages for all params
note: we use ZeRO-1 here, but it uses the same code paths as ZeRO-2 mostly, therefore the code goes through zero/stage2.py.
The L2 global norm is calculated here: https://github.com/microsoft/DeepSpeed/blob/0b77d1d98a7b6f96937cd70db2adf6ef19062ba5/deepspeed/runtime/utils.py#L299-L305
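Conceptually that utility reduces to the square root of the sum of squared per-tensor L2 norms; a framework-agnostic sketch (not DeepSpeed's actual code, function name is made up):

    import torch

    def global_l2_norm(grads):
        # global L2 norm = sqrt of the sum of squared per-tensor L2 norms
        return torch.sqrt(sum(g.float().norm(2) ** 2 for g in grads))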
The hard part is figuring out how to remap those piles of tensors in self.averaged_gradients back to the layers/submodules they belong to.
deepspeed packs params into groups matching optimizer.param_groups
self.param_shapes in train_batch can be helpful to see the names and shapes of the params.
@tjruwase, do you have any suggestions as to how we could change stage2 to optionally calculate grad norm per layer and not just the L2 average over all gradients? The difficulty is to remap back from the param groups to layers.
It looks like self.averaged_gradients is what we want, but there is no map for how to slice those by layer. Still, we're getting warm, since I checked that it contains all the params.
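On the plain-PyTorch side the attribution is easy if we can get at the original parameter tensors - purely illustrative, and it doesn't solve the flattened/partitioned-gradient case that ZeRO introduces:

    # map each parameter tensor back to its fully qualified name, so a gradient
    # that corresponds 1:1 to a parameter can be attributed to its layer
    param_to_name = {param: name for name, param in model.named_parameters()}
    # e.g. something like "transformer.layers.3.mlp.dense_h_to_4h.weight"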
Additionally logging min/max grad of each layer might be useful info as well.
context: We are trying to add a new diagnostic to logs+tensorboard which would log grad norm per layer, since the NormFormer paper (https://arxiv.org/abs/2110.09456) indicates that the instabilities come from different layers having different magnitudes of grad norm and thus require individual scaling.
And I think min/max would be useful as well.
The plan is to log just a few layers, but it's probably the easiest to first figure out how to calculate it for all layers.
Thanks!
@jaketae, the other simpler approach is to not touch Deepspeed and develop a generic tool that will work with any framework or model. Please have a look at this tool I have developed for params under/overflow debug: https://github.com/huggingface/transformers/blob/95bab53868a91b4809bd5281a72b5f326853e31f/src/transformers/debug_utils.py#L28
So we want a very similar thing, except for backward, so instead of using register_module_forward_hook we want register_module_full_backward_hook: https://pytorch.org/docs/stable/generated/torch.nn.modules.module.register_module_full_backward_hook.html#torch.nn.modules.module.register_module_full_backward_hook
And report the min/max and additionally the norm of each layer.
And then we can optimize it to optionally only run for a few layers instead of all.
Then this tool can be plugged into Megatron at will when we need the debug info.
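A minimal sketch of that approach (the hook name and the print format are placeholders, not part of the existing tool):

    import torch
    from torch import nn

    def grad_stats_hook(module, grad_input, grad_output):
        # report min/max and L2 norm of the gradients flowing out of each module
        grads = [g for g in grad_output if g is not None]
        if not grads:
            return
        flat = torch.cat([g.flatten() for g in grads])
        print(f"{module.__class__.__name__}: "
              f"min={flat.min().item():.3e} max={flat.max().item():.3e} "
              f"norm={flat.norm(2).item():.3e}")

    # register once, globally, for every module - no framework changes needed
    handle = nn.modules.module.register_module_full_backward_hook(grad_stats_hook)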
@stas00 Thank you for the detailed writeup. I'll take a look at the tool you built and see if I can figure out a way to apply it here. I also remember you leaving a link to a writeup about the tool, so I might read that as well. I'll ping you if I run into any blockers.
Re: doing it separately: Try the weights tool first. There is a detailed guide here: https://huggingface.co/transformers/master/debugging.html
So we want the same but for gradients, the concept is identical - but instead of params you are getting gradients as arguments.
Re: the deepspeed way, it shouldn't take long to merge if it is a reasonable proposal and it resonates with the Deepspeed team, but it'd be difficult to tweak at will. So while I think it'd be a useful feature for others, doing it via pytorch hooks is probably much simpler for us to experiment with, at least until we find what we really need to dump.
And yes, if you encounter any issues please don't hesitate to ping me.
Thank you for working on that feature, Jake.
I might have gotten ahead of myself, but this is roughly the idea I had in mind after skimming through the code and the PyTorch documentation. This is obviously the product of just a few minutes of thinking and is a very early prototype that may undergo drastic changes (or even be completely scrapped, if I got something totally wrong).
class DebugGradientNorm:
    def __init__(self, model, layers=None):
        self.model = model
        self.layers = layers or []
        # TODO: check layer indices are valid

    def backward_hook(self, module, grad_inputs, grad_outputs):
        # sum of the norms of the gradients flowing out of this module
        total_grad_norm = sum(grad_output.norm() for grad_output in grad_outputs)
        print(total_grad_norm.item())
        # TODO: print which layer the message is coming from

    def register_backward_hook(self):
        for layer_idx in self.layers:
            self.model[layer_idx].apply(self._register_backward_hook)

    def _register_backward_hook(self, module):
        # prefer the non-deprecated full backward hook when available
        if hasattr(module, "register_full_backward_hook"):
            module.register_full_backward_hook(self.backward_hook)
        else:
            module.register_backward_hook(self.backward_hook)
I used the following dummy code to check the functionality of the class.
import torch
from torch import nn

def run():
    model = nn.Sequential(
        nn.Linear(10, 10),
        nn.GELU(),
        nn.Linear(10, 10),
    ).cuda()
    debug_gradient = DebugGradientNorm(model, [0, 1])
    debug_gradient.register_backward_hook()
    inputs = torch.randn(8, 10, 10).cuda()
    # dummy loss to get a scalar
    loss = model(inputs).sum()
    loss.backward()
Result:
14.654252052307129
9.012221336364746
At the moment, I have two outstanding questions.
- How do we access individual layers in Megatron? Is it like a simple sequential model where I can index into it like a list? Or should it be a set of string keys?
- Will PyTorch automatically account for the distributed nature of our setup?
Thank you @stas00!
I haven't tried it, but conceptually your code looks right, Jake.
I think it should just work regardless of how it's used. We aren't dealing with ZeRO-2 where grads are partitioned so we don't need to worry about gathering those.
The hook intercepts the backward call, so you get access to the gradients before they are returned to the caller (the framework), which is why this is a super neat way to sniff things out w/o touching the framework.
In such situations you can pre-map the model to layers, and then inside the hook you know which layer you're dealing with. As I haven't worked with grad hooks yet, I can't tell for sure - we can look at it together if you get stuck.
Apparently torch modules are hashable, so we can put them into a dictionary as you suggested.
self.map = {model[layer]: layer for layer in layers}
And in the hook, we can use the map to figure out where the function was called.
print(self.map[module], total_grad_norm.item())
The dummy script produced the following output.
1 22.942493438720703
0 14.022110939025879
I think the order makes sense since backprop starts from the uppermost layer, hence the reverse order.
Bracket syntax conveniently works for both array-style indexing as well as dictionary key accessing, so it should be able to cover both layer indices as well as string named modules.
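For example (just to illustrate the point, not tied to Megatron's actual module layout):

    from torch import nn

    seq = nn.Sequential(nn.Linear(4, 4), nn.GELU())
    named = nn.ModuleDict({"encoder": nn.Linear(4, 4), "decoder": nn.Linear(4, 4)})
    layer_a = seq[0]            # array-style indexing into a Sequential
    layer_b = named["encoder"]  # dictionary-style access by module name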
I'll go ahead and open a PR tomorrow, maybe create a new file in the tools directory?
What you said - and please add the layer index. If you remember, we might not log all layers but only sample a few, to keep things faster and not have too much info. So in the general case it'd be all layers, with an option for the user to select which layer numbers they want, like I proposed in the OP: 0, 1, int(n_layers/2), -2, -1 (which is what the NormFormer paper reports).
Finally, we don't want the sum of gradient norms, but most likely an L2 norm over all the gradients of the layer, as in https://github.com/microsoft/DeepSpeed/blob/0b77d1d98a7b6f96937cd70db2adf6ef19062ba5/deepspeed/runtime/utils.py#L299-L305. And my thinking is that logging min/max might be useful as well - that will help us better understand clipping, for example (which would happen later).
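In terms of the prototype above, the hook body could be reshaped along these lines (a sketch only; self.map is the module-to-layer map from the earlier comment):

    def backward_hook(self, module, grad_inputs, grad_outputs):
        # flatten all gradients flowing out of this layer into one vector
        flat = torch.cat([g.flatten() for g in grad_outputs if g is not None])
        l2_norm = flat.norm(2)  # L2 norm over all the gradients of the layer
        print(self.map[module], l2_norm.item(), flat.min().item(), flat.max().item())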
Perhaps another option for grad norm of specific param, in case we want to have even finer granularity. The original debug tool does it for each param.
We can add those later, let's start with one of the features and make it work.
As I'm writing this, I'm thinking that we might still want to get into deepspeed after all, because the hook gives us grad data before it has been post-processed, and we may want the post-processed data too (e.g. after clipping).
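To illustrate the ordering in plain PyTorch (not our actual training loop):

    # full backward hooks fire here, during backward(), i.e. pre-clipping
    loss.backward()
    # gradient clipping happens afterwards; post-clipping values would have to be
    # read from p.grad (or from inside deepspeed) after this point
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)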
As to where to put it: I'd say replicate debug_utils.py from HF Transformers under megatron/debug_utils.py, like we did with testing_utils.py, and then once the new feature is polished we can contribute it back to Transformers.
Down the road it might be worthwhile releasing it as a standalone library as well, since it is very generic (not framework dependent). But once it has more than one tool :)