
storing & logging gradient norm in trainer

Open shijie-wu opened this issue 1 year ago • 9 comments

What does this PR do?

Report gradient norm during training - Fixes #26143
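For illustration, once the norm is reported it should show up in the logs passed to callbacks under a grad_norm key (a minimal sketch, assuming that key name; GradNormPrinter is a hypothetical callback and not part of this PR):

from transformers import TrainerCallback


class GradNormPrinter(TrainerCallback):
    """Hypothetical callback: print the reported gradient norm at each logging step."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "grad_norm" in logs:
            print(f"step {state.global_step}: grad_norm = {logs['grad_norm']}")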

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

@muellerzr @pacman100

shijie-wu avatar Nov 06 '23 17:11 shijie-wu

cc @muellerzr

amyeroberts avatar Nov 06 '23 18:11 amyeroberts

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jan 01 '24 08:01 github-actions[bot]

Thank you for the work on this, @shijie-wu!

It may seem like a little PR to some, but this would be a huge step to bring transformers closer to parity with projects like gpt-neox for large-scale training.

mjbommar avatar Jan 02 '24 15:01 mjbommar

Gentle ping @shijie-wu :)

muellerzr avatar Jan 05 '24 14:01 muellerzr

Found that self.accelerator.clip_grad_norm_ will return None when using DeepSpeed with the Trainer. With DeepSpeed we should instead use model.get_global_grad_norm() to get the gradient norm:

# Clip gradients and capture the norm. Under DeepSpeed, Accelerate's
# clip_grad_norm_ returns None, so query the engine for the global grad norm.
_grad_norm = self.accelerator.clip_grad_norm_(
    model.parameters(),
    args.max_grad_norm,
)
if self.accelerator.distributed_type == DistributedType.DEEPSPEED:
    grad_norm = model.get_global_grad_norm()
else:
    grad_norm = _grad_norm.item() if _grad_norm is not None else None

jubgjf avatar Jan 08 '24 10:01 jubgjf

sorry for the delay! PTAL @muellerzr @mjbommar

shijie-wu avatar Jan 17 '24 22:01 shijie-wu

Gentle ping @muellerzr @mjbommar :)

shijie-wu avatar Jan 24 '24 19:01 shijie-wu

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

cc @amyeroberts for final review :)

muellerzr avatar Feb 16 '24 18:02 muellerzr

Not sure if this was mentioned anywhere, but this PR breaks training checkpoint saving because:

  1. the grad norm is added to TrainerState.log_history as a tensor
  2. TrainerState.save_to_json attempts to serialize that tensor to JSON, which naturally errors out since tensors aren't JSON-serializable (a minimal repro is sketched below)
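For reference, a minimal repro of the failure (a sketch only; the entry dict and its values are illustrative, not taken from an actual log_history):

import json

import torch

# Hypothetical log_history entry carrying a tensor-valued grad_norm
entry = {"loss": 1.23, "grad_norm": torch.tensor(2.2043), "step": 800}

try:
    json.dumps(entry)
except TypeError as err:
    print(f"json.dumps failed: {err}")  # tensors are not JSON serializable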

my fix for this was to patch save_to_json to the following:

    def save_to_json(self, json_path: str):
        """Save the content of this instance in JSON format inside `json_path`."""
        selfd = dataclasses.asdict(self)
        # grad_norm may be stored as a 0-dim tensor; convert it to a plain float
        # so json.dumps can serialize the log history.
        for d in selfd["log_history"]:
            if "grad_norm" in d and hasattr(d["grad_norm"], "item"):
                d["grad_norm"] = d["grad_norm"].item()
        json_string = json.dumps(selfd, indent=2, sort_keys=True) + "\n"
        with open(json_path, "w", encoding="utf-8") as f:
            f.write(json_string)

but this is probably not the best approach

152334H avatar Mar 02 '24 13:03 152334H

@152334H it does convert grad_norm to a number before passing it into _maybe_log_save_evaluate:

https://github.com/huggingface/transformers/blob/831bc25d8fdb85768402f772cf65cc3d7872b211/src/transformers/trainer.py#L2010-L2016

Same for DeepSpeed:

https://github.com/microsoft/DeepSpeed/blob/bcc617a0009dd27b4e144de59979bd7770eaf57c/deepspeed/runtime/engine.py#L448-L458

What backend were you using?

shijie-wu avatar Mar 02 '24 22:03 shijie-wu

DeepSpeed ZeRO-2.

It seems likely that the type hint is not universally correct. The value returned by scaled_global_norm for ZeRO-2 is a scalar tensor, and that value is subsequently assigned to _global_grad_norm without any .item() call.
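A defensive conversion before logging would cover both return types (a sketch only; to_scalar is a hypothetical helper, not code from this PR):

import torch

def to_scalar(value):
    """Hypothetical helper: normalize a grad-norm value before logging."""
    # DeepSpeed's get_global_grad_norm() can return a 0-dim tensor (observed
    # with ZeRO-2); other code paths return a Python float or None.
    if isinstance(value, torch.Tensor):
        return value.item()
    return value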

152334H avatar Mar 02 '24 22:03 152334H

not sure if this was mentioned anywhere, but this PR breaks training checkpoint saving because

  1. the grad norm is added to TrainerState.log_history as a tensor
  2. TrainerState.save_to_json attempts to jsonify that tensor, which naturally errors out as Tensors can't be jsonified

I'm facing the same issue with DeepSpeed stage 1. Can you please fix this? I need to use v4.38.0 for a different fix.

shubhanjan99 avatar Mar 04 '24 19:03 shubhanjan99

Can you all try installing with pip install git+https://github.com/huggingface/transformers@muellerzr-deepspeed-item?

This PR may have fixed this as well: https://github.com/huggingface/transformers/pull/29444

muellerzr avatar Mar 04 '24 19:03 muellerzr

Can you all try installing with pip install git+https://github.com/huggingface/transformers@muellerzr-deepspeed-item?

That fixed it for me! Thanks a lot

shubhanjan99 avatar Mar 04 '24 21:03 shubhanjan99

Same error here:

11%|████████████████████████▏ | 800/7050 [4:07:59<32:10:41, 18.53s/it]
Trainer is attempting to log a value of "2.204314947128296" of type <class 'torch.Tensor'> for key "train/grad_norm" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.

Can you please tell me how to fix it?
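In the meantime, one possible stopgap is to cast the value before it reaches the reporting integrations (a sketch only, assuming the tensor arrives in the logs dict passed to Trainer.log; GradNormCastingTrainer is a hypothetical name, not the upstream fix):

import torch

from transformers import Trainer


class GradNormCastingTrainer(Trainer):
    """Hypothetical workaround: convert a tensor-valued grad_norm to a float before logging."""

    def log(self, logs):
        if isinstance(logs.get("grad_norm"), torch.Tensor):
            logs["grad_norm"] = logs["grad_norm"].item()
        super().log(logs)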

lucasjinreal avatar Mar 19 '24 02:03 lucasjinreal