
storing & logging gradient norm in trainer

Open shijie-wu opened this issue 1 year ago • 9 comments

What does this PR do?

Report gradient norm during training - Fixes #26143
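For illustration, once the norm is reported it should show up in the logs passed to callbacks under a grad_norm key (a minimal sketch, assuming that key name; GradNormPrinter is a hypothetical callback and not part of this PR):

from transformers import TrainerCallback


class GradNormPrinter(TrainerCallback):
    """Hypothetical callback: print the reported gradient norm at each logging step."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "grad_norm" in logs:
            print(f"step {state.global_step}: grad_norm = {logs['grad_norm']}")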

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

@muellerzr @pacman100

shijie-wu avatar Nov 06 '23 17:11 shijie-wu

cc @muellerzr

amyeroberts avatar Nov 06 '23 18:11 amyeroberts

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jan 01 '24 08:01 github-actions[bot]

Thank you for the work on this, @shijie-wu!

It may seem like a little PR to some, but this would be a huge step to bring transformers closer to parity with projects like gpt-neox for large-scale training.

mjbommar avatar Jan 02 '24 15:01 mjbommar

Gentle ping @shijie-wu :)

muellerzr avatar Jan 05 '24 14:01 muellerzr

Found that self.accelerator.clip_grad_norm_ will return None when using DeepSpeed with the Trainer. With DeepSpeed we should instead use model.get_global_grad_norm() to get the gradient norm:

# Clip gradients and capture the norm. Under DeepSpeed, Accelerate's
# clip_grad_norm_ returns None, so query the engine for the global grad norm.
_grad_norm = self.accelerator.clip_grad_norm_(
    model.parameters(),
    args.max_grad_norm,
)
if self.accelerator.distributed_type == DistributedType.DEEPSPEED:
    grad_norm = model.get_global_grad_norm()
else:
    grad_norm = _grad_norm.item() if _grad_norm is not None else None

jubgjf avatar Jan 08 '24 10:01 jubgjf

sorry for the delay! PTAL @muellerzr @mjbommar

shijie-wu avatar Jan 17 '24 22:01 shijie-wu

Gentle ping @muellerzr @mjbommar :)

shijie-wu avatar Jan 24 '24 19:01 shijie-wu

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

cc @amyeroberts for final review :)

muellerzr avatar Feb 16 '24 18:02 muellerzr

Not sure if this was mentioned anywhere, but this PR breaks training checkpoint saving because:

  1. the grad norm is added to TrainerState.log_history as a tensor
  2. TrainerState.save_to_json attempts to serialize that tensor to JSON, which naturally errors out since tensors aren't JSON-serializable (a minimal repro is sketched below)
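For reference, a minimal repro of the failure (a sketch only; the entry dict and its values are illustrative, not taken from an actual log_history):

import json

import torch

# Hypothetical log_history entry carrying a tensor-valued grad_norm
entry = {"loss": 1.23, "grad_norm": torch.tensor(2.2043), "step": 800}

try:
    json.dumps(entry)
except TypeError as err:
    print(f"json.dumps failed: {err}")  # tensors are not JSON serializable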

my fix for this was to patch save_to_json to the following:

    def save_to_json(self, json_path: str):
        """Save the content of this instance in JSON format inside `json_path`."""
        selfd = dataclasses.asdict(self)
        # grad_norm may be stored as a 0-dim tensor; convert it to a plain float
        # so json.dumps can serialize the log history.
        for d in selfd["log_history"]:
            if "grad_norm" in d and hasattr(d["grad_norm"], "item"):
                d["grad_norm"] = d["grad_norm"].item()
        json_string = json.dumps(selfd, indent=2, sort_keys=True) + "\n"
        with open(json_path, "w", encoding="utf-8") as f:
            f.write(json_string)

but this is probably not the best approach

152334H avatar Mar 02 '24 13:03 152334H

@152334H it does convert grad_norm to a number before passing it into _maybe_log_save_evaluate:

https://github.com/huggingface/transformers/blob/831bc25d8fdb85768402f772cf65cc3d7872b211/src/transformers/trainer.py#L2010-L2016

Same for DeepSpeed:

https://github.com/microsoft/DeepSpeed/blob/bcc617a0009dd27b4e144de59979bd7770eaf57c/deepspeed/runtime/engine.py#L448-L458

What backend were you using?

shijie-wu avatar Mar 02 '24 22:03 shijie-wu

DeepSpeed ZeRO-2.

It seems likely that the type hint is not universally correct. The value returned by scaled_global_norm for ZeRO-2 is a scalar tensor, and that value is subsequently assigned to _global_grad_norm without any .item() call.
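A defensive conversion before logging would cover both return types (a sketch only; to_scalar is a hypothetical helper, not code from this PR):

import torch

def to_scalar(value):
    """Hypothetical helper: normalize a grad-norm value before logging."""
    # DeepSpeed's get_global_grad_norm() can return a 0-dim tensor (observed
    # with ZeRO-2); other code paths return a Python float or None.
    if isinstance(value, torch.Tensor):
        return value.item()
    return value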

152334H avatar Mar 02 '24 22:03 152334H

not sure if this was mentioned anywhere, but this PR breaks training checkpoint saving because

  1. the grad norm is added to TrainerState.log_history as a tensor
  2. TrainerState.save_to_json attempts to jsonify that tensor, which naturally errors out as Tensors can't be jsonified

I'm facing the same issue with DeepSpeed stage 1. Can you please fix this? I need to use v4.38.0 for a different fix.

shubhanjan99 avatar Mar 04 '24 19:03 shubhanjan99

Can you all try installing with pip install git+https://github.com/huggingface/transformers@muellerzr-deepspeed-item?

This PR may have fixed this as well: https://github.com/huggingface/transformers/pull/29444

muellerzr avatar Mar 04 '24 19:03 muellerzr

Can you all try installing with pip install git+https://github.com/huggingface/transformers@muellerzr-deepspeed-item?

That fixed it for me! Thanks a lot

shubhanjan99 avatar Mar 04 '24 21:03 shubhanjan99

Same error here:

11%|████████████████████████▏ | 800/7050 [4:07:59<32:10:41, 18.53s/it]
Trainer is attempting to log a value of "2.204314947128296" of type <class 'torch.Tensor'> for key "train/grad_norm" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.

Can you please tell me how to fix it?
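In the meantime, one possible stopgap is to cast the value before it reaches the reporting integrations (a sketch only, assuming the tensor arrives in the logs dict passed to Trainer.log; GradNormCastingTrainer is a hypothetical name, not the upstream fix):

import torch

from transformers import Trainer


class GradNormCastingTrainer(Trainer):
    """Hypothetical workaround: convert a tensor-valued grad_norm to a float before logging."""

    def log(self, logs):
        if isinstance(logs.get("grad_norm"), torch.Tensor):
            logs["grad_norm"] = logs["grad_norm"].item()
        super().log(logs)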

lucasjinreal avatar Mar 19 '24 02:03 lucasjinreal