
XGLM - Fix Softmax NaNs when using FP16

gsarti opened this pull request 2 years ago · 6 comments

What does this PR do?

Fixes #18049 following the exact same procedure used in #17437. Besides the added test, I also evaluated the fix on my personal use case and found the behavior of the fixed model to be consistent between single and batched generation.
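For reference, the pattern from #17437 (which this PR mirrors for XGLM, as far as I understand it) is to compute the softmax in float32 whenever the attention weights are in float16, then cast back. A minimal sketch of that pattern, with names and shapes simplified rather than copied from modeling_xglm.py:

```python
import torch
import torch.nn as nn

def stable_softmax(attn_weights: torch.Tensor) -> torch.Tensor:
    # When attention scores are computed in fp16, large negative mask values can
    # overflow and the softmax produces NaNs. Upcasting to fp32 just for the
    # softmax keeps the numerics stable; we then cast back to fp16.
    if attn_weights.dtype == torch.float16:
        return nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(torch.float16)
    return nn.functional.softmax(attn_weights, dim=-1)
```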

Before submitting

  • [x] Did you read the contributor guideline, Pull Request section?
  • [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [x] Did you write any new necessary tests?

Who can review?

@patil-suraj @ydshieh @patrickvonplaten

gsarti avatar Jul 07 '22 14:07 gsarti

The documentation is not available anymore as the PR was closed or merged.

@patil-suraj I think only your check is missing!

gsarti avatar Aug 03 '22 14:08 gsarti

Sorry for being so late here @gsarti! Merged master into it to trigger CircleCI here.

patrickvonplaten avatar Aug 23 '22 17:08 patrickvonplaten

Hey @gsarti - it seems like a test is failing now:

tests/models/xglm/test_modeling_xglm.py::XGLMModelTest::test_xglm_model_past

with

 UnboundLocalError: local variable 'dtype_attn_weights' referenced before assignment

patrickvonplaten avatar Aug 23 '22 18:08 patrickvonplaten

> Hey @gsarti - it seems like a test is failing now:
>
> tests/models/xglm/test_modeling_xglm.py::XGLMModelTest::test_xglm_model_past
>
> with
>
> UnboundLocalError: local variable 'dtype_attn_weights' referenced before assignment

I noticed this when running the code as well. My understanding is that setting dtype_attn_weights to torch.float32 by default beforehand would fix the issue and preserve the expected behavior. Could you double-check?
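Concretely, dtype_attn_weights is only assigned inside the float16 branch, so the float32 path hits the UnboundLocalError when the variable is read later. A simplified sketch of the suggested default (the real attention code also reshapes the weights and applies head masks, which I've left out):

```python
import torch
import torch.nn as nn

def softmax_with_optional_upcast(attn_weights: torch.Tensor) -> torch.Tensor:
    # Give the variable a default so it is always bound, even when the fp16
    # upcast branch below is skipped (i.e. the model runs in full precision).
    dtype_attn_weights = torch.float32
    if attn_weights.dtype == torch.float16:
        # Remember the original dtype so we can cast back after the fp32 softmax.
        dtype_attn_weights = torch.float16
        attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32)
    else:
        attn_weights = nn.functional.softmax(attn_weights, dim=-1)
    # Without the default above, this line raised
    # "UnboundLocalError: local variable 'dtype_attn_weights' referenced before assignment"
    # on the fp32 path.
    return attn_weights.to(dtype_attn_weights)
```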

gsarti avatar Aug 24 '22 14:08 gsarti

Hi @gsarti, sorry for being late on this PR. I re-opened it and gave a suggestion for fixing the failing test. Would you like to update this PR after rebasing your working branch on an updated main branch?

ydshieh avatar Sep 27 '22 14:09 ydshieh

Hi @gsarti, I made the necessary change to pass the tests and pushed it to your branch directly. The remaining failing test is unrelated to this PR, but I will wait until tomorrow to check again, then I will merge.

cc @patrickvonplaten and @younesbelkada

ydshieh avatar Sep 28 '22 19:09 ydshieh

Thanks a lot for the fix @ydshieh!! I think for consistency we should apply the same changes to OPT too, I will take care of that first thing in the morning tomorrow 💪

younesbelkada avatar Sep 28 '22 19:09 younesbelkada