NaN in XGLM Softmax with FP16


System Info

  • transformers version: 4.21.0.dev0
  • Platform: Linux-5.3.0-1017-x86_64-with-glibc2.27
  • Python version: 3.9.13
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.12.0+cu102 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes, 4x V100

Who can help?

@ydshieh @patrickvonplaten

Reproduction

Most likely related to the fix in #17437.

I am using an example similar to test_batched_nan_fp16 in test_modeling_opt.py, but for an XGLM model. The only difference from that test is the torch.cuda.amp.autocast usage, which I found necessary for inference (otherwise the forward pass of XGLM raises "expected scalar type Float but found Half").

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Tested with xglm-564M and 7.5B (the latter using `infer_auto_device_map` and
# `load_checkpoint_and_dispatch` from `accelerate`).
tokenizer = AutoTokenizer.from_pretrained("facebook/xglm-564M", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("facebook/xglm-564M", torch_dtype=torch.float16, use_cache=True).cuda()

batch = tokenizer(["Who are you?", "Joe Biden is the president of"], padding=True, return_tensors="pt")

input_ids = batch["input_ids"].cuda()
attention_mask = batch["attention_mask"].cuda()

with torch.no_grad():
    with torch.cuda.amp.autocast():
        outputs = model(input_ids, attention_mask=attention_mask)
        assert not torch.isnan(outputs.logits[0]).any().item() # Raises an AssertionError

Expected behavior

I would expect the model to produce normal logits when using FP16. I spotted this bug while investigating garbage outputs during batched generation, despite using left padding and a valid attention mask.
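
A minimal sketch of the batched generation setup in which the garbage outputs show up, reusing model, tokenizer, input_ids and attention_mask from the snippet above (the max_new_tokens value here is arbitrary):

with torch.no_grad(), torch.cuda.amp.autocast():
    # Left-padded batched generation in fp16; the NaN logits surface as garbage text,
    # typically for the shorter (more heavily padded) sequence in the batch.
    generated = model.generate(input_ids, attention_mask=attention_mask, max_new_tokens=20)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))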

gsarti avatar Jul 06 '22 23:07 gsarti

cc @patil-suraj for XGLM

patrickvonplaten avatar Jul 07 '22 12:07 patrickvonplaten

From padding_side="left" in the code snippet, I would guess this is similar to issue #17433, and the fix in #17437 should resolve it. @gsarti Would you like to try it and maybe also open a PR?

P.S. I was actually wondering yesterday whether I should apply #17437 to all models, and tried a few (e.g. GPT2, Bart), which are fine. Maybe it is indeed better to do it for all models.
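
For context, a minimal sketch of the kind of clamp used in OPT-style attention to avoid this class of NaN, assuming the referenced fix follows the same pattern (the function and variable names below are illustrative, not the actual modeling code):

import torch

def clamp_masked_attn_weights(attn_weights, attention_mask):
    # Add the additive attention mask (large negative values at padded positions),
    # then clamp to the dtype minimum so that a row of fully masked (left-padded)
    # positions does not become all -inf, which would make the fp16 softmax return NaN.
    attn_weights = attn_weights + attention_mask
    attn_weights = torch.max(
        attn_weights,
        torch.tensor(torch.finfo(attn_weights.dtype).min, device=attn_weights.device),
    )
    return attn_weights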

ydshieh avatar Jul 07 '22 13:07 ydshieh

Opened PR #18057 with the suggested fix, @ydshieh; it should be good to go!

gsarti avatar Jul 07 '22 14:07 gsarti

Can we close this one? cc @ydshieh

patrickvonplaten avatar Sep 27 '22 11:09 patrickvonplaten

I re-opened PR #18057. Let's see if @gsarti would like to continue the work; otherwise I can take it. The necessary fix is minimal.

ydshieh avatar Sep 27 '22 14:09 ydshieh