NaN when training t5-large with bf16 on multiple GPUs
System Info
- `transformers` version: 4.20.1
- Platform: Linux-5.15.0-1017-gcp-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.8.1
- PyTorch version (GPU?): 1.12.0+cu113 (True)
- Tensorflow version (GPU?): 2.4.4 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes
Who can help?
@patrickvonplaten
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
I'm getting `nan` immediately when training t5-large using bfloat16 on multiple GPUs, but when I run the same script on a single GPU it's fine. I've made a small example below, which I'm running on a machine with 2 A100s. If I do `CUDA_VISIBLE_DEVICES=0 python script.py` the loss is fine, but if I just do `python script.py` I get `nan` from the first iteration.
```python
from typing import List, Tuple

import torch
from torch.utils.data import Dataset, DataLoader

import transformers


class MyDataset(Dataset):
    def __init__(
        self,
        data: List[List[str]],
        tokenizer: transformers.PreTrainedTokenizerFast,
    ) -> None:
        super().__init__()
        self._data = data
        self._tokenizer = tokenizer

    def __len__(
        self,
    ) -> int:
        return len(self._data)

    def __getitem__(
        self,
        index: int,
    ) -> List[str]:
        return self._data[index]

    def collate_fn(
        self,
        batch: List[List[str]],
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
        prompts = [b[0] for b in batch]
        targets = [b[1] for b in batch]

        prompts_tokenized = self._tokenizer(
            text=prompts,
            padding=True,
            return_tensors="pt",
            return_attention_mask=True,
        )
        prompts_input_ids = prompts_tokenized["input_ids"]
        prompts_attention_mask = prompts_tokenized["attention_mask"]

        targets_tokenized = self._tokenizer(
            text=targets,
            padding=True,
            return_tensors="pt",
            return_attention_mask=True,
        )
        targets_input_ids = targets_tokenized["input_ids"]
        targets_attention_mask = targets_tokenized["attention_mask"]

        return (
            prompts_input_ids,
            prompts_attention_mask,
            targets_input_ids,
            targets_attention_mask,
        )


if __name__ == "__main__":
    model = transformers.T5ForConditionalGeneration.from_pretrained(
        "t5-large",
    )
    tokenizer = transformers.T5TokenizerFast.from_pretrained(
        "t5-large",
    )

    device = (
        torch.device("cuda:0")
        if torch.cuda.is_available()
        else torch.device("cpu")
    )
    multi_gpu = torch.cuda.device_count() > 1
    if multi_gpu:
        model = torch.nn.DataParallel(model)
    model = model.to(device)

    optimizer = transformers.Adafactor(
        params=model.parameters(),
        lr=1e-4,
        scale_parameter=False,
        relative_step=False,
    )
    grad_scaler = torch.cuda.amp.GradScaler(
        enabled=True,
    )

    my_data = [
        [f"This is sentence {i}.", f"This is sentence {i + 1}."]
        for i in range(1000000)
    ]
    dataset = MyDataset(
        data=my_data,
        tokenizer=tokenizer,
    )
    dataloader = DataLoader(
        dataset=dataset,
        batch_size=8,
        shuffle=True,
        collate_fn=dataset.collate_fn,
    )

    for batch in dataloader:
        with torch.autocast(
            enabled=True,
            device_type=device.type,
            dtype=torch.bfloat16,
        ):
            batch = [b.to(device) for b in batch]
            (
                prompts_input_ids,
                prompts_attention_mask,
                targets_input_ids,
                targets_attention_mask,
            ) = batch

            loss = model(
                input_ids=prompts_input_ids,
                attention_mask=prompts_attention_mask,
                labels=targets_input_ids,
            ).loss
            if multi_gpu:
                loss = loss.mean()

        grad_scaler.scale(loss).backward()
        grad_scaler.step(optimizer)
        grad_scaler.update()
        optimizer.zero_grad()

        print(f"Loss = {loss.item()}")
```
Expected behavior
No NaNs when training t5-large using bfloat16 on multiple GPUs.
@LysandreJik perhaps you could suggest someone who can help with this please?
I believe @stas00 has some experience around bfloat16 and NaNs and may have an idea of where the issue may be coming from.
I haven't tried t5-large yet; I tested your script and it works fine with t5-small. I need to find a box with a few large GPUs to test t5-large.
Meanwhile, we should revisit the scaling.
The main benefit of using bf16 over fp16 is that there is very little risk of overflow: bf16 has the same numerical range as fp32, so no loss scaling is needed here.
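To illustrate, a minimal sketch of the training step with the scaler dropped entirely (reusing the variable names from the reproduction script above; illustrative only, not a drop-in fix):

```python
# Illustrative bf16 step without GradScaler, reusing names from the script above.
for batch in dataloader:
    batch = [b.to(device) for b in batch]
    prompts_input_ids, prompts_attention_mask, targets_input_ids, _ = batch

    with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
        loss = model(
            input_ids=prompts_input_ids,
            attention_mask=prompts_attention_mask,
            labels=targets_input_ids,
        ).loss
        if multi_gpu:
            loss = loss.mean()

    # No scale()/step()/update() on a scaler: bf16 shares fp32's exponent
    # range, so gradients are not expected to overflow.
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```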
But perhaps we are hitting underflow here. There is a special tool we have for that - you can try plugging it in and observing where the underflow (most likely) is happening: https://huggingface.co/docs/transformers/debugging#underflow-and-overflow-detection
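For reference, attaching the detector is a one-liner before the training loop (a minimal sketch based on the linked docs):

```python
from transformers.debug_utils import DebugUnderflowOverflow

# Registers forward hooks on every submodule and reports the frames around
# the first batch where an inf/nan shows up in activations or weights.
debug_overflow = DebugUnderflowOverflow(model)
```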
But then underflow would just lead to no learning, not really NaNs, I think.
I will try to experiment more with it once I'm able to run t5-large.
Thanks @stas00 - I had a go at using the underflow/overflow detection tool, but actually, when I switched from DataParallel to DistributedDataParallel, I didn't get NaNs with this toy example! I'll try some experiments with real data next week and let you know if this solves it.
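For anyone hitting the same thing, a minimal sketch of that DistributedDataParallel setup (assuming a `torchrun --nproc_per_node=2 script.py` launch and reusing `dataset` from the reproduction script above; illustrative, not the exact training code):

```python
# Minimal DDP sketch; each process drives one GPU instead of DataParallel
# replicating the model inside a single process.
import os

import torch
import torch.distributed as dist
import transformers
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)

model = transformers.T5ForConditionalGeneration.from_pretrained("t5-large").to(device)
model = DDP(model, device_ids=[local_rank])

# Each process sees its own shard of the data, so there is no
# DataParallel-style scatter/gather inside the forward pass.
sampler = DistributedSampler(dataset, shuffle=True)
dataloader = DataLoader(
    dataset,
    batch_size=8,
    sampler=sampler,
    collate_fn=dataset.collate_fn,
)
```

With DDP each process computes a scalar loss locally, so the `loss.mean()` reduction that DataParallel required is no longer needed.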
oh, great, then I don't need to look for a set of large GPUs :) Thank you for this update, @harshil-shah!
Indeed, please do let us know when you get a chance to experiment.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.