[BUG] INFLIGHT parameters after evaluation
Describe the bug I adapted my training process from the Hugging Face trainer.py, so most of my trainer is similar to theirs. My model includes a language model and an external module that learns additional parameters using information from the main model. I put the main and external parameters in separate groups in the optimizer. While testing the code on larger models with DeepSpeed, I hit an assertion error after 50 training steps and 1 evaluation round. The error concerns the embedding matrix, which remains in "INFLIGHT" status after training resumes, while all other parameters are "AVAILABLE".
AssertionError: {'id': 786, 'status': 'INFLIGHT', 'numel': 38603520, 'ds_numel': 38603520, 'shape': (50265, 768), 'ds_shape': (50265, 768), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': {3}}
The program runs fine if I simply train a language model without an external module.
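The two parameter groups are set up roughly like this (a simplified sketch; the module names, sizes, and learning rates below are placeholders, not my actual code):
import torch
import torch.nn as nn

# Placeholder stand-ins for the real modules; names and sizes are illustrative only.
lm = nn.Linear(768, 768)              # stands in for the main language model
external_module = nn.Linear(768, 16)  # stands in for the external module

# Main and external parameters go into separate optimizer groups,
# each with its own hyperparameters (the learning rates here are placeholders).
optimizer = torch.optim.AdamW(
    [
        {"params": lm.parameters(), "lr": 1e-5},
        {"params": external_module.parameters(), "lr": 1e-4},
    ]
)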
To Reproduce I can provide the code if necessary, but I hope I can get some help on how to further debug this issue :)
Expected behavior I expected training to continue without errors after evaluation.
System info (please complete the following information):
- OS: Linux 4.18.0-425.13.1.el8_7.x86_64
- GPU count and types: 1 machine with 2x A100s
- Python version: python 3.8
Launcher context I am using deepspeed to launch my program.
I am experiencing the same issue on a single node using Lightning and a transformers.BertModel, at the second epoch of Lightning's "sanity checking":
assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary() AssertionError: {'id': 0, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 23440896, 'shape': (0,), 'ds_shape': (30522, 768), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {702}}
System info
- OS: Ubuntu 20.04.5 LTS, kernel: Linux 5.15.0-1030-gcp, arch: x86-64
- GPU count and types: 1 x A100-SXM4-40GB
- Python version: 3.9.16
Relevant dependencies
- lightning==2.0.0
- transformers==4.27.1
- deepspeed==0.8.3
- torch==2.0.0
Hello @xiamengzhou @vlievin. Thank you for reporting the issue to us! This issue seems related to the parameter partitioning in ZeRO-3. Could you provide some scripts for us to reproduce this issue?
Hi @HeyangQin, my issue is definitely related to stage 3; I don't experience it with stage 2. I will try to isolate the bug in a small script next week.
Hi @HeyangQin, here is a small script with minimal code to reproduce my problem. I am not sure this problem is related to deepspeed; it might be an issue for lightning: in the code below, if I comment out the trainer.validate(...) call, everything works fine.
Minimal example
from typing import Any
import datasets
import deepspeed
import lightning.pytorch as pl
import torch
import transformers
from lightning.pytorch import strategies
class Collate:
    def __init__(self, tokenizer, max_length=512):
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __call__(self, examples: list[dict[str, Any]]) -> dict[str, torch.Tensor]:
        inputs = [example["question"] for example in examples]
        encodings = self.tokenizer(
            inputs,
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )
        return dict(encodings)


class DummyModel(pl.LightningModule):
    def __init__(self, bert: transformers.BertModel):
        super().__init__()
        self.bert = bert

    def forward(self, batch: dict[str, torch.Tensor]):
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        output = self.bert(input_ids, attention_mask=attention_mask)
        loss = output.pooler_output.mean()
        return loss

    def training_step(self, batch: dict[str, Any], *args: Any, **kwargs: Any):
        loss = self.forward(batch)
        return loss

    def validation_step(self, batch: dict[str, Any], *args: Any, **kwargs: Any):
        loss = self.forward(batch)
        return loss

    def configure_optimizers(self):
        return deepspeed.ops.adam.DeepSpeedCPUAdam(self.parameters(), lr=1e-5)


if __name__ == "__main__":
    dataset = datasets.DatasetDict(dict(
        train=datasets.load_dataset("squad", split="train[:1%]"),
        validation=datasets.load_dataset("squad", split="validation[:1%]"),
    ))
    model = DummyModel(transformers.AutoModel.from_pretrained("bert-base-uncased"))
    tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
    train_dataloader = torch.utils.data.DataLoader(
        dataset["train"],
        batch_size=32,
        collate_fn=Collate(tokenizer),
        num_workers=4,
    )
    val_dataloader = torch.utils.data.DataLoader(
        dataset["validation"],
        batch_size=32,
        collate_fn=Collate(tokenizer),
        num_workers=4,
    )
    trainer = pl.Trainer(
        devices=1,
        accelerator="gpu",
        precision="16-mixed",
        strategy=pl.strategies.DeepSpeedStrategy(
            stage=3,
            offload_optimizer=True,
            offload_parameters=True,
        ),
    )
    trainer.validate(model, dataloaders=val_dataloader)  # <-- no problem when commenting this out; the problem is probably related to `lightning`
    trainer.fit(model, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader)
Dependencies
Created an environment with
conda create --name=debug-deepspeed python=3.9
conda activate debug-deepspeed
pip install torch==2.0
pip install lightning==2.0
pip install transformers datasets deepspeed
pip freeze:
aiohttp==3.8.4
aiosignal==1.3.1
anyio==3.6.2
arrow==1.2.3
async-timeout==4.0.2
attrs==22.2.0
beautifulsoup4==4.12.0
blessed==1.20.0
certifi @ file:///croot/certifi_1671487769961/work/certifi
charset-normalizer==3.1.0
click==8.1.3
cmake==3.26.1
croniter==1.3.8
datasets==2.10.1
dateutils==0.6.12
deepdiff==6.3.0
deepspeed==0.8.3
dill==0.3.6
dnspython==2.3.0
email-validator==1.3.1
fastapi==0.88.0
filelock==3.10.7
frozenlist==1.3.3
fsspec==2023.3.0
h11==0.14.0
hjson==3.1.0
httpcore==0.16.3
httptools==0.5.0
httpx==0.23.3
huggingface-hub==0.13.3
idna==3.4
inquirer==3.1.3
itsdangerous==2.1.2
Jinja2==3.1.2
lightning==2.0.0
lightning-cloud==0.5.32
lightning-utilities==0.8.0
lit==16.0.0
markdown-it-py==2.2.0
MarkupSafe==2.1.2
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
networkx==3.0
ninja==1.11.1
numpy==1.24.2
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
ordered-set==4.1.0
orjson==3.8.8
packaging==23.0
pandas==1.5.3
psutil==5.9.4
py-cpuinfo==9.0.0
pyarrow==11.0.0
pydantic==1.10.7
Pygments==2.14.0
PyJWT==2.6.0
python-dateutil==2.8.2
python-dotenv==1.0.0
python-editor==1.0.4
python-multipart==0.0.6
pytorch-lightning==2.0.0
pytz==2023.2
PyYAML==6.0
readchar==4.0.5
regex==2023.3.23
requests==2.28.2
responses==0.18.0
rfc3986==1.5.0
rich==13.3.3
six==1.16.0
sniffio==1.3.0
soupsieve==2.4
starlette==0.22.0
starsessions==1.3.0
sympy==1.11.1
tokenizers==0.13.2
torch==2.0.0
torchmetrics==0.11.4
tqdm==4.65.0
traitlets==5.9.0
transformers==4.27.3
triton==2.0.0
typing_extensions==4.5.0
ujson==5.7.0
urllib3==1.26.15
uvicorn==0.21.1
uvloop==0.17.0
watchfiles==0.19.0
wcwidth==0.2.6
websocket-client==1.5.1
websockets==10.4
xxhash==3.2.0
yarl==1.8.2
Same issue encountered; I use pure DeepSpeed without Lightning.
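For reference, a minimal ZeRO-3 configuration with CPU offload of the kind that hits this (roughly equivalent to the Lightning strategy in the script above) looks like the following sketch; the batch size and fp16 settings are placeholders, not an exact copy of any setup in this thread:
# Sketch of a ZeRO-3 + CPU-offload config; values are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
}

# Typical usage (model construction omitted):
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )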
Any update or temporary workaround regarding this issue?
Same issue encountered. It happens when I use accelerator.backward(loss) and then use the accelerator to get the model state dict with accelerator.get_state_dict(model).
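The pattern is roughly the following (a simplified sketch with a toy model and dataset; it assumes the Accelerator has been configured with a DeepSpeed ZeRO-3 plugin, e.g. via accelerate config):
import torch
from accelerate import Accelerator

# Toy model and data as placeholders; the Accelerator is assumed to be set up
# with a DeepSpeed ZeRO-3 plugin.
accelerator = Accelerator()
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # backward goes through the accelerator
    optimizer.step()
    optimizer.zero_grad()

# Gather the full state dict; under ZeRO-3 this collects the sharded parameters.
state_dict = accelerator.get_state_dict(model)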
Same issue.
@HeyangQin could you take a look at this? Thanks very much.
I've been trying to come up with a small standalone repro for this, but unfortunately it's not entirely clear what triggers the behavior. A few things I've noticed though that seem consistent with what others have said:
- Only happens when using zero stage 3
- Is triggered specifically when the model is trained for an epoch, then evaluated, then trained for the next epoch (on the first batch of the second epoch)
- Is triggered in the forward pass when calling an embedding
A couple of things that I was able to do to work around the issue:
- Vary the batch size. Some batch sizes seem to trigger the issue, others don't.
- Avoid calling model.eval(). For some reason this consistently prevents the error from happening for me (one way to apply this in Lightning is sketched below).
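For Lightning users, one way to apply the model.eval() workaround is to override the LightningModule hooks that switch the module into eval mode around validation. This is only a diagnostic sketch (the class name is illustrative), not a proper fix: it also keeps dropout and batch norm in training mode during validation.
import lightning.pytorch as pl


class KeepTrainModeDuringValidation(pl.LightningModule):
    # Sketch of the "avoid model.eval()" workaround: Lightning normally calls
    # self.eval() in on_validation_model_eval and restores train mode in
    # on_validation_model_train; making both hooks no-ops keeps the module in
    # train mode throughout validation.

    def on_validation_model_eval(self) -> None:
        pass  # intentionally skip self.eval()

    def on_validation_model_train(self) -> None:
        pass  # nothing to restore; eval mode was never entered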
Hello @vlievin. This issue has been fixed through a collaborative effort with the Lightning team. A special thanks to you for the nice reproduction script; it was super helpful. Please update both deepspeed and lightning to apply the fix.
Thank you @HeyangQin, this fix is super helpful!
Thanks for the fix @HeyangQin! Are there plans for a patch release next week containing this fix?
Hello @tgaddair. We plan to release a new deepspeed version by next week. I will keep you posted
I was still getting this error after updating to deepspeed 0.9.3 and lightning 2.0.3.
It seems to be fixed when I force my dataset size (both train and val!) to be an exact multiple of the batch size, which is consistent with one of @tgaddair's observations (rough sketch below).
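A rough sketch of what I mean, with a toy dataset and a placeholder batch size:
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Toy dataset and batch size as placeholders.
batch_size = 32
dataset = TensorDataset(torch.randn(1000, 8))

# Truncate so the dataset length is an exact multiple of the batch size.
usable = (len(dataset) // batch_size) * batch_size
dataset = Subset(dataset, range(usable))
assert len(dataset) % batch_size == 0

loader = DataLoader(dataset, batch_size=batch_size)
# (DataLoader's drop_last=True drops the trailing partial batch in a similar
#  way, though I haven't verified that it avoids the error.)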
Hi @pretzel583. Could you share a reproduction script so we can better investigate the issue? Thank you.
Same issue.
@pretzel583 What should the values of batch_size and mini_batch_size be, respectively?