
[BUG] Resume from checkpoint Out of memory error (SIGTERM: Killed) for a large model

Open ozanciga opened this issue 2 years ago • 5 comments

Describe the bug

  • Using the latest versions of transformers and deepspeed.
  • It is possible to fit a ~1.5B-parameter model and train it from scratch with batch sizes up to 32 (with gradient checkpointing and bf16) on a single RTX 3090, which is amazing btw.
  • However, if I try to resume from a checkpoint, even with a batch size of 1, there is an OOM error. The message is SIGTERM: Killed.
  • I believe this is a bug, since the model fits easily (training takes up only 22 GB!) when you start a fresh job, but resuming does not work. It is also really painful to have to restart a job that has been training for 2 weeks, knowing that your weights are valid (verified by running inference on the checkpoint with good results) yet unusable for further training.

To Reproduce

I'm working with a custom diffusion model, but you can verify this with the following models of similar size: OPT-1.3b, whisper-large, Stable Diffusion 1.4. Train for 10 iterations and save a checkpoint, then restart from it. The memory consumption jumps to the maximum, then OOM.
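
For concreteness, a rough sketch of the reproduction with OPT-1.3b (the dataset, output path, and ds_config.json name below are placeholders, not my actual script):

    # Sketch of the reproduction; model, dataset, and config names are placeholders.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

    ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
    ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                remove_columns=ds.column_names)
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    args = TrainingArguments(
        output_dir="repro",
        per_device_train_batch_size=1,
        gradient_checkpointing=True,
        bf16=True,
        max_steps=10,
        save_steps=10,                 # write one checkpoint at step 10
        deepspeed="ds_config.json",    # a DeepSpeed ZeRO config file (placeholder path)
    )

    trainer = Trainer(model=model, args=args, train_dataset=ds,
                      data_collator=collator, tokenizer=tokenizer)
    trainer.train()                    # first run: trains and saves fine

    # Second run: relaunch with the line below instead; RAM climbs until the process is killed.
    # trainer.train(resume_from_checkpoint="repro/checkpoint-10")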

Expected behavior

Resuming from a checkpoint should use roughly the same memory as starting a fresh run with the same batch size, and training should continue instead of the process being killed.

ds_report output

Please run ds_report to give us details about your setup.

System info (please complete the following information):

  • OS: Ubuntu 22
  • GPU count and types: single RTX 3090 (24 GB)
  • Python version: 3.9
  • RAM/CPU: 64 GB RAM, AMD Ryzen 3950X

Launcher context

deepspeed train.py

Thank you!

ozanciga avatar Mar 27 '23 17:03 ozanciga

To add to this: every time I save a checkpoint with optimizer states, I notice an increase in RAM use. At a certain point it hits OOM and the script crashes, as above. I see this as a bug, since it is possible to start a fresh training job with the same weights without crashing, yet over the course of training the memory use keeps increasing even though the batch size stays the same.
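
One way to see the growth is to log the host RSS at each save; a minimal sketch using psutil and a custom callback (illustrative, not part of my training script):

    # Sketch: print host (CPU) RSS after every checkpoint save to watch for growth.
    import psutil
    from transformers import TrainerCallback

    class RamLoggerCallback(TrainerCallback):
        def on_save(self, args, state, control, **kwargs):
            rss_gb = psutil.Process().memory_info().rss / 1024**3
            print(f"[step {state.global_step}] host RSS: {rss_gb:.1f} GiB")

    # trainer.add_callback(RamLoggerCallback())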

ozanciga avatar Mar 29 '23 15:03 ozanciga

@ozanciga, can you please share a stack trace of the OOM? Is the OOM on the CPU or the GPU? Can you share your run script?

We may not be able to repro the OOM since our hardware is different, but the above information can help the investigation.

tjruwase avatar Mar 30 '23 12:03 tjruwase

Hey @tjruwase, certainly, here it is below. Also, the OOM happens in RAM, not VRAM (verified with htop).

Regarding the code, it is pretty standard transformers Trainer boilerplate, also attached. Please note that the same code works for smaller models, so I'm not sure the code itself has any bearing on the result.

    from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

    training_args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=train_batch_size,
        gradient_accumulation_steps=1,
        learning_rate=learning_rates[model_name],
        # weight_decay=0.1,
        adam_beta1=0.9,
        adam_beta2=0.98,
        adam_epsilon=1e-6,
        warmup_ratio=0.02, 
        lr_scheduler_type='linear',
        max_steps=num_train_steps,
        num_train_epochs=num_train_epochs,
        gradient_checkpointing=True,
        bf16=True,
        evaluation_strategy="steps",
        save_strategy="steps",
        per_device_eval_batch_size=eval_batch_size,
        predict_with_generate=False,
        eval_accumulation_steps=1,
        generation_max_length=225,
        save_steps=500,
        eval_steps=5_000_000,
        logging_steps=25,
        report_to=["tensorboard"],
        greater_is_better=False,
        push_to_hub=False,
        # ignore_data_skip=True,
        deepspeed="ds_config_gptj.json",
    )

    trainer = Seq2SeqTrainer(
        args=training_args,
        model=model,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        data_collator=data_collator,
        tokenizer=tokenizer,
    )

    trainer.train(
        resume_from_checkpoint='checkpoint-1000',
    )

[2023-03-30 08:41:18,386] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-03-30 08:41:18,395] [INFO] [runner.py:550:main] cmd = /home/hsa/anaconda3/envs/trlx/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train_hf.py
[2023-03-30 08:41:19,618] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-03-30 08:41:19,618] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-03-30 08:41:19,618] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-03-30 08:41:19,618] [INFO] [launch.py:162:main] dist_world_size=1
[2023-03-30 08:41:19,618] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-03-30 08:41:46,344] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Using /home/hsa/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/hsa/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.1861093044281006 seconds
Using /home/hsa/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Emitting ninja build file /home/hsa/.cache/torch_extensions/py39_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.6080682277679443 seconds
Parameter Offload: Total persistent parameters: 1195520 in 742 params
Using /home/hsa/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0018641948699951172 seconds
[2023-03-30 08:42:32,825] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 318387
[2023-03-30 08:42:32,847] [ERROR] [launch.py:324:sigkill_handler] ['/home/hsa/anaconda3/envs/trlx/bin/python3.9', '-u', 'train_hf.py', '--local_rank=0'] exits with return code = -9
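
For reference, ds_config_gptj.json is a ZeRO-3 config with CPU offload (hence the cpu_adam build and the Parameter Offload line above). I haven't pasted the exact file; it is roughly of this shape (illustrative values only; TrainingArguments.deepspeed also accepts a dict like this directly):

    # Illustrative ZeRO-3 + CPU-offload config; the real ds_config_gptj.json may differ.
    ds_config = {
        "bf16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            "offload_param": {"device": "cpu", "pin_memory": True},
            "overlap_comm": True,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
    }

With the optimizer offloaded, the optimizer states (and the checkpointed copies read back on resume) live in host RAM, which matches the OOM showing up in RAM rather than VRAM.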

ozanciga avatar Mar 30 '23 12:03 ozanciga

You can try following the suggestion of adding more memory via a swap file. We faced a similar issue before and that resolved it.
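
To verify that the extra swap is actually available to the process, a quick check (assuming psutil is installed):

    # Print physical RAM and swap headroom before resuming.
    import psutil

    print(psutil.virtual_memory())  # total / available physical RAM
    print(psutil.swap_memory())     # total / used / free swap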

RameshArvind avatar Apr 11 '23 20:04 RameshArvind

@ozanciga, sorry that you are losing training cycles due to the inability to load the checkpoints.

I think the problem might be related to https://github.com/microsoft/DeepSpeed/issues/3303. One way to confirm is to inspect the checkpoint folder size. Can you please share the results of running du -a -h --max-depth=1 on your checkpoint folder?
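
If du is not handy, the sketch below sizes each entry in the checkpoint folder from Python (checkpoint-1000 is just an example path):

    # Rough Python equivalent of du -h --max-depth=1 for a checkpoint folder.
    import os

    def dir_size(path):
        total = 0
        for root, _, files in os.walk(path):
            for name in files:
                fp = os.path.join(root, name)
                if not os.path.islink(fp):
                    total += os.path.getsize(fp)
        return total

    ckpt = "checkpoint-1000"  # example path
    for entry in sorted(os.listdir(ckpt)):
        p = os.path.join(ckpt, entry)
        size = dir_size(p) if os.path.isdir(p) else os.path.getsize(p)
        print(f"{size / 1024**3:7.2f} GiB  {entry}")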

tjruwase avatar Apr 25 '23 19:04 tjruwase

Closing as this seems resolved. Please re-open if needed.

tjruwase avatar May 15 '23 15:05 tjruwase