
load_checkpoint_and_dispatch OOMs

sriniiyer opened this issue 1 year ago · 4 comments

System Info

- `Accelerate` version: 0.18.0
- Platform: Linux-5.15.0-1015-aws-x86_64-with-glibc2.31
- Python version: 3.9.16
- Numpy version: 1.24.1
- PyTorch version (GPU?): 2.0.0.dev20230202+cu116 (False)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - use_cpu: False
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {'deepspeed_config_file': 'deepspeed_z3.json', 'zero3_init_flag': True}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

import sys
import torch
from transformers import LlamaForCausalLM
from accelerate import Accelerator
import deepspeed
from accelerate import load_checkpoint_and_dispatch

def main():
    accelerator = Accelerator()
    # Load the base weights, then load the fine-tuned checkpoint and dispatch it across devices.
    sft_model = LlamaForCausalLM.from_pretrained('llama-7b/')
    # This is where the OOM happens (see the discussion below: the whole checkpoint ends up on GPU 0).
    sft_model = load_checkpoint_and_dispatch(sft_model, 'models/best/pytorch_model.bin', device_map="auto")

    opt = torch.optim.Adam(sft_model.parameters(), lr=1e-5)
    sft_model, opt = accelerator.prepare(sft_model, opt)

    sft_model.train()
    accelerator.train()

if __name__ == "__main__":
    sys.exit(main())

Expected behavior

This OOMs on a node with 8× 80GB A100s, and it should not. I also tried device_map="balanced" and device_map="balanced_low_0" and get the same OOM.

sriniiyer avatar May 18 '23 18:05 sriniiyer
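A minimal sketch of the big-model loading pattern from the Accelerate docs that avoids building a second full CPU copy before dispatch; the per-GPU max_memory figure, the no_split_module_classes value, and the paths are assumptions carried over from the repro, not something confirmed in this thread:

import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig.from_pretrained('llama-7b/')
with init_empty_weights():
    # Build the model skeleton on the meta device so no RAM is spent on a second copy of the weights.
    sft_model = LlamaForCausalLM(config)

sft_model = load_checkpoint_and_dispatch(
    sft_model,
    'models/best/pytorch_model.bin',
    device_map="auto",
    # Keep each decoder block on a single device; class name assumed from the Llama implementation.
    no_split_module_classes=["LlamaDecoderLayer"],
    # Assumed per-GPU cap to leave headroom on each 80GB card; adjust or drop as needed.
    max_memory={i: "70GiB" for i in range(8)},
)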

You cannot send your model to accelerator.prepare if using device_map="auto" (as the model will be split across GPUs already).

sgugger avatar May 18 '23 19:05 sgugger
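A minimal sketch of that pattern, assuming a plain (non-DeepSpeed) Accelerate setup: once the model has been dispatched with device_map="auto", only the optimizer (and dataloaders) go through prepare().

accelerator = Accelerator()

# The model is already split across the GPUs by the dispatch call; do not pass it to prepare().
sft_model = load_checkpoint_and_dispatch(sft_model, 'models/best/pytorch_model.bin', device_map="auto")

opt = torch.optim.Adam(sft_model.parameters(), lr=1e-5)
opt = accelerator.prepare(opt)  # prepare the optimizer, not the dispatched model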

You cannot send your model to accelerator.prepare if using device_map="auto" (as the model will be split across GPUs already).

It looks like load_state_dict is trying to put the entire model on GPU 0. I have also tried balanced and balanced_low_0, as well as infer_auto_device_map, which likewise decides to put the whole model on GPU 0 because a single GPU is large enough to hold it. Setting aside the fact that the model is not being split across GPUs, a single GPU should be more than enough to hold this state_dict, so why does it OOM? I can torch.load() the checkpoint onto a single 80GB GPU without any trouble.

sriniiyer avatar May 18 '23 20:05 sriniiyer
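One way to make infer_auto_device_map produce a multi-GPU map instead of placing everything on GPU 0 is to cap the memory it is allowed to plan for per device; the 12GiB figure below is an arbitrary illustration, not a value suggested in this thread:

from accelerate import infer_auto_device_map, load_checkpoint_and_dispatch

# Artificially low per-GPU budget so the 7B model cannot be assigned to a single card.
max_memory = {i: "12GiB" for i in range(8)}

device_map = infer_auto_device_map(
    sft_model,
    max_memory=max_memory,
    no_split_module_classes=["LlamaDecoderLayer"],  # assumed Llama block name
)
sft_model = load_checkpoint_and_dispatch(
    sft_model, 'models/best/pytorch_model.bin', device_map=device_map
)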

@sgugger from_pretrained works fine, so if I had saved my checkpoint with save_pretrained, everything would be great. Unfortunately, I saved a raw state_dict, and loading that state_dict through accelerate OOMs, even though a single GPU has enough memory to hold the model three times over.

sriniiyer avatar May 19 '23 00:05 sriniiyer
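A possible one-off workaround, not something suggested in the thread: load the raw state_dict on CPU, push it into the model, and re-save it with save_pretrained so the from_pretrained path that already works can be used afterwards. The 'models/best_hf/' output directory is a hypothetical name.

import torch
from transformers import LlamaForCausalLM

# Load everything on CPU first; no accelerate dispatch involved here.
model = LlamaForCausalLM.from_pretrained('llama-7b/')
state_dict = torch.load('models/best/pytorch_model.bin', map_location='cpu')
model.load_state_dict(state_dict)

# Re-save in the save_pretrained layout so from_pretrained (with device_map) can load it later.
model.save_pretrained('models/best_hf/')  # hypothetical output path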

The snippet of code above does not match what you are telling me. Could you please share something I can reproduce?

sgugger avatar May 19 '23 11:05 sgugger

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jun 18 '23 15:06 github-actions[bot]