
load_checkpoint_and_dispatch compatibility with accelerate FSDP?

eric-mitchell opened this issue 2 years ago

System Info

- `Accelerate` version: 0.19.0
- Platform: Linux-5.15.0-41-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.24.3
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- System RAM: 1007.69 GB
- GPU type: NVIDIA A100-SXM4-80GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - use_cpu: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {'deepspeed_config_file': 'configs/deepspeed_defaults.json', 'zero3_init_flag': False}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

I am trying to train a large (30B+) model with Accelerate & FSDP on 8 GPUs. I understand that a model this large requires using a meta device to avoid instantiating the model 8 times (once per process) in CPU RAM.

However, I can't figure out the right way to use load_checkpoint_and_dispatch with accelerator.prepare. When I use load_checkpoint_and_dispatch, I seem to get a copy of the model (split across all GPUs) for each process, whereas I'd like each process to instantiate only its own FSDP shards. It feels like load_checkpoint_and_dispatch needs an accelerator argument, but one doesn't exist. The code below (using a small model for testing) shows the wrong behavior: two copies of the model are created (i.e., both GPUs use 24 GB of VRAM, not 12 GB).


import transformers
from accelerate import Accelerator
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.data import DataLoader, Dataset
import time
import os
import getpass
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from huggingface_hub import hf_hub_download
from transformers import AutoConfig, AutoModelForCausalLM


class DummyDataset(Dataset):
    def __getitem__(self, index):
        return {"input_ids": torch.randint(0, 100, (512,))}

    def __len__(self):
        return 100


def main():
    PER_DEVICE_BATCH_SIZE = 2

    checkpoint = "EleutherAI/gpt-j-6b"
    weights_location = hf_hub_download(checkpoint, "pytorch_model.bin")

    config = AutoConfig.from_pretrained(checkpoint)
    with init_empty_weights():
        print('building policy')
        policy = AutoModelForCausalLM.from_config(config)

    print('loading checkpoint')
    policy = load_checkpoint_and_dispatch(policy, weights_location, device_map='auto', no_split_module_classes=["GPTJBlock"])
    print('building dataloader')
    training_dataloader = DataLoader(DummyDataset(), batch_size=PER_DEVICE_BATCH_SIZE, shuffle=True)

    accelerator = Accelerator()
    policy, training_dataloader = accelerator.prepare(policy, training_dataloader)
    optimizer = torch.optim.RMSprop(policy.parameters(), lr=0.001)

    for idx, batch in enumerate(training_dataloader):
        start = time.time()
        loss = policy(**batch, labels=batch['input_ids']).loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        print(f'{idx} took {time.time() - start} seconds ({batch["input_ids"].shape}), {loss}, {loss.dtype}')

        test_generate_inputs = torch.arange(0, 10, dtype=torch.long).unsqueeze(0)
        test_generate_mask = torch.ones_like(test_generate_inputs)
        test_generate_inputs = test_generate_inputs.to(accelerator.device)
        test_generate_mask = test_generate_mask.to(accelerator.device)

        gen_start = time.time()
        with FSDP.summon_full_params(policy, writeback=False, recurse=False):
            test_generate_outputs = policy.generate(
                test_generate_inputs, attention_mask=test_generate_mask,
                do_sample=True, top_p=0.9, top_k=0, temperature=1.0, max_length=50, min_length=10, num_return_sequences=1, pad_token_id=50256)
        print('policy', test_generate_outputs.shape, time.time() - gen_start)

if __name__ == "__main__":
    main()

Expected behavior

I would expect load_checkpoint_and_dispatch to only load the shards for the current process, not the whole model. Since each process loads the whole model, we still have peak VRAM usage N times the model size (where N is the number of GPUs). Given that load_checkpoint_and_dispatch and FSDP are both tools for training large models, I was expecting a straightforward way to use them together, but I couldn't find one in the docs or issues. Perhaps I'm missing something, though!

Perhaps related to this issue. Tagging @sgugger who weighed in there.

eric-mitchell avatar May 30 '23 07:05 eric-mitchell

load_checkpoint_and_dispatch is a tool for inference, not for training. If you are using FSDP, you should let FSDP instantiate the model on several devices and split the tensors, since it does this differently from load_checkpoint_and_dispatch.

sgugger avatar May 30 '23 13:05 sgugger

Got it, that's what I thought, but I wasn't sure how to let FSDP instantiate the model on several devices when the initial build happens on a meta device and the weights come from a pre-trained checkpoint. Can you share a bit more detail on how this should work, or an example in the documentation? Thanks again.

eric-mitchell avatar May 30 '23 17:05 eric-mitchell

cc @pacman100 who would know more.

sgugger avatar May 30 '23 17:05 sgugger

@sgugger @pacman100 After some more experimentation, I think this almost gets the job done:

    with init_empty_weights():
        policy = transformers.AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-6.9b", cache_dir=get_cache_dir())

    def reset_parameters(self) -> None:
        pass  # dummy function for FSDP transfer to work

    for m in policy.modules():
        if m.__class__.__name__ in {'GPTNeoXLayer', 'GPTNeoXForCausalLM'}:
            # needed because PyTorch FSDP automatically calls reset_parameters for all modules on meta devices
            m.reset_parameters = reset_parameters.__get__(m, m.__class__)

    training_dataloader = DataLoader(DummyDataset(), batch_size=PER_DEVICE_BATCH_SIZE, shuffle=True)
    accelerator = Accelerator()
    policy, training_dataloader = accelerator.prepare(policy, training_dataloader)

We don't get a memory spike, and the model ends up correctly sharded across the GPUs, without duplicates. However, the pre-trained weights aren't actually loaded (which isn't really surprising, since the model was built on the meta device).

I assume someone has gotten this working, since FSDP has been used to train huge (1T param) models. Any thoughts on what's missing? Thanks!

eric-mitchell avatar May 31 '23 05:05 eric-mitchell

To update on this: I think memory-efficient FSDP initialization is possible with plain PyTorch FSDP by:

  • Loading the model parameters only on rank 0 (building the model on the meta device on the other ranks, e.g. with Accelerate's init_empty_weights())
  • Using a param_init_fn like module.to_empty(device=f'cuda:{rank}') when creating the FSDP module on non-zero ranks
  • Passing sync_module_states=True when creating the FSDP instance so the parameter values are synced from rank 0

Is something similar possible with Accelerate? (A rough sketch of the plain-PyTorch version is below.)
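
For concreteness, an untested sketch of what I mean with plain PyTorch FSDP. Assumptions: a single node whose process group is already initialized (e.g. by torchrun), the gpt-j checkpoint reused from the repro above, and build_fsdp_model as an illustrative name only.

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from accelerate import init_empty_weights
from transformers import AutoModelForCausalLM


def build_fsdp_model(checkpoint: str = "EleutherAI/gpt-j-6b") -> FSDP:
    rank = dist.get_rank()
    if rank == 0:
        # Only rank 0 pays the CPU RAM cost of materializing the full checkpoint.
        model = AutoModelForCausalLM.from_pretrained(checkpoint)
    else:
        # All other ranks build the model on the meta device: no real storage yet.
        with init_empty_weights():
            model = AutoModelForCausalLM.from_pretrained(checkpoint)

    return FSDP(
        model,
        device_id=torch.cuda.current_device(),
        # Non-zero ranks allocate uninitialized storage on their local GPU; the
        # actual values are then broadcast from rank 0 via sync_module_states.
        param_init_fn=None if rank == 0
        else (lambda module: module.to_empty(device=f"cuda:{rank}")),
        sync_module_states=True,
        # In practice you would also pass an auto_wrap_policy so the model is
        # split into multiple FSDP units rather than wrapped as one flat unit.
    )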

eric-mitchell avatar Jun 05 '23 18:06 eric-mitchell

Hello @eric-mitchell, this seems interesting. Please share a working version of this; when I tried it, the model wasn't learning at all.

At present, the way to reduce the CPU RAM footprint is to pass low_cpu_mem_usage=True when loading the pretrained model via from_pretrained.
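
For reference, a minimal sketch of that path, reusing the gpt-j checkpoint from the repro above:

from transformers import AutoModelForCausalLM

# low_cpu_mem_usage avoids holding both the randomly initialized weights and
# the full checkpoint state dict in CPU RAM at the same time.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6b",
    low_cpu_mem_usage=True,
)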

pacman100 avatar Jun 13 '23 19:06 pacman100

@pacman100 sorry for the slow reply. I don't actually have a working example; this was just a hypothesis inspired by a comment in the PyTorch source. I will look into this further.

eric-mitchell avatar Jun 22 '23 06:06 eric-mitchell

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 17 '23 15:07 github-actions[bot]

@pacman100 Sorry again for the delay. For an example of this approach and a discussion of some of its issues, see https://github.com/pytorch/pytorch/issues/104026 on the PyTorch GitHub issue tracker.

I also shared an alternative strategy that I can confirm works: create the model and move it to shared CPU memory once before spawning the worker processes, then let each process use the same underlying shared-memory weights (which are then sharded and moved to that process's CUDA device). A rough sketch of that setup is below.
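
Roughly, it looks like this (an untested sketch: train_worker is a hypothetical per-process function that would set up the process group, wrap the model with FSDP, and run the training loop):

import torch
import torch.multiprocessing as mp
from transformers import AutoModelForCausalLM


def train_worker(rank: int, world_size: int, model) -> None:
    # Hypothetical worker: initialize the process group for this rank, wrap
    # `model` with FSDP (which shards it and moves the shard to cuda:{rank}),
    # then run the training loop.
    ...


def main():
    # Build the model once in the parent process and move its parameters into
    # shared CPU memory, so the spawned workers reuse the same storage instead
    # of each re-loading the checkpoint.
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b")
    model.share_memory()

    world_size = torch.cuda.device_count()
    mp.spawn(train_worker, args=(world_size, model), nprocs=world_size, join=True)


if __name__ == "__main__":
    main()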

eric-mitchell avatar Jul 19 '23 20:07 eric-mitchell

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Aug 13 '23 15:08 github-actions[bot]