
Failing to load model using accelerate launch

TZeng20 opened this issue 1 year ago

Hi,

Without using accelerate launch, I am able to load flan-t5-xxl and use it for inference without a problem

model_name = "google/flan-t5-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map = 'auto')

However, when I launch a script like the one below with accelerate launch, I get a SIGKILL error when the model starts to load. Why can I load the model with from_pretrained on its own but not under accelerate launch? Is there some DeepSpeed configuration that I need to specify?

import argparse

import torch
from torch.utils.data import DataLoader
from datasets import Dataset
from transformers import AutoTokenizer, T5ForConditionalGeneration
from accelerate import Accelerator
from accelerate.utils import set_seed

def eval_function(args):
    set_seed(args.seed)
    accelerator = Accelerator()
    model_name = "google/flan-t5-xxl"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    print('Finished loading model!')

    # load data
    input_text = ['This is the first sentence', 'This is the second sentence']
    ds = Dataset.from_dict({'prompt': input_text})
    ds = ds.map(lambda x: tokenizer(x['prompt']), remove_columns=['prompt'])
    ds = ds.with_format('torch')
    dataloader = DataLoader(ds, batch_size=args.batch_size)

    # Prepare everything
    model, eval_dataloader = accelerator.prepare(model, dataloader)
    model.eval()
    # Dummy forward pass to check that the prepared model runs.
    dummy_input = "Are you human?"
    batch = tokenizer(dummy_input, return_tensors="pt").to(accelerator.device)
    labels = tokenizer("yes", return_tensors="pt").input_ids.to(accelerator.device)
    outputs = model(**batch, labels=labels)

    print('Beginning prediction!')
    for batch in eval_dataloader:
        input_ids = batch['input_ids']
        torch.cuda.empty_cache()
        outputs = accelerator.unwrap_model(model).generate(
            input_ids,
            min_length=100,
            max_length=200,
            num_beams=5,
            num_return_sequences=1,
            do_sample=True,
            top_p=0.9,
            no_repeat_ngram_size=1,
            remove_invalid_values=True,
            synced_gpus=True)
        print(tokenizer.decode(outputs[0]))
     
def main():
    parser = argparse.ArgumentParser(description="Simple example of evaluation script.")
    parser.add_argument("--seed", type = int, default = 42, help="Random seed to set")
    parser.add_argument("--batch_size", type = int, default = 1, help="batch size for inference")
    args = parser.parse_args()
    eval_function(args)

if __name__ == "__main__":
    main()

Accelerate config:

compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: true
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: T5Block
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

TZeng20 avatar May 17 '23 05:05 TZeng20

The machine you are using lacks the CPU RAM needed to load the model in all 4 processes at the same time (you basically need 4 times the model's size in RAM).

sgugger avatar May 17 '23 13:05 sgugger
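
For illustration, a rough back-of-the-envelope version of that estimate (numbers are approximate assumptions, not measurements):

# Rough CPU RAM estimate when every process calls from_pretrained itself.
# Illustrative numbers only: ~11B fp32 params is roughly a 45 GB checkpoint.
model_size_gb = 45
num_processes = 4  # one process per GPU under accelerate launch

cpu_ram_needed_gb = model_size_gb * num_processes
print(cpu_ram_needed_gb)  # 180 -- plus loading overhead, which explains the SIGKILL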

I see, thanks. To load a flan-t5-xxl model (11B params, ~45 GB) on 4 processes, I would need roughly 180 GB of RAM?

Just to clarify, does num_processes have to equal the number of GPUs? Currently the machine I'm using has 4x 24 GB GPUs and 192 GB of RAM. If I set num_processes to 1, will accelerate still use all 4 GPUs? If I understand correctly, using from_pretrained in a notebook is just 1 process but utilising all 4 GPUs.

Also, if I don't use FSDP (distributed_type: MULTI_GPU), is that basically just PyTorch DDP?

TZeng20 avatar May 18 '23 01:05 TZeng20

Yes, for DDP you need 4x the size of the model in CPU RAM if you have 4 GPUs. Note that training with Adam usually requires 4x the size of the model in GPU RAM (1x for the model, 1x for the gradients and 2x for the optimizer states), so with a 24GB GPU you can only fit a 6GB model, i.e. roughly a 1.5B-parameter model.

For flan-t5-xxl you will need to use DeepSpeed or FSDP to split your model across GPUs (which will also remove the need for the 4x amount of CPU RAM), so I think you really need to explore those two avenues.

sgugger avatar May 18 '23 13:05 sgugger
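
To make the 4x training rule of thumb above concrete, a quick sketch of the arithmetic for fp32 training with Adam (purely illustrative):

# fp32 training with Adam: roughly 4 copies of the model's size in GPU memory.
params_billion = 1.5
bytes_per_param = 4                          # fp32
model_gb = params_billion * bytes_per_param  # ~6 GB of weights
grads_gb = model_gb                          # 1x for gradients
optim_gb = 2 * model_gb                      # 2x for Adam's exp_avg / exp_avg_sq states
print(model_gb + grads_gb + optim_gb)        # ~24 -- about one 24 GB GPU for a 1.5B model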

In the code above, I am using FSDP but still getting the SIGKILL error. So is this still due to CPU RAM? Can I use DeepSpeed in addition to FSDP?

TZeng20 avatar May 19 '23 00:05 TZeng20

The sharding strategy is optimizer only though, from what I see in your training arguments. You need to shard the model as well (cc @pacman100, who will know more on FSDP).

sgugger avatar May 19 '23 11:05 sgugger
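
As a side note on the config above: fsdp_sharding_strategy appears to map directly onto PyTorch's ShardingStrategy enum, so a minimal check of what the integer values mean (values per torch.distributed.fsdp; how accelerate interprets the field is an assumption here):

from torch.distributed.fsdp import ShardingStrategy

print(ShardingStrategy.FULL_SHARD.value)     # 1 -> shard parameters, gradients and optimizer states
print(ShardingStrategy.SHARD_GRAD_OP.value)  # 2 -> shard gradients and optimizer states only
print(ShardingStrategy.NO_SHARD.value)       # 3 -> replicate everything (DDP-like)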

Hello @TZeng20, the SIGKILL error happens because from_pretrained loads the full state_dict into CPU RAM in each of the 4 processes.

See https://github.com/huggingface/accelerate/issues/1488 and https://github.com/huggingface/accelerate/issues/1214.

pacman100 avatar Jun 13 '23 19:06 pacman100
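
A minimal sketch of the kind of workaround discussed in those issues: materialize the weights only on rank 0 and build an empty (meta-device) model on the other ranks, so the checkpoint is not duplicated in CPU RAM once per process. This assumes an accelerate/FSDP setup that syncs module states from rank 0 (sync_module_states enabled); treat it as an illustration, not a drop-in fix:

from accelerate import Accelerator, init_empty_weights
from transformers import AutoConfig, T5ForConditionalGeneration

accelerator = Accelerator()
model_name = "google/flan-t5-xxl"

if accelerator.is_main_process:
    # Only rank 0 loads the real weights into CPU RAM.
    model = T5ForConditionalGeneration.from_pretrained(model_name)
else:
    # Other ranks build the architecture on the meta device (no memory allocated);
    # FSDP can then broadcast the real weights from rank 0 when module states are synced.
    config = AutoConfig.from_pretrained(model_name)
    with init_empty_weights():
        model = T5ForConditionalGeneration(config)

model = accelerator.prepare(model)

Even without that, passing low_cpu_mem_usage=True to from_pretrained lowers the peak CPU memory during loading, although each process still ends up holding a full copy of the weights.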

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 08 '23 15:07 github-actions[bot]

Not stale. I hit the same issue. It comes from from_pretrained, which needs N (number of GPUs) x model size in CPU memory. It happens whatever settings you give accelerate, because the model is loaded before accelerate starts to distribute it. This needs to be fixed.

However, if you launch the code with plain python rather than torchrun or accelerate launch, device_map='auto' can shard the model across the different GPUs.

fe1ixxu avatar Aug 12 '23 22:08 fe1ixxu
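
For completeness, a small sketch of the single-process route mentioned above: launch with plain python and let device_map="auto" spread the layers over the available GPUs (the generation arguments here are arbitrary examples):

from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "google/flan-t5-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# One process: the checkpoint is loaded once and sharded layer-by-layer across the GPUs.
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")

inputs = tokenizer("Are you human?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))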

Same issue; so it is not solved yet?

Yan2013 avatar Jan 12 '24 13:01 Yan2013