Failing to load model using accelerate launch
Hi,
Without using accelerate launch, I am able to load flan-t5-xxl and use it for inference without a problem:
model_name = "google/flan-t5-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map = 'auto')
However, when I use an accelerate script like the one below, I get a SIGKILL error when the model starts to load. Why is it that I can load the model with from_pretrained, but not when using accelerate launch? Is there some DeepSpeed configuration that I need to specify?
import argparse
import os

from transformers import T5Tokenizer, T5ForConditionalGeneration, AutoTokenizer, set_seed
from accelerate import Accelerator, DistributedType
from accelerate.utils import set_seed
import torch
from datasets import load_dataset, Dataset
from torch.utils.data import DataLoader


def eval_function(args):
    set_seed(args.seed)
    accelerator = Accelerator()

    model_name = "google/flan-t5-xxl"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    print('Finished loading model!')

    # load data
    input_text = ['This is the first sentence', 'This is the second sentence']
    ds = Dataset.from_dict({'prompt': input_text})
    ds = ds.map(lambda x: tokenizer(x['prompt']), remove_columns=['prompt'])
    ds = ds.with_format('torch')
    dataloader = DataLoader(ds, batch_size=1)

    # Prepare everything
    model, eval_dataloader = accelerator.prepare(model, dataloader)
    model.eval()

    dummy_input = "Are you human?"
    batch = tokenizer(dummy_input, return_tensors="pt").to(accelerator.device)
    labels = tokenizer("yes", return_tensors="pt").input_ids.to(accelerator.device)
    outputs = model(**batch, labels=labels)

    print('Beginning prediction!')
    for batch in eval_dataloader:
        input_ids = batch['input_ids']
        torch.cuda.empty_cache()
        outputs = accelerator.unwrap_model(model).generate(
            input_ids,
            min_length=100,
            max_length=200,
            num_beams=5,
            num_return_sequences=1,
            do_sample=True,
            top_p=0.9,
            no_repeat_ngram_size=1,
            remove_invalid_values=True,
            synced_gpus=True)
        print(tokenizer.decode(outputs[0]))


def main():
    parser = argparse.ArgumentParser(description="Simple example of evaluation script.")
    parser.add_argument("--seed", type=int, default=42, help="Random seed to set")
    parser.add_argument("--batch_size", type=int, default=1, help="batch size for inference")
    args = parser.parse_args()
    eval_function(args)


if __name__ == "__main__":
    main()
Accelerate config:
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: true
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: T5Block
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
The machine you are using lacks the RAM needed to load the model in all 4 processes at the same time (you basically need 4x the model size in RAM).
I see, thanks. To load a flan-t5-xxl model (11B params, ~45 GB) on 4 processes, I would need roughly 180 GB of RAM?
Just to clarify, does num_processes have to equal the number of GPUs? Currently the machine I'm using has 4x 24 GB GPUs and 192 GB of RAM. If I set num_processes to 1, will accelerate still use all 4 GPUs? If I understand correctly, using from_pretrained in a notebook is just 1 process but utilising all 4 GPUs.
Also, if I don't use FSDP (distributed_type = MULTI_GPU), is that basically just PyTorch DDP?
Yes, for DDP you need 4x the size of the model in CPU RAM if you have 4 GPUs. Note that training with Adam usually requires 4x the size of the model in GPU RAM (1x for the model, 1x for the gradients and 2x for the optimizer states), so with a 24 GB GPU you can only fit a ~6 GB model, i.e. a ~1.5B-parameter model.
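To put rough numbers on this, here is a back-of-envelope sketch of my own (assuming fp32 weights at 4 bytes per parameter; real usage adds activations, buffers and framework overhead):

# Rough memory estimate, assuming fp32 weights (4 bytes per parameter).
params_billion = 11            # flan-t5-xxl has roughly 11B parameters
bytes_per_param = 4            # fp32
model_gb = params_billion * 1e9 * bytes_per_param / 1e9   # ~44 GB of weights

num_processes = 4
ddp_cpu_ram_gb = model_gb * num_processes   # one full copy per DDP process at load time
adam_gpu_ram_gb = model_gb * 4              # weights + gradients + 2x Adam optimizer states

print(f"model: ~{model_gb:.0f} GB, DDP CPU RAM: ~{ddp_cpu_ram_gb:.0f} GB, "
      f"Adam training GPU RAM: ~{adam_gpu_ram_gb:.0f} GB per replica")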
For flan-t5-xxl you will need to use DeepSpeed or FSDP to split the model across GPUs (which will also remove the need for 4x the amount of CPU RAM), so I think you really need to explore those two avenues.
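For example, switching the accelerate config to DeepSpeed ZeRO-3 could look roughly like the sketch below. This is my own hedged example, not a config tested on this machine; the offload settings are assumptions to adapt. Stage 3 shards the parameters themselves, and zero3_init_flag lets the weights be constructed directly in sharded form:

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu   # assumption: offload to CPU to fit on 4x 24 GB GPUs
  offload_param_device: cpu       # assumption
  zero3_init_flag: true           # construct the model directly in sharded form
  zero_stage: 3                   # shard parameters, gradients and optimizer states
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
use_cpu: false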
In the code above I am using FSDP but still getting the SIGKILL error. So is this still due to CPU RAM? Can I use DeepSpeed in addition to FSDP?
The sharding strategy is optimizer-only though, from what I see in your training arguments. You need to shard the model as well (cc @pacman100 who will know more on FSDP).
Hello @TZeng20, the SIGKILL error happens because from_pretrained loads the state_dict into CPU RAM for each of the 4 processes.
See https://github.com/huggingface/accelerate/issues/1488 and https://github.com/huggingface/accelerate/issues/1214
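As a partial mitigation (a sketch, not the full fix discussed in those issues: each process still ends up with one copy of the weights before sharding), from_pretrained can at least skip the extra state_dict copy with low_cpu_mem_usage, and optionally load in bf16:

import torch
from transformers import T5ForConditionalGeneration

model_name = "google/flan-t5-xxl"

# low_cpu_mem_usage=True loads weights directly into the model instead of first
# building a separate in-memory state_dict, roughly halving peak CPU RAM per process.
# Each launched process still holds ~1x the model until FSDP/DeepSpeed shards it.
model = T5ForConditionalGeneration.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,  # assumption: bf16 weights are acceptable for inference
)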
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Not stale. I hit the same issue. The issue comes from from_pretrained, which needs N (number of GPUs) * model size in CPU memory. It always happens whatever settings you give to accelerate, because the model is loaded before accelerate starts to distribute it. This needs to be fixed.
However, if you launch the code with plain python rather than torchrun or accelerate launch, device_map can shard the model across the different GPUs.
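For completeness, that plain-python path looks like the sketch below (my own example; the hf_device_map attribute shows how the modules were spread across the 4 GPUs):

from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "google/flan-t5-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Run with `python script.py` (a single process): device_map="auto" lets
# accelerate's big-model inference spread the layers over all visible GPUs.
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
print(model.hf_device_map)  # module name -> device placement

# The embedding/encoder front usually lands on GPU 0, so inputs go there;
# accelerate's hooks move activations between devices during the forward pass.
inputs = tokenizer("Are you human?", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))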
Same issue; so this is not solved yet?