Model Parallelism and accelerate's usage of DDP aren't compatible
System Info
- `Accelerate` version: 0.18.0
- Platform: Linux-5.4.0-124-generic-x86_64-with-glibc2.31
- Python version: 3.9.12
- PyTorch version (GPU?): 1.12.0 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- num_processes: 16
- machine_rank: 0
- num_machines: 16
- main_process_ip: 192.168.1.1
- main_process_port: 8080
- rdzv_backend: static
- same_network: False
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- [X] My own task or dataset (give details below)
Reproduction
If I use model parallelism (for example via Hugging Face's parallelize()) and I'm using accelerate in a standard multi-GPU environment (which uses DDP), then when I prepare the model I get the following error:
File "/private/home/raileanu/new-rlvsil/rlvsil/experiment_accel.py", line 689, in main
model, optimizer, lr_scheduler, *prepared_dataloaders = accelerator.prepare(
File "/private/home/raileanu/.conda/envs/rob/lib/python3.9/site-packages/accelerate/accelerator.py", line 1122, in prepare
result = tuple(
File "/private/home/raileanu/.conda/envs/rob/lib/python3.9/site-packages/accelerate/accelerator.py", line 1123, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/private/home/raileanu/.conda/envs/rob/lib/python3.9/site-packages/accelerate/accelerator.py", line 977, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/private/home/raileanu/.conda/envs/rob/lib/python3.9/site-packages/accelerate/accelerator.py", line 1202, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/private/home/raileanu/.conda/envs/rob/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 571, in __init__
self._log_and_throw(
File "/private/home/raileanu/.conda/envs/rob/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 674, in _log_and_throw
raise err_type(err_msg)
ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cuda', index=0), device(type='cuda', index=1)}.
I think this is because in https://github.com/huggingface/accelerate/blob/2708c1ae31f5c32a0715780c2244c8f24ba1cfc3/src/accelerate/accelerator.py#L1222 the DDP model is initialised with device_ids and output_device set, whereas both should be None when using model parallelism.
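For illustration, here's a minimal standalone sketch of that PyTorch behaviour (the toy two-device module is made up for this example, and it assumes the process group has been initialised, e.g. via torchrun, with at least two GPUs visible to this process):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


class TwoDeviceModule(torch.nn.Module):
    """Toy module split across cuda:0 and cuda:1, mimicking model.parallelize()."""

    def __init__(self):
        super().__init__()
        self.first = torch.nn.Linear(16, 16).to("cuda:0")
        self.second = torch.nn.Linear(16, 4).to("cuda:1")

    def forward(self, x):
        return self.second(self.first(x.to("cuda:0")).to("cuda:1"))


dist.init_process_group(backend="nccl")  # env vars set by torchrun
model = TwoDeviceModule()

# Correct for a multi-device module: let DDP infer the devices.
ddp_model = DDP(model)

# Passing device_ids/output_device instead reproduces the ValueError shown above:
# ddp_model = DDP(model, device_ids=[0], output_device=0)
```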
You should be able to reproduce this on a 4-gpu machine with something like the following:
from transformers import GPTJForCausalLM
import accelerate
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
device_map = {
    0: [0, 1, 2, 3, 4, 5, 6],
    1: [7, 8, 9, 10, 11, 12, 13],
    2: [14, 15, 16, 17, 18, 19, 20],
    3: [21, 22, 23, 24, 25, 26, 27],
}
model.parallelize(device_map)
accelerator = accelerate.Accelerator()
model = accelerator.prepare(model)
I'm currently getting around this by wrapping the model in DDP myself with the correct arguments, and then doing accelerator._models.append(model).
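Concretely, my workaround looks roughly like this (a sketch: it assumes the model has already been split across this process's GPUs, e.g. via model.parallelize(), that the optimizer/scheduler/dataloader variables already exist, and it relies on the private _models attribute):

```python
import torch
import accelerate

accelerator = accelerate.Accelerator()

# Wrap the multi-device model ourselves, leaving device_ids/output_device unset
# (i.e. None), which is what DDP requires for a model-parallel module.
ddp_model = torch.nn.parallel.DistributedDataParallel(model)

# Register the wrapped model via accelerate's internal (non-public) list so that
# later accelerator calls still see it, then prepare the remaining objects.
accelerator._models.append(ddp_model)
optimizer, lr_scheduler, train_dataloader = accelerator.prepare(
    optimizer, lr_scheduler, train_dataloader
)
```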
Expected behavior
I'd expect accelerate's usage of DDP to be compatible with naïve model parallelism, as DDP is compatible with it.
I think the fix would be to adjust https://github.com/huggingface/accelerate/blob/2708c1ae31f5c32a0715780c2244c8f24ba1cfc3/src/accelerate/accelerator.py#L1222 so that if the model has parameters on multiple devices, or the hf_device_map uses multiple devices (or maybe the user passes an explicit parameter saying they're using model parallelism), the DDP initialisation doesn't set device_ids and output_device. I'd be happy to submit a PR to make that change if that seems reasonable.
Yes, Accelerate does not support DDP with model parallelism. I'm not sure your proposed fix would work, as DDP will all-reduce the gradients across GPUs, but here the GPUs don't all hold the same parameters.
For the pipeline parallelism you are trying to achieve, use FSDP or DeepSpeed.
If you make sure each accelerate process gets multiple GPUs, then I think DDP will work as expected: if you have 1 accelerate process, and hence 1 DDP model, per 4 GPUs (for example), then you should get the correct synchronisation. That's what this tutorial implies, for example.
In my setup, preparing the model separately does work as intended, with 4 accelerate processes with 3 GPUs each and the model layers split across those 3 GPUs. I'm launching each of those processes with a separate call, however, so I'm uncertain how you'd do it with a single call on a multi-GPU machine if you wanted 4 processes with 2 GPUs each rather than 8 processes with 1 GPU each.
I'll look into trying to get FSDP to work in my set up as well.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@sgugger I need your guidance. I want to use
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map={'': torch.cuda.current_device()}, trust_remote_code=True
)
to train a 40B model, but I also want DDP. How should I achieve that? Thanks
You can use DDP if your model is only on one device like this.
@sgugger Thanks for your fast help. But what if the model is too big for one GPU device?
Then you cannot use DDP + device_map="auto". You need to use DeepSpeed or FSDP.
I feel like you are not listening. You cannot use DDP + device_map="auto", and thus not DDP + device_map="auto" + DeepSpeed either. You need to just use DeepSpeed ZeRO-3 to shard your model on several devices and train it with a mix of model parallelism and data parallelism.
> I feel like you are not listening. You cannot use DDP + device_map="auto", and thus not DDP + device_map="auto" + DeepSpeed either. You need to just use DeepSpeed ZeRO-3 to shard your model on several devices and train it with a mix of model parallelism and data parallelism.
Sorry for my misunderstanding, I got your point now
@sgugger Just to make sure my understanding is correct: can we use deepspeed support with the Trainer API to do model + data parallelism (without setting device_map), or do we have to write code with pure deepspeed, without HF transformers, to load the model? Sorry if my question is repetitive.
Yes, as long as you properly configure DeepSpeed ZeRO-3 you won't need to use device_map="auto", and the model will be loaded on several GPUs (each weight will be split).
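Roughly, that looks like the sketch below (placeholder model name and dataset; the config values are a starting point, not a tuned recipe):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

# ZeRO stage 3 shards parameters, gradients and optimizer states across GPUs,
# so no device_map is needed when loading the model.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    bf16=True,
    deepspeed=ds_config,  # a dict or a path to a JSON config file both work
)

model = AutoModelForCausalLM.from_pretrained("my-org/my-large-model")  # no device_map
tokenizer = AutoTokenizer.from_pretrained("my-org/my-large-model")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=my_train_dataset,  # placeholder dataset
    tokenizer=tokenizer,
)
trainer.train()
```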
Just to document my experience on getting DDP + MP (2x2 on 4 gpus) to work with Accelerate (via HF trainer):
I modified the current main branch to initialize the DDP model by setting device_ids and output_device to None, as described in the pytorch docs when using multi-device modules. Additionally, I had to remove some ValueErrors that are being raised (for no good reason?).
I launch two processes with torchrun, each of which is supposed to use 2 GPUs. Each process uses a different device map to load the model (llama-2-7b). Process 1 uses GPUs 0 and 2, and process 2 uses GPUs 1 and 3. Example device map for process 1:
{'model.embed_tokens': 0, 'model.norm': 2, 'lm_head': 2, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0 , 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 2, 'model.layers.17': 2, 'model.layers.18': 2, 'model.layers.19': 2, 'model.layers.20': 2, 'model.layers.21': 2, 'model.layers.22': 2, 'model.layers.23': 2, 'model.layers.24': 2, 'model.layers.25': 2, 'model.layers.26': 2, 'model.layers.27': 2, 'model.layers.28': 2, 'model.layers.29': 2, 'model.layers.30': 2, 'model.layers.31': 2}
I set training_args.place_model_on_device=False, as the model is already placed on devices:
model = transformers.LlamaForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    torch_dtype=torch.bfloat16 if training_args.bf16 else torch.float32,
    device_map=device_map
)
This works for me and produces exactly the same loss curves compared to using only DDP or only MP.
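For reference, the per-process device map above can be built from LOCAL_RANK along these lines (a sketch: the GPU pairing, rank 0 -> GPUs 0 and 2 and rank 1 -> GPUs 1 and 3, and the half/half layer split are assumptions mirroring the setup above, not something accelerate computes for you):

```python
import os

NUM_LAYERS = 32  # llama-2-7b


def build_device_map(local_rank: int) -> dict:
    """Split the model across the two GPUs owned by this process."""
    first_gpu = local_rank        # rank 0 -> GPU 0, rank 1 -> GPU 1
    second_gpu = local_rank + 2   # rank 0 -> GPU 2, rank 1 -> GPU 3
    device_map = {
        "model.embed_tokens": first_gpu,
        "model.norm": second_gpu,
        "lm_head": second_gpu,
    }
    # First half of the layers on the first GPU, second half on the second GPU.
    for layer in range(NUM_LAYERS):
        device_map[f"model.layers.{layer}"] = (
            first_gpu if layer < NUM_LAYERS // 2 else second_gpu
        )
    return device_map


device_map = build_device_map(int(os.environ.get("LOCAL_RANK", 0)))
```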
@maxidl can you share your modified code? Curious what those exceptions are that exist for "no good reason"
> @maxidl can you share your modified code? Curious what those exceptions are that exist for "no good reason"
@muellerzr I do think these errors are necessary if one does not also modify the DDP construction. In fact, they are correct if one has created the device_map with 'auto'. However, the errors get triggered even when using a custom device_map.
You can find my fork here: https://github.com/maxidl/accelerate/commit/332d960d625deda76090c32a6e67dee70be76761
Also, note that I did not run any tests and check whether this breaks other behavior.
Now, why do I think it is useful to have DDP + MP (in the classic pipeline-of-layers way): in my case, I am running on GPUs without a fast interconnect (NVLink), which makes FSDP-style training very slow.
Thanks @maxidl, as an approach here's what the team has decided we will do:
- I'll put a PR in today that lets you explicitly disable the blocking behavior, and will set it to None as you have shown in your example.
- We'll keep this issue open, and I ask that the community react with a 👍 to this message if you wind up using this. We want to see what kind of usage folks are having with this, and when we can turn it from a "power-user" feature into something more folks are using.
- Long term we'll see how to enable these kind of native TP trainings directly with accelerate + proper config, once we get a decent amount of folks wanting this.
Seem reasonable @maxidl? And thank you for this reproducer!
Sure, that sounds great. Once the changes are in (no rush with that), I might create a tutorial-style GitHub repo for it and do some benchmarking, to be shared via Twitter (sorry, "X" ....).
@muellerzr Sorry to bring this up again, but is it possible to add this functionality as a proper feature? The background is that we want to tune a 70B or 8x7B model as a teacher. We tried FSDP, but lots of features are not supported in FSDP (DeepSpeed is even worse), e.g. nested quantization and sliding window attention, so the final total memory saving is actually not that much.
The following is my testing code. Basically, each node has 8 A100 80GB GPUs, and each training process takes 2 GPUs. (Btw, I haven't checked @maxidl's approach yet, but will have a look for sure.)
def create_and_prepare_model():
    compute_dtype = getattr(torch, "float16")
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        # load_in_8bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
    )
    device_map = {
        'model.embed_tokens': 0,
        'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0,
        'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0,
        'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 0, 'model.layers.17': 0,
        'model.layers.18': 0, 'model.layers.19': 0, 'model.layers.20': 0, 'model.layers.21': 0, 'model.layers.22': 0, 'model.layers.23': 0,
        'model.layers.24': 0, 'model.layers.25': 0, 'model.layers.26': 0, 'model.layers.27': 0, 'model.layers.28': 0, 'model.layers.29': 0,
        'model.layers.30': 0, 'model.layers.31': 0, 'model.layers.32': 0, 'model.layers.33': 0, 'model.layers.34': 0, 'model.layers.35': 0,
        'model.layers.36': 1, 'model.layers.37': 1, 'model.layers.38': 1, 'model.layers.39': 1, 'model.layers.40': 1, 'model.layers.41': 1,
        'model.layers.42': 1, 'model.layers.43': 1, 'model.layers.44': 1, 'model.layers.45': 1, 'model.layers.46': 1, 'model.layers.47': 1,
        'model.layers.48': 1, 'model.layers.49': 1, 'model.layers.50': 1, 'model.layers.51': 1, 'model.layers.52': 1, 'model.layers.53': 1,
        'model.layers.54': 1, 'model.layers.55': 1, 'model.layers.56': 1, 'model.layers.57': 1, 'model.layers.58': 1, 'model.layers.59': 1,
        'model.layers.60': 1, 'model.layers.61': 1, 'model.layers.62': 1, 'model.layers.63': 1, 'model.layers.64': 1, 'model.layers.65': 1,
        'model.layers.66': 1, 'model.layers.67': 1, 'model.layers.68': 1, 'model.layers.69': 1, 'model.layers.70': 1, 'model.layers.71': 1,
        'model.layers.72': 1, 'model.layers.73': 1, 'model.layers.74': 1, 'model.layers.75': 1, 'model.layers.76': 1, 'model.layers.77': 1,
        'model.layers.78': 1, 'model.layers.79': 1,
        'model.norm': 1, 'lm_head': 1,
    }
    device1 = torch.cuda.current_device()
    device2 = device1 + 2
    for k, v in device_map.items():
        device_map[k] = device1 if v == 0 else device2
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, trust_remote_code=True,
        # device_map='auto',
        device_map=device_map,
        # device_map={'': torch.cuda.current_device()},
        use_flash_attention_2=True
    )
    print(model)
    print(model.hf_device_map)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
    if '70b' in model_from:
        model.config.max_position_embeddings = 4096
    peft_config = LoraConfig(
        lora_alpha=alpha,
        lora_dropout=0.1,
        r=rank,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=target_modules
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    # for llama family
    # tokenizer.padding_side = 'right'
    # # for mistral family
    # tokenizer.padding_side = 'left'
    return model, peft_config, tokenizer
# import multiprocessing
# NUM_PROC = multiprocessing.cpu_count() #should be num cpus
save_dir = "/sensei-fs/tenants/Sensei-AdobeSearch/CreativeLLM/zhangli/cpt-hf/runai_experiment/"
training_arguments = TrainingArguments(
    output_dir=os.path.join(save_dir, run_name),
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=accumlate_steps,
    optim="paged_adamw_8bit",
    # Can't resume training: Error invalid argument at line 393 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/pythonInterface.c
    # optim="paged_adamw_8bit",
    save_steps=500,
    logging_steps=10,
    learning_rate=lr,
    fp16=True,
    max_grad_norm=0.3,
    num_train_epochs=100,
    warmup_ratio=0.03,
    # group_by_length=True,
    lr_scheduler_type="constant",
    run_name=run_name,
    evaluation_strategy="steps",
    eval_steps=200,
    ddp_find_unused_parameters=False,
    gradient_checkpointing=True,
    # weight_decay=0.01,
    # dataloader_num_workers=NUM_PROC//2
)
model, peft_config, tokenizer = create_and_prepare_model()
model.config.use_cache = False # because of gradient checkpointing
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=True,
)
trainer.train()
Btw, besides this, are there any other memory optimization approaches I can take?
@muellerzr @maxidl, because I loaded the model in 4-bit, I also commented out this line: https://github.com/maxidl/accelerate/blob/332d960d625deda76090c32a6e67dee70be76761/src/accelerate/accelerator.py#L1342
I don't know whether that has any bad effects; it starts to train at least. Could you please elaborate on the potential consequences?