
"You can't train a model that has been loaded with device_map='auto' in any distributed mode" error when running on multi-GPU VM

Open tom-ph opened this issue 1 year ago • 2 comments

System Info

- `Accelerate` version: 0.27.2
- Platform: Linux-5.15.0-1054-azure-x86_64-with-glibc2.35
- Python version: 3.10.12
- Numpy version: 1.21.5
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 429.87 GB
- GPU type: Tesla V100-PCIE-16GB
- `Accelerate` default config:
	Not found

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

I'm trying to fine-tune Mixtral with peft on a Databricks single node with 4 GPUs. The model doesn't fit on a single GPU, so I would like to use naive pipeline parallelism (Naive PP) as explained in PR https://github.com/huggingface/accelerate/pull/1523. Below is an example:

from transformers import MixtralForSequenceClassification, TrainingArguments, Trainer
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from accelerate import Accelerator, DistributedType

# Load the model sharded across the 4 GPUs via device_map='auto' (Naive PP)
model = MixtralForSequenceClassification.from_pretrained("/my_model_path", num_labels=1, device_map='auto')
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type=TaskType.SEQ_CLS,
)
lora_model = prepare_model_for_kbit_training(model)
lora_model = get_peft_model(lora_model, peft_config)
accelerator = Accelerator()
lora_model = accelerator.prepare(lora_model)  # this call raises the error below
lora_model.print_trainable_parameters()

And the error:

ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`

Expected behavior

The problem is that I don't understand how to force Naive PP. There must be a way to force DistributedType to NO in the Accelerate configuration from Python code, but I couldn't find it.
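
For clarity, this is roughly what I expect Naive PP to look like at runtime, as a small check run right after the snippet above (as far as I understand, hf_device_map is the placement transformers records when loading with device_map='auto'):

from accelerate import Accelerator, DistributedType

accelerator = Accelerator()

# What I would expect for Naive PP: a single process, with the model
# itself sharded across the 4 GPUs by device_map='auto'.
print(accelerator.distributed_type)  # hoping for DistributedType.NO
print(accelerator.num_processes)     # hoping for 1
print(model.hf_device_map)           # layers spread over the GPUs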

tom-ph avatar Feb 25 '24 16:02 tom-ph

Launch the code using `python`, as the error message says, not `accelerate launch`.

muellerzr avatar Feb 25 '24 17:02 muellerzr

Since I'm running the code from a Databricks notebook (similar to a Jupyter or Colab notebook), I'm not actively doing either. I also doubt Databricks is calling `accelerate launch`, although I'm not sure. Is there a way to configure the Accelerator from inside the script, for example by specifying num_processes or any other needed setting, or by pointing to a custom accelerate config file?
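
For example, would something like this be a supported way to force single-process (DistributedType.NO) mode from inside the notebook? This is only a sketch of what I have in mind; I'm guessing the Databricks runtime exports the usual torch.distributed launch variables, and I don't know whether unsetting them before building the Accelerator is safe or supported:

import os

from accelerate import Accelerator, DistributedType

# Hypothetical workaround: drop any distributed-launch variables the
# platform may have set, so that Accelerate falls back to a single
# non-distributed process.
for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    os.environ.pop(var, None)

accelerator = Accelerator()
assert accelerator.distributed_type == DistributedType.NO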

tom-ph avatar Feb 25 '24 18:02 tom-ph

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Mar 27 '24 15:03 github-actions[bot]