ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices.

Open akkikiki opened this issue 2 years ago • 6 comments

@younesbelkada (Thanks again for developing these great libraries and responding on GitHub!)

Related issue: https://github.com/huggingface/accelerate/issues/1412

With bleeding-edge transformers, I cannot combine PEFT and accelerate to do parameter-efficient fine-tuning with naive pipeline parallelism (i.e., splitting a model loaded in 8-bit across multiple GPUs). Do PEFT and accelerate not support this use case? The code works on an earlier transformers version, so I am wondering what changed.

  File "/home/ec2-user/.local/lib/python3.7/site-packages/transformers/trainer.py", line 1665, in train
    ignore_keys_for_eval=ignore_keys_for_eval,
  File "/home/ec2-user/.local/lib/python3.7/site-packages/transformers/trainer.py", line 1768, in _inner_training_loop
    self.model, self.optimizer, self.lr_scheduler
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/accelerator.py", line 1144, in prepare
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/accelerator.py", line 1144, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/accelerator.py", line 995, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/accelerator.py", line 1201, in prepare_model
    "You can't train a model that has been loaded in 8-bit precision on multiple devices."
ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices.

Here is the relevant subset of the pip3 list output showing the package versions:

Package                  Version
------------------------ -----------
accelerate               0.19.0
transformers             4.30.0.dev0
peft                     0.3.0

akkikiki avatar Jun 02 '23 20:06 akkikiki

Same error. Did you solve it?

dylanwwang avatar Jun 05 '23 09:06 dylanwwang

Hi @akkikiki, thanks so much for your kind words and the report. I dug into the problem and it appears it was my mistake: I forgot to add an extra check. NPP should not be supported under any distributed regime by definition, as the NPP paradigm is purely sequential (i.e., it should be run just with python xxxx.py). https://github.com/huggingface/accelerate/pull/1523 should hopefully fix the issue. Does my explanation make sense? Please let me know if you have any questions.

younesbelkada avatar Jun 05 '23 10:06 younesbelkada

Are multiple devices and sequential execution not in conflict? The problem I am encountering now is that accelerate does not support quantized models running on multiple devices, but using just a single GPU will OOM.

dylanwwang avatar Jun 05 '23 11:06 dylanwwang

What I meant by sequential is that the activations and gradients are passed from one GPU to another sequentially, one by one. In that case I don't see why these are in conflict, as long as the other GPUs are kept idle while the active one is computing the gradients and activations.

If you use PEFT to train your model and load it across multiple GPUs, with #1523 it should be possible.

younesbelkada avatar Jun 05 '23 11:06 younesbelkada
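
For illustration, here is a minimal sketch of the NPP setup described above: the model is loaded in 8-bit and sharded across all visible GPUs with device_map="auto", a LoRA adapter is attached with PEFT, and the script is launched with a plain python train.py (no torchrun or accelerate launch). The model name and LoRA hyperparameters below are placeholders, not values from this thread.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "facebook/opt-1.3b"  # placeholder model

# device_map="auto" shards the layers across every visible GPU (naive pipeline parallelism)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_8bit=True,
)
model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

print(model.hf_device_map)  # shows which GPU each transformer block was placed on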

Thank you so much @younesbelkada! Yes, (at least currently) I am NOT looking for distributed training (e.g., distributed data parallel through torchrun) when load_in_8bit (or 4-bit) is turned on. Only NPP. Looking forward to https://github.com/huggingface/accelerate/pull/1523 being merged!

@dylanwwang https://github.com/huggingface/accelerate/pull/1523 should solve your error too :)

akkikiki avatar Jun 05 '23 16:06 akkikiki

Thanks a lot @akkikiki !!

younesbelkada avatar Jun 05 '23 17:06 younesbelkada

Same errors here.

kevinuserdd avatar Jun 06 '23 07:06 kevinuserdd

This should be fixed if you uninstall accelerate and re-install it from source

younesbelkada avatar Jun 06 '23 12:06 younesbelkada

Hi, thanks for your great work. I still find some problems with the fixed version.

If I set load_in_4bit to True (not load_in_8bit; 8-bit works fine), the code still cannot run.

from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "facebook/opt-350m"
accelerator = Accelerator()

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Load in 4-bit and let accelerate shard the model across the available GPUs
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_4bit=True)
model = prepare_model_for_kbit_training(model)

print(set(model.hf_device_map.values()))  # devices the model was sharded onto

model = get_peft_model(model, config)

# the reported ValueError is raised here
model = accelerator.prepare(model)

zhangzuizui avatar Jun 07 '23 07:06 zhangzuizui

Hi @zhangzuizui, I just ran the script you shared and it works fine on my side. Could you double-check your accelerate version?

younesbelkada avatar Jun 07 '23 11:06 younesbelkada
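
As an aside, a quick way to double-check which accelerate build is actually being imported (a generic check, not something from this thread):

import accelerate

# A source install from the main branch typically reports a ".dev0" version
# (e.g. "0.20.0.dev0"), while a PyPI release reports a plain version such as "0.19.0".
print(accelerate.__version__)
print(accelerate.__file__)  # confirms which installation is on the import path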

Sorry for bothering you. I've made a mistake.

I checked my accelerate version, and reinstalled it:

pip3 uninstall accelerate -y
pip3 install git+https://github.com/huggingface/accelerate
Collecting git+https://github.com/huggingface/accelerate
  Cloning https://github.com/huggingface/accelerate to /tmp/pip-req-build-e0s05ry2
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate /tmp/pip-req-build-e0s05ry2
  Resolved https://github.com/huggingface/accelerate to commit 0ab72613a7895ac161f7c021e6aa83eaef750963
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

With the model "facebook/opt-350m" and the code above, it works fine. But when I changed "facebook/opt-350m" to "bigscience/bloomz-7b1-mt" and set load_in_4bit to True (as I said before, load_in_8bit works fine, which is strange), the error occurs.

code:

from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

model_id = "bigscience/bloomz-7b1-mt"
accelerator = Accelerator()

config = LoraConfig(
    r=16, 
    lora_alpha=32, 
    lora_dropout=0.05, 
    bias="none", 
    task_type="CAUSAL_LM"
)

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_4bit=True)
model = prepare_model_for_kbit_training(model)

print(set(model.hf_device_map.values()))

model = get_peft_model(model, config)

model = accelerator.prepare(model)

log:

python3 test.py

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
{3}
Traceback (most recent call last):
  File "/opt/tiger/isp_llm_modeling/test.py", line 24, in <module>
    model = accelerator.prepare(model)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1182, in prepare
    result = tuple(
  File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1183, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1022, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1254, in prepare_model
    raise ValueError(
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example device_map={'':torch.cuda.current_device()} or device_map={'':torch.xpu.current_device()}

When I set load_in_8bit to True:

AutoModelForCausalLM.from_pretrained(
    model_id, 
    device_map="auto", 
    load_in_8bit=True
)

log:

python3 test.py

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
{0, 1, 2, 3}

Besides, I also found that adding some parameters to the from_pretrained call can cause an error too, like this:

AutoModelForCausalLM.from_pretrained(
    model_id, 
    device_map="auto", 
    load_in_8bit=True, 
    torch_dtype=torch.float32  # adding this may cause an error
)

zhangzuizui avatar Jun 07 '23 15:06 zhangzuizui

@zhangzuizui Thanks for providing the details. What hardware are you using (GPU type / VRAM)? Thanks!

younesbelkada avatar Jun 07 '23 15:06 younesbelkada

@younesbelkada I tested my code on 4x A100 (80GB) and on 4x V100; both raise the same error.

zhangzuizui avatar Jun 08 '23 02:06 zhangzuizui

Hmmm, I couldn't reproduce this on 2x NVIDIA T4 16GB. Can you confirm your accelerate & transformers versions?

younesbelkada avatar Jun 08 '23 08:06 younesbelkada

I tried updating both accelerate and transformers to the newest (master branch); that did not work. :(

zhangzuizui avatar Jun 08 '23 08:06 zhangzuizui

@younesbelkada After the merge of #1523, I can now load and train a 65B model on a 4x A100 machine, but it is very slow. Running nvidia-smi, I found the model occupies memory on all 4 A100s, but GPU-Util is 0% on three of them and only one is computing. Does NPP work like this?

dylanwwang avatar Jun 09 '23 08:06 dylanwwang

@dylanwwang This is odd; I think you may have put your entire model on a single GPU. How did you initialize your model? Using device_map="auto"?

younesbelkada avatar Jun 09 '23 08:06 younesbelkada

@younesbelkada The model is indeed divided across 4 GPUs and initialized with device_map="auto".

dylanwwang avatar Jun 09 '23 08:06 dylanwwang

Can you share the output of nvidia-smi during training?

younesbelkada avatar Jun 09 '23 08:06 younesbelkada

@younesbelkada [nvidia-smi screenshot attached] The GPU doing the computing is not fixed.

dylanwwang avatar Jun 09 '23 08:06 dylanwwang

@dylanwwang I think this is expected in Naive Pipeline Parallelism: the GPUs are used one by one while the other GPUs are kept idle, as explained above. Can you confirm, for instance, that during training the volatile GPU util changes over time (more specifically, GPU1 goes to 95% while all the others are at 0%, then GPU2 goes to 95% while all the others are at 0%, and so on)? Thanks!

younesbelkada avatar Jun 09 '23 10:06 younesbelkada
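
One way to confirm this is to poll per-GPU utilization while training runs in another process. Here is a small sketch using the pynvml package (an extra dependency, not something used in this thread):

import time
from pynvml import (
    nvmlInit,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetUtilizationRates,
)

nvmlInit()
handles = [nvmlDeviceGetHandleByIndex(i) for i in range(nvmlDeviceGetCount())]

# Print utilization once per second; under NPP the busy GPU should rotate
# (e.g. GPU0 high while the others sit near 0%, then GPU1, and so on).
for _ in range(60):
    utilization = [nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
    print(" | ".join(f"GPU{i}: {u:3d}%" for i, u in enumerate(utilization)))
    time.sleep(1)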

@younesbelkada Yes, in the order 3 -> 2 -> 1 -> 0. But even naive Data Parallel with AllReduce, shouldn't it be like this?

dylanwwang avatar Jun 09 '23 10:06 dylanwwang

device_map="auto" is not data parallelism, it's model parallelism (your model is split across the GPUs). It is not compatible with Data parallelism. If you want to combine data parallelism and model parallelism, you need to use FSDP or DeepSpeed.

sgugger avatar Jun 09 '23 12:06 sgugger
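
To make the distinction concrete, here is a rough sketch of the two loading patterns (illustrative only; the model name and GPU count are placeholders):

from transformers import AutoModelForCausalLM

model_id = "facebook/opt-1.3b"  # placeholder

# Option A: naive pipeline (model) parallelism.
# A single process; the layers are spread across all visible GPUs.
# Launch with:  python train.py
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", load_in_8bit=True
)

# Option B: data parallelism (one process per GPU, each holding a full copy).
# Each process loads the whole model onto its own GPU instead of sharding it.
# Launch with:  torchrun --nproc_per_node=4 train.py  (or accelerate launch)
#
# from accelerate import Accelerator
# accelerator = Accelerator()
# model = AutoModelForCausalLM.from_pretrained(
#     model_id, device_map={"": accelerator.process_index}, load_in_8bit=True
# )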

@sgugger @kevinuserdd OK, but DeepSpeed OOMs when loading the model, so is FSDP/DeepSpeed not compatible with transformers quantization?

dylanwwang avatar Jun 10 '23 03:06 dylanwwang

Hi @dylanwwang, as far as I know, FSDP is unfortunately not compatible with transformers quantization :/

younesbelkada avatar Jun 12 '23 07:06 younesbelkada

Is there an existing issue for this?

haochen806 avatar Sep 13 '23 03:09 haochen806

As far as I know this is on the roadmap for bitsandbytes; feel free to open an issue there!

younesbelkada avatar Sep 13 '23 07:09 younesbelkada

Hi everyone,

I encountered a similar error with llama3 - 7b.

ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}

To address this issue, I tried the following solution:

device_index = Accelerator().process_index
device_map = {"": device_index}

While this helped resolve the initial error, it has now led to an Out of Memory (OOM) issue. I find this situation somewhat unexpected, considering I am using 8 Nvidia A100 GPUs (each with 40GB of memory) and have never experienced OOM errors with this configuration when working with models of similar size. I am currently performing QLoRA during the fine-tuning process.

Following are the versions of the libraries I am using:

transformers = 4.40.1
accelerate = 0.30.0.dev0
trl = 0.8.6
peft = 0.10.0

I tried using device_map={"":0}, but I am still encountering an Out of Memory (OOM) error.

Here are my LoRA params:

Rank = 8
Alpha = 8
Using 4-bit quantization while loading the base model.

Has anyone figured out a solution to this problem? Thanks in advance!

mano3-1 avatar Apr 25 '24 14:04 mano3-1
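
In case it helps anyone hitting the same OOM, here is a hedged sketch of a 4-bit QLoRA load for a multi-GPU data-parallel run; the model name and hyperparameters are placeholders, and whether this avoids the OOM depends on sequence length, batch size, and optimizer state.

import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Each data-parallel process loads its own quantized copy on its own GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"": Accelerator().process_index},
)

# Gradient checkpointing trades extra compute for a large activation-memory saving.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

lora_config = LoraConfig(
    r=8, lora_alpha=8, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)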