ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices.
@younesbelkada (Thanks again for developing these great libraries and responding on Github!)
Related issue: https://github.com/huggingface/accelerate/issues/1412
With the bleeding-edge transformers, I cannot combine PEFT and accelerate to do parameter-efficient fine-tuning with naive pipeline parallelism (i.e., splitting a model loaded in 8-bit across multiple GPUs).
Do PEFT and accelerate not support this use case? The code works on an earlier transformers version, so I am wondering about it.
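Roughly, the setup I mean looks like this (a minimal sketch with a placeholder model; my actual script differs in the details):
# Minimal sketch (placeholder model id): load the model in 8-bit, let accelerate
# split it across the visible GPUs (naive pipeline parallelism), then attach LoRA adapters.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

model_id = "facebook/opt-350m"  # placeholder

# device_map="auto" shards the 8-bit model across all available GPUs
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")
model = prepare_model_for_int8_training(model)

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Passing this model to transformers.Trainer then fails inside accelerator.prepare(...)
# with the error below.
The traceback I get is: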
File "/home/ec2-user/.local/lib/python3.7/site-packages/transformers/trainer.py", line 1665, in train
ignore_keys_for_eval=ignore_keys_for_eval,
File "/home/ec2-user/.local/lib/python3.7/site-packages/transformers/trainer.py", line 1768, in _inner_training_loop
self.model, self.optimizer, self.lr_scheduler
File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/accelerator.py", line 1144, in prepare
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/accelerator.py", line 1144, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/accelerator.py", line 995, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/accelerator.py", line 1201, in prepare_model
"You can't train a model that has been loaded in 8-bit precision on multiple devices."
ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices.
Here is the relevant subset of the pip3 list output showing the package versions:
Package Version
------------------------ -----------
accelerate 0.19.0
transformers 4.30.0.dev0
peft 0.3.0
Same error here. Did you solve it?
Hi @akkikiki
Thanks so much for your kind words and the report
I have dug into the problem, and it appears it was my mistake: I forgot to add an extra check. NPP should not be supported under any distributed regime by definition, as the NPP paradigm is purely sequential (i.e., it should be run just with python xxxx.py).
https://github.com/huggingface/accelerate/pull/1523 should hopefully fix the issue
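To illustrate what I mean by "purely sequential" (an illustrative sketch only, not the actual check added in the PR):
# Illustrative sketch (not the code from the accelerate PR): a model that is split
# across GPUs with naive pipeline parallelism has to live in a single process.
from accelerate import Accelerator

accelerator = Accelerator()

# `python train.py` launches exactly one process; `torchrun` / `accelerate launch`
# with several processes is a distributed regime, which NPP cannot support.
if accelerator.num_processes > 1:
    raise ValueError(
        "Naive pipeline parallelism (a model sharded with device_map='auto') "
        "must be launched with a single process, e.g. `python train.py`."
    )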
Does my explanation make sense? Please let me know if you have any question
Aren't multiple devices and sequential execution in conflict? The problem I am encountering now is that accelerate does not support quantized models running on multiple devices, but using just a single GPU will OOM.
Aren't multiple devices and sequential execution in conflict?
What I meant by sequential is that the activations and gradients are passed from one GPU to another sequentially, one by one. In that case I don't see why these are in conflict, as long as the other GPUs are kept idle while the active one is computing the gradients and activations.
If you use PEFT to train your model and load it across multiple GPUs, with #1523 it should be possible.
Thank you so much @younesbelkada!
Yes, (at least currently) I am NOT looking for distributed training (e.g., distributed data parallel through torchrun) when load_in_8bit (or 4-bit) is turned on. Only NPP.
Looking forward to https://github.com/huggingface/accelerate/pull/1523 being merged!
@dylanwwang https://github.com/huggingface/accelerate/pull/1523 should solve your error too :)
Thanks a lot @akkikiki !!
Same error here.
This should be fixed if you uninstall accelerate and re-install it from source
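For example (the same commands that appear later in this thread):
pip3 uninstall accelerate -y
pip3 install git+https://github.com/huggingface/accelerate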
Hi, thanks for your great work. I still find some problems with the fixed version.
If I set load_in_4bit to True (not load_in_8bit; 8-bit works fine), the code still cannot run.
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
model_id = "facebook/opt-350m"
accelerator = Accelerator()
config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_4bit=True)
model = prepare_model_for_kbit_training(model)
print(set(model.hf_device_map.values()))
model = get_peft_model(model, config)
model = accelerator.prepare(model)
Hi @zhangzuizui I just ran the script you shared and it works fine on my side, could you double check your accelerate version?
Sorry for bothering you. I've made a mistake.
I checked my accelerate version, and reinstalled it:
pip3 uninstall accelerate -y
pip3 install git+https://github.com/huggingface/accelerate
Collecting git+https://github.com/huggingface/accelerate
Cloning https://github.com/huggingface/accelerate to /tmp/pip-req-build-e0s05ry2
Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate /tmp/pip-req-build-e0s05ry2
Resolved https://github.com/huggingface/accelerate to commit 0ab72613a7895ac161f7c021e6aa83eaef750963
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
With the model "facebook/opt-350m" and code above, that works fine, but when I changed "facebook/opt-350m" to "bigscience/bloomz-7b1-mt", and set load_in_4bit as True (as I said before, load_in_8bit works fine, it's strange), the error would occur.
code:
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
model_id = "bigscience/bloomz-7b1-mt"
accelerator = Accelerator()
config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_4bit=True)
model = prepare_model_for_kbit_training(model)
print(set(model.hf_device_map.values()))
model = get_peft_model(model, config)
model = accelerator.prepare(model)
log:
python3 test.py
===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
{3}
Traceback (most recent call last):
File "/opt/tiger/isp_llm_modeling/test.py", line 24, in <module>
model = accelerator.prepare(model)
File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1182, in prepare
result = tuple(
File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1183, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1022, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1254, in prepare_model
raise ValueError(
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example device_map={'':torch.cuda.current_device()} or device_map={'':torch.xpu.current_device()}
When I set load_in_8bit to True:
AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
load_in_8bit=True
)
log:
python3 test.py
===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
{0, 1, 2, 3}
Besides, I just found that adding some parameters to the from_pretrained method may cause an error too, like this:
AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
load_in_8bit=True,
torch_dtype=torch.float32  # adding this may cause an error
)
@zhangzuizui thanks for providing the details. What hardware are you using (how much GPU VRAM)? Thanks!
@younesbelkada I tested my code on 4 A100s (80GB) and 4 V100s; both raise the same error.
Hmmm, I couldn't reproduce on 2x NVIDIA T4 16GB. Can you confirm your accelerate & transformers versions?
I tried updating both accelerate and transformers to the newest (master branch), but that did not work. :(
@younesbelkada After the merge, I can now load and train a 65B model on a 4xA100 machine, but it is very slow. Running the nvidia-smi command, I found the model holds memory on all 4 A100 GPUs, but 3 of them show 0% GPU-Util and only one is computing. Does NPP work like this?
@dylanwwang this is odd, I think you have put your entire model on a single GPU. How did you initialize your model? Using device_map="auto"?
@younesbelkada The model is indeed divided across 4 GPUs and initialized with device_map="auto".
can you share the output of nvidia-smi during training?
@younesbelkada [nvidia-smi output omitted]
Also, the GPU that is computing is not fixed.
@dylanwwang I think this is expected in Naive Pipeline Parallelism: the GPUs are used one by one while the other GPUs are kept idle, as explained above. Can you confirm, for instance, that during training the volatile GPU util changes over time (more specifically, GPU1 goes to 95% while all the others are at 0%, then GPU2 goes to 95% while all the others are at 0%, and so on)? Thanks!
@younesbelkada Yes, in the order 3 -> 2 -> 1 -> 0. But shouldn't it behave like naive Data Parallel with AllReduce?
But shouldn't it behave like naive Data Parallel with AllReduce?
device_map="auto" is not data parallelism, it's model parallelism (your model is split across the GPUs). It is not compatible with Data parallelism. If you want to combine data parallelism and model parallelism, you need to use FSDP or DeepSpeed.
@sgugger @kevinuserdd OK, but DeepSpeed will OOM when loading the model, so are FSDP/DeepSpeed not compatible with transformers quantization?
Hi @dylanwwang, as far as I know FSDP is unfortunately not compatible with transformers quantization :/
Is there an existing issue for this?
As far as I know this is on the roadmap for bitsandbytes; feel free to post an issue there!
Hi everyone,
I encountered a similar error with llama3 - 7b.
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example device_map={'':torch.cuda.current_device()} or device_map={'':torch.xpu.current_device()}
To address this issue, I tried the following solution:
device_index = Accelerator().process_index
device_map = {"": device_index}
While this helped resolve the initial error, it has now led to an Out of Memory (OOM) issue. I find this situation somewhat unexpected, considering I am using 8 Nvidia A100 GPUs (each with 40GB of memory) and have never experienced OOM errors with this configuration when working with models of similar size. I am currently performing QLoRA during the fine-tuning process.
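For reference, this is roughly how the per-process device_map is passed together with 4-bit loading (a sketch only; the model id and quantization arguments below are placeholders rather than my exact configuration):
# Sketch: place the whole 4-bit model on the GPU that belongs to this process
# (placeholder model id and quantization settings, not my exact config).
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

device_index = Accelerator().process_index
device_map = {"": device_index}  # "" means the entire model goes on this device

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # placeholder model id
    quantization_config=bnb_config,
    device_map=device_map,
)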
Following are the versions of the libraries that I am using:
transformers=4.40.1
accelerate=0.30.0.dev0
trl=0.8.6
peft=0.10.0
I tried using device_map={"":0}, but I am still encountering an Out of Memory (OOM) error.
Here are my LoRA params:
Rank = 8
Alpha = 8
Using 4-bit quantization while loading the base model.
Has anyone figured out a solution to this problem? Thanks in advance!