
accelerate does not support multi-GPU loading of 8-bit models

kevinuserdd opened this issue 1 year ago • 14 comments

ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on.

ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices
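For context, a minimal sketch of a setup that triggers this error (the model id is a placeholder, not from this issue): with more than one visible GPU, device_map="auto" shards the quantized model across devices, and accelerator.prepare() then refuses to train it.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Hypothetical repro sketch; the model id is a placeholder.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",  # placeholder
    device_map="auto",  # shards layers across all visible GPUs
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
# Handing this sharded 8-bit model to transformers.Trainer fails inside
# accelerator.prepare() with the ValueError above.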

kevinuserdd • Jun 05 '23 08:06

same error here

itjuba • Jun 05 '23 13:06

same issue here.

FHL1998 • Jun 05 '23 15:06

same error here

zyxyxz • Jun 05 '23 17:06

same error

Ted8000 • Jun 06 '23 03:06

Solution to the problem: https://github.com/huggingface/accelerate/issues/1515#issuecomment-1577151399
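The error message itself points at the same fix: load the whole quantized model onto the current process's device instead of sharding it. A minimal sketch (placeholder model id; assumes one GPU per DDP process):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch of the device_map workaround quoted in the error message below;
# the model id is a placeholder. Each DDP process loads a full copy of
# the quantized model onto its own GPU.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",  # placeholder
    device_map={"": torch.cuda.current_device()},
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)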

lin1490188 • Jun 06 '23 06:06

https://github.com/huggingface/accelerate/pull/1523 has been merged; if you uninstall accelerate and reinstall it from source:

pip install git+https://github.com/huggingface/accelerate.git

it should be fixed
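You can check that the source install is the one being picked up:

python -c "import accelerate; print(accelerate.__version__)"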

younesbelkada • Jun 06 '23 12:06

I installed the latest version of accelerate:

pip install git+https://github.com/huggingface/accelerate.git

loaded the model:

device_map = 'auto'
device_map = {0: '30000MB', 1: '30000MB', 2: '30000MB', 3: '30000MB'}
model = BloomForCausalLM.from_pretrained(
      args.model_name_or_path,
      device_map=device_map,
      max_memory=max_memory,
      load_in_4bit=True,
      torch_dtype=torch.float16,
      quantization_config=BitsAndBytesConfig(
          load_in_4bit=True,
          bnb_4bit_compute_dtype=torch.float16,
          bnb_4bit_use_double_quant=True,
          bnb_4bit_quant_type="nf4",
          llm_int8_threshold=6.0,
          llm_int8_has_fp16_weight=False,
      ),
  )

and ran the code:

CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py

but it still throws the error:

  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1756, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 1182, in prepare
    result = tuple(
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 1183, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 1022, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 1258, in prepare_model
    raise ValueError(
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device()}` or `device_map={'':torch.xpu.current_device()}`

Looking forward to your answer, thanks @younesbelkada

yangjianxin1 • Jun 11 '23 10:06

I am using 4× V100 GPUs.

yangjianxin1 • Jun 11 '23 10:06

Hi @yangjianxin1, it seems there is a mistake in your script: the memory dict is assigned to device_map, so device_map = 'auto' is overwritten and max_memory is never set. Use instead:

device_map = 'auto'  # let accelerate decide the layer placement
max_memory = {0: '30GB', 1: '30GB', 2: '30GB', 3: '30GB'}  # per-GPU cap
model = BloomForCausalLM.from_pretrained(
    args.model_name_or_path,
    device_map=device_map,
    max_memory=max_memory,
    load_in_4bit=True,
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        llm_int8_threshold=6.0,
        llm_int8_has_fp16_weight=False,
    ),
)

younesbelkada • Jun 12 '23 07:06

Thanks for your reply. I tried this setting, but it still throws the same error.

yangjianxin1 • Jun 12 '23 07:06

@yangjianxin1 Thanks! What is the model you are trying to fit?

younesbelkada • Jun 12 '23 07:06

Also, can you print the result of model.hf_device_map after loading the model?
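For reference, a minimal sketch of that check (placeholder model id):

from transformers import BloomForCausalLM, BitsAndBytesConfig

model = BloomForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",  # placeholder
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
# hf_device_map records which device each submodule landed on, e.g.
# {'transformer.word_embeddings': 0, 'transformer.h.0': 1, ...}
print(model.hf_device_map)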

younesbelkada • Jun 12 '23 07:06

Thanks for your help, I have solved the problem: set ddp_find_unused_parameters=False, as in this code: https://github.com/yangjianxin1/Firefly/blob/master/train_qlora.py#L104
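A minimal sketch of that fix (every value except ddp_find_unused_parameters is a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",  # placeholder
    per_device_train_batch_size=1,  # placeholder
    # The fix reported in this thread: when training a PEFT/QLoRA model
    # under DDP, disable the unused-parameter search.
    ddp_find_unused_parameters=False,
)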

yangjianxin1 • Jun 19 '23 03:06

Updated accelerate from 0.21 to 0.23 and it got fixed!
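If you want the fixed release rather than a source install (the version pin here is illustrative):

pip install -U "accelerate>=0.23"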

sadransh • Sep 24 '23 00:09