AutoGPTQ not working with HuggingFace accelerate (multi GPU)
If I run the following command:
```shell
accelerate launch -m lm_eval --model hf --model_args "pretrained=TheBloke/Llama-2-7B-Chat-GPTQ,gptq=True,load_in_4bit=True" --tasks "arc_challenge" --num_fewshot 25 --batch_size auto
```
I get the following error:
```
ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`.
```
If I add `device_map_option=balanced` to `--model_args`, I get the same issue.
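For reference, that attempt was roughly the same invocation as above with the option appended to the comma-separated `--model_args` string (a sketch, assuming no other arguments change):

```shell
accelerate launch -m lm_eval --model hf \
    --model_args "pretrained=TheBloke/Llama-2-7B-Chat-GPTQ,gptq=True,load_in_4bit=True,device_map_option=balanced" \
    --tasks "arc_challenge" --num_fewshot 25 --batch_size auto
```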
But if I try an unquantized model with multi GPU, e.g. `model_args="pretrained=meta-llama/Llama-2-7b-chat-hf"`, it works perfectly.
Is 4-bit AutoGPTQ compatible with multi GPU (accelerate)?
I installed AutoGPTQ from source with `pip install "git+https://github.com/PanQiWei/[email protected]"` instead of `pip install -e ".[gptq]"`, because I encountered errors with the latter.
GPTQ with single GPU works fine.
What are the contents of your accelerate config?
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
Note that we've changed the GPTQ argument name to `autogptq=True` to match the library, and that `load_in_4bit=True` should not be set when using GPTQ. Does making these changes solve your issue?
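For example, a sketch of the adjusted invocation, assuming the rest of the original command stays the same:

```shell
accelerate launch -m lm_eval --model hf \
    --model_args "pretrained=TheBloke/Llama-2-7B-Chat-GPTQ,autogptq=True" \
    --tasks "arc_challenge" --num_fewshot 25 --batch_size auto
```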
We will ideally be reexamining our quantization integrations as part of #1225 to make error messaging around these sorts of conflicts more intuitive.
I'm using v0.4.0, so I believe the argument is still `gptq=True`? Using `autogptq=True` threw an error anyway. I tried removing `load_in_4bit=True`, but the error still persists.
I see; I'd recommend trying with the current `main` branch installed from GitHub and seeing if there is still an error. We'll put out a v0.4.1 on PyPI very soon.
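Something along these lines should get you onto `main` (a sketch mirroring the editable-install pattern mentioned earlier; add extras as needed):

```shell
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
```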
@JeevanBhoot to narrow this down, could you try inserting `device_map={"": torch.cuda.current_device()}` into the `from_pretrained()` call? It sounds as though, despite you not passing a `device_map` or using `parallelize=True` in the first snippet (this is our intended behavior and use case), Hugging Face or AutoGPTQ is inferring some device map nevertheless, which we don't want. (Again, `load_in_4bit=True` should also be removed here.)
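A minimal sketch of what that could look like, assuming the model is loaded through `transformers`; the actual `from_pretrained()` call sites in `huggingface.py` take more arguments, and GPTQ checkpoints may go through a different loader:

```python
import torch
from transformers import AutoModelForCausalLM

# Pin the whole model to this process's GPU instead of letting an inferred
# device_map shard it across devices, which conflicts with launching one
# process per GPU via accelerate. The checkpoint name is illustrative,
# taken from the report above.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GPTQ",
    device_map={"": torch.cuda.current_device()},
)
```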
To which `from_pretrained()` calls should I add the `device_map` argument? I added it to all `from_pretrained()` calls in `huggingface.py`, but the error persisted.
I have removed `load_in_4bit=True` and I am now working from `main`.