
AutoGPTQ not working with HuggingFace accelerate (multi GPU)

Open JeevanBhoot opened this issue 1 year ago • 4 comments

If I run the following command:

accelerate launch -m lm_eval --model hf --model_args "pretrained=TheBloke/Llama-2-7B-Chat-GPTQ,gptq=True,load_in_4bit=True" --tasks "arc_challenge" --num_fewshot 25 --batch_size auto

I get the following error:

ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`.

If I add device_map_option=balanced to --model_args, I get the same error.

But if I try an unquantized model with multi GPU e.g. model_args="pretrained=meta-llama/Llama-2-7b-chat-hf", it works perfectly.

Is 4-bit AutoGPTQ compatible with multi GPU (accelerate)?

I installed AutoGPTQ from source, as follows:

pip install "git+https://github.com/PanQiWei/[email protected]"

instead of pip install -e ".[gptq]", because I encountered errors with the latter.

GPTQ with single GPU works fine.

JeevanBhoot avatar Jan 05 '24 12:01 JeevanBhoot

What are the contents of your accelerate config?

haileyschoelkopf avatar Jan 05 '24 15:01 haileyschoelkopf

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

JeevanBhoot avatar Jan 05 '24 16:01 JeevanBhoot

Note that we've changed the GPTQ argument name to autogptq=True to match the library, and that load_in_4bit=True should not be set if using GPTQ. Does making these changes solve your issue?
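For reference, a sketch of the original command with both suggested changes applied (autogptq=True substituted for gptq=True, load_in_4bit=True dropped; argument names assumed from this comment, not verified against a specific release):

```shell
# Hedged rewrite of the reporter's command per the maintainer's suggestion:
# rename gptq=True -> autogptq=True and remove load_in_4bit=True.
accelerate launch -m lm_eval --model hf \
  --model_args "pretrained=TheBloke/Llama-2-7B-Chat-GPTQ,autogptq=True" \
  --tasks arc_challenge --num_fewshot 25 --batch_size auto
```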

We will ideally be reexamining our quantization integrations as part of #1225 to make error messaging around these sorts of conflicts more intuitive.

haileyschoelkopf avatar Jan 05 '24 18:01 haileyschoelkopf

I'm using v0.4.0, so I believe the argument is still gptq=True? Using autogptq=True threw an error anyway.

I tried removing load_in_4bit=True, but the error persists.

JeevanBhoot avatar Jan 05 '24 21:01 JeevanBhoot

I see. I'd recommend installing the current main branch from GitHub and checking whether the error persists. We'll put out a v0.4.1 on PyPI very soon.

haileyschoelkopf avatar Jan 08 '24 14:01 haileyschoelkopf

@JeevanBhoot to narrow this down, could you try inserting device_map={"":torch.cuda.current_device()} into the from_pretrained() call?

It sounds as though, even though you are not passing a device_map or using parallelize=True in the first snippet (which is our intended behavior and use case), Hugging Face or AutoGPTQ is nevertheless inferring a device map, which we don't want. (Again, load_in_4bit=True should also be removed here.)
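A minimal sketch of that experiment, wrapped in a hypothetical helper for clarity: device_map={"": device} is the Hugging Face convention for placing the entire model (the root module "") on a single device, and under accelerate launch each process's LOCAL_RANK environment variable should match torch.cuda.current_device(). The helper name and the LOCAL_RANK fallback are illustrative assumptions, not part of the harness.

```python
import os

def single_device_map(local_rank=None):
    """Build a device_map pinning the whole model ("" = root module) to one GPU.

    Hypothetical helper: the idea is to pass
    from_pretrained(..., device_map=single_device_map()) so that each
    accelerate process keeps a full model replica on its own device,
    instead of letting HF/AutoGPTQ infer a sharded map that conflicts
    with the distributed launch.
    """
    if local_rank is None:
        # accelerate launch exports LOCAL_RANK for each spawned process;
        # fall back to 0 for a plain single-process run.
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
    return {"": local_rank}
```

In-tree, this corresponds to hard-coding device_map={"": torch.cuda.current_device()} in the from_pretrained() call as suggested above; the helper form just makes the rank resolution explicit.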

haileyschoelkopf avatar Jan 09 '24 21:01 haileyschoelkopf

To which from_pretrained() calls should I add the device_map argument? I added it to all from_pretrained() calls in huggingface.py but the error persisted.

I have removed load_in_4bit=True and I am now working from main.

JeevanBhoot avatar Jan 10 '24 10:01 JeevanBhoot