If I run the following command:

```
accelerate launch -m lm_eval --model hf --model_args "pretrained=TheBloke/Llama-2-7B-Chat-GPTQ,gptq=True,load_in_4bit=True" --tasks "arc_challenge" --num_fewshot 25 --batch_size auto
```

I get the following error:

```
ValueError: You...
```
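A likely cause is that the GPTQ checkpoint already ships its own `quantization_config`, so also requesting bitsandbytes 4-bit loading (`load_in_4bit=True`) gives transformers two competing quantization configs. A minimal sketch of the same load in plain transformers, under that assumption (only the model ID comes from the question):

```python
from transformers import AutoModelForCausalLM

# Hypothesis: the checkpoint's baked-in GPTQ quantization_config conflicts
# with the bitsandbytes load_in_4bit flag, raising the ValueError. Loading
# with the GPTQ config alone avoids the clash.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GPTQ",
    device_map="auto",  # GPTQ kernels run on GPU
)
```

The matching lm_eval invocation would then simply drop `load_in_4bit=True` from `--model_args`.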
**Describe the bug** I am trying to run AutoGPTQ on Llama-7B on an RTX 4090 with num_samples > 128 but run out of memory (OOM). I thought that the number of samples would not...
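For context, GPTQ-style quantizers cache each layer's inputs for every calibration sample, so GPU memory does grow with `num_samples`. A hedged sketch of one mitigation, assuming a recent AutoGPTQ release where `quantize()` accepts a `cache_examples_on_gpu` flag (model ID and calibration text are placeholders):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder Llama-7B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder calibration set; a real run would use varied text samples.
examples = [
    tokenizer("The quick brown fox jumps over the lazy dog.",
              return_tensors="pt")
    for _ in range(256)
]

model = AutoGPTQForCausalLM.from_pretrained(
    model_id, BaseQuantizeConfig(bits=4, group_size=128)
)

# Layer inputs are cached for every calibration sample, so memory scales
# with num_samples; keeping that cache on CPU trades speed for GPU headroom.
model.quantize(examples, cache_examples_on_gpu=False)
```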
I have loaded the model `TheBloke/Llama-2-7B-Chat-GPTQ` from Hugging Face, and its final linear layer (`lm_head`) is a standard, unquantized linear layer. Is there a way to quantize it too? Would performance drastically decrease?
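A quick check confirms what the question describes (a sketch, assuming a transformers install with GPTQ support that can load this checkpoint): the conversion quantizes the attention/MLP projections but leaves the output head as a plain `torch.nn.Linear`.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GPTQ", device_map="auto"
)

# Expected: a plain Linear for the head, a quantized module inside the blocks.
print(type(model.lm_head))
print(type(model.model.layers[0].self_attn.q_proj))
```

GPTQ tooling commonly leaves `lm_head` in full precision because quantizing the output projection tends to hurt accuracy more than quantizing the inner layers.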
**Describe the bug** Running FP8 PTQ of Llama3-8B on 1x RTX 4090 (24 GB) runs out of memory (OOM). Is this expected? vLLM's FP8 quantization works on the same GPU. What are the minimum requirements...
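For reference, a minimal FP8 PTQ recipe along these lines, written against llm-compressor's documented `oneshot` + `QuantizationModifier` API (the checkpoint name is a placeholder, and memory behavior may differ across versions):

```python
from transformers import AutoModelForCausalLM
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# device_map="auto" lets accelerate spill layers to CPU when 24 GB is not
# enough for the BF16 weights plus the quantization workspace.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder Llama3-8B checkpoint
    torch_dtype="auto",
    device_map="auto",
)

# FP8 dynamic quantization needs no calibration data, so the dominant
# memory cost is holding the model itself.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)
oneshot(model=model, recipe=recipe)
```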
**Describe the bug** I am trying to quantize Llama3.1 using GPTQ but encounter an error complaining that tensors are split across CPU and GPU. This used to work for Llama3 on...
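One common source of mixed CPU/GPU tensor errors during GPTQ calibration is `device_map="auto"` sharding the model across devices. As a hedged workaround sketch (model ID assumed; this does not diagnose any Llama3.1-specific regression), pin everything to one device before quantizing:

```python
import torch
from transformers import AutoModelForCausalLM

# Load fully onto one device so calibration activations and weights never
# end up split between CPU and GPU. Requires enough VRAM for FP16 weights.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # assumed checkpoint
    torch_dtype=torch.float16,
).to("cuda")
```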