
Cannot Replicate Reported GSM8K-CoT Results from HF Model Using GPTQModel Codebase

Open Eijnewgnaw opened this issue 10 months ago • 4 comments

Hi,

Firstly, I want to commend you on the incredible work you've done with GPTQModel. It's a truly innovative approach, and I'm very excited about the potential it holds.

I’ve been working with the Meta-Llama-3.2-1B-Instruct model from the HuggingFace page below, and I’m having some issues replicating the reported results for GSM8K-CoT:

https://huggingface.co/ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5

I am quantizing the model from HuggingFace myself using the author's codebase as outlined in the README, with exactly the same hyperparameters as those provided in the link above. Despite this, I'm seeing a significant performance gap (~20 points) on GSM8K-CoT. The difference is not isolated to GSM8K either: HumanEval and ARC-Challenge also drop by several percentage points.

What I did

Model: Meta-Llama-3.2-1B-Instruct from HuggingFace

Quantization: GPTQ with bits=4, group_size=32, desc_act=True, static_groups=True, following the author's README example for quantization

Kernel: Auto-selected MarlinQuantLinear

Evaluation: lm-eval with task gsm8k_cot_llama

Calibration datasets: tested both wikitext2 and c4

Sampling settings:

do_sample=False

temperature=None

top_p=None (these were set to ensure reproducibility and reduce sampling variance — the gap persists even with recommended settings)
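For reference, the quantization step followed the README pattern roughly as below. This is a sketch, not my exact script: the model path and output directory are placeholders, and the `QuantizeConfig` argument names should be double-checked against the installed gptqmodel version.

```python
# Sketch of the quantization run described above (placeholders for paths;
# verify argument names against your installed gptqmodel version).
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Calibration data: non-empty wikitext2 rows (c4 was tested the same way).
calibration = load_dataset(
    "wikitext", "wikitext-2-raw-v1", split="train"
).filter(lambda row: len(row["text"].strip()) > 0)["text"][:1024]

quant_config = QuantizeConfig(
    bits=4,
    group_size=32,
    desc_act=True,
    static_groups=True,
)

model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quant_config)
model.quantize(calibration)
model.save("Llama-3.2-1B-Instruct-gptq-4bit")
```

Evaluation was then run with lm-eval on the gsm8k_cot_llama task against the saved checkpoint, with greedy decoding as noted above.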

My questions

Was this model quantized after instruction tuning or CoT fine-tuning?

Was any special prompt formatting or chat template applied during the evaluation phase?

Are there any internal differences that might explain the ~20% performance gap I’m seeing on GSM8K-CoT, and the similar drops on other tasks like HumanEval / ARC-Challenge?

Reason for urgency

I am planning to build upon your codebase and make some innovative modifications to explore new research directions. Given the potential for meaningful contributions to the field, I am eager to resolve this issue as soon as possible to continue with the next steps in my work.

I would greatly appreciate your help, and I’m hoping for your guidance on what might be causing this discrepancy. Your timely response would be incredibly valuable for my ongoing project.

Thank you so much for your time and for sharing such an impactful codebase!

Best regards, A struggling graduate student

Eijnewgnaw avatar Apr 25 '25 09:04 Eijnewgnaw

Nice idea. We did consider a "custom" type where you manually provide the models, so you can (essentially) create these sorts of things. The tricky part is being flexible enough to give good coverage across the board. There's specifying models explicitly (below) and a generic filter option (i.e. include/exclude filter types).

Would something like this work - not supported yet obviously, just brainstorming:

      - url: "..."
        name: "flarellm"
        priority: 100
        type: "custom"
        profile: "cloudflare-ai" # <--- only available for custom types
        models:
        - meta/llama-3.2-1b-instruct
        - meta/llama-3.3-1b-instruct

The core APIs etc. would be configured via the existing profile infrastructure; that way you can configure the same provider and customise the models. Auth would be a prerequisite for this.
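To make the profile side of that brainstorm concrete, a companion profile entry might look something like the following. This is entirely hypothetical: none of these keys exist yet, and the base URL is just a placeholder shape.

      # Hypothetical companion profile -- none of these keys are implemented yet.
      profiles:
        cloudflare-ai:
          base_url: "https://api.cloudflare.com/client/v4/accounts/<account-id>/ai"
          auth:
            type: "bearer"
            token_env: "CLOUDFLARE_API_TOKEN"

The idea being that "profile" resolves the endpoint and auth details once, and each custom provider entry only has to list its models.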

Auth endpoints are also coming, just having some issues to work through. I think that will be ready in a couple of weeks.

thushan avatar Aug 14 '25 11:08 thushan

That config looks good! The "profile" name may need some bikeshedding, as you also have "type" there, which I had assumed referred to the profile (unless I've misunderstood).

Also, for posterity, the model names would include the "@cf" prefix, since Cloudflare also has other prefixes such as "@hf" (Hugging Face).

Auth endpoints are also coming, just having some issues to work through. I think that will be ready in a couple of weeks.

Sweet! Let me know if there's a way I can help.

ghostdevv avatar Aug 15 '25 02:08 ghostdevv