GPTQModel [BUG] ValueError: Quantization: Failed due to NaN loss

Describe the bug

Quantizing an 8B model (sail/Sailor2-8B) fails with "ValueError: Quantization failed due to NaN loss", while the process succeeds for a 1B model.

GPU Info

nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:F5:00.0 Off |                  N/A |
|  0%   31C    P8              17W / 350W |  13005MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

To Reproduce

import torch
from datasets import load_dataset
from gptqmodel import BACKEND, GPTQModel, QuantizeConfig
from gptqmodel.quantization import FORMAT

model_id = "sail/Sailor2-8B"

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    format=FORMAT.GPTQ,
)

model = GPTQModel.load(
    model_id,
    torch_dtype=torch.bfloat16,
    quantize_config=quant_config,
    trust_remote_code=True,
    device_map="auto"
)

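def get_calib(tokenizer, nsamples, seqlen):
    # Keep only samples whose text has at least `seqlen` characters
    # (this is a character count, not a token count).
    traindata = load_dataset("itdainb/calibrate_vn", split="train").filter(
        lambda x: len(x["text"]) >= seqlen)

    # Tokenize the first `nsamples` of the remaining samples.
    return [tokenizer(example["text"]) for example in traindata.select(range(nsamples))]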

calib_dataset = get_calib(model.tokenizer, 512, 2048)

model.quantize(
    calib_dataset,
    batch_size=8,
    backend=BACKEND.TRITON,
    calibration_dataset_concat_size=2048,
)

model.model.config._name_or_path = model_id
model.save(f"./models/{model_id.split('/')[-1]}")

Model/Datasets

Model: Sailor2-8B
Dataset: itdainb/calibrate_vn

Additional context

|█---------------------------------------| 0:00:00 / 0:00:00 [1/32] 3.1%
INFO - {'layer': 0, 'module': 'self_attn.k_proj', 'loss': '0.21572', 'damp': '0.01000', 'time': '1.411', 'fwd_time': '13.106'}
INFO - {'layer': 0, 'module': 'self_attn.v_proj', 'loss': '0.04058', 'damp': '0.01000', 'time': '1.784', 'fwd_time': '13.106'}
INFO - {'layer': 0, 'module': 'self_attn.q_proj', 'loss': '0.52667', 'damp': '0.01000', 'time': '1.198', 'fwd_time': '13.106'}
INFO - {'layer': 0, 'module': 'self_attn.o_proj', 'loss': '0.03305', 'damp': '0.01000', 'time': '1.175', 'fwd_time': '11.385'}
INFO - {'layer': 0, 'module': 'mlp.up_proj', 'loss': '0.80327', 'damp': '0.01000', 'time': '1.664', 'fwd_time': '12.352'}
INFO - {'layer': 0, 'module': 'mlp.gate_proj', 'loss': '1.07150', 'damp': '0.01000', 'time': '1.518', 'fwd_time': '12.352'}
INFO - {'layer': 0, 'module': 'mlp.down_proj', 'loss': '0.00920', 'damp': '0.01000', 'time': '6.979', 'fwd_time': '40.708'}
 Quantizing mlp.down_proj in layer 0 of 31 |██--------------------------------------| 0:01:45 / 0:28:00 [2/32] 6.2%
INFO - {'layer': 1, 'module': 'self_attn.k_proj', 'loss': '0.10450', 'damp': '0.01000', 'time': '1.169', 'fwd_time': '11.278'}
INFO - {'layer': 1, 'module': 'self_attn.v_proj', 'loss': '0.02623', 'damp': '0.01000', 'time': '0.995', 'fwd_time': '11.278'}
INFO - {'layer': 1, 'module': 'self_attn.q_proj', 'loss': '0.38751', 'damp': '0.01000', 'time': '1.066', 'fwd_time': '11.278'}
INFO - {'layer': 1, 'module': 'self_attn.o_proj', 'loss': '0.00495', 'damp': '0.01000', 'time': '1.196', 'fwd_time': '8.891'}
INFO - {'layer': 1, 'module': 'mlp.up_proj', 'loss': '9.03198', 'damp': '0.01000', 'time': '1.743', 'fwd_time': '10.079'}
INFO - {'layer': 1, 'module': 'mlp.gate_proj', 'loss': '21.71189', 'damp': '0.01000', 'time': '1.502', 'fwd_time': '10.079'}
INFO - {'layer': 1, 'module': 'mlp.down_proj', 'loss': '0.01888', 'damp': '0.01000', 'time': '6.962', 'fwd_time': '38.927'}
 Quantizing mlp.down_proj in layer 1 of 31 |███-------------------------------------| 0:03:19 / 0:35:22 [3/32] 9.4%
INFO - {'layer': 2, 'module': 'self_attn.k_proj', 'loss': '0.74365', 'damp': '0.01000', 'time': '1.170', 'fwd_time': '11.286'}
INFO - {'layer': 2, 'module': 'self_attn.v_proj', 'loss': '0.10587', 'damp': '0.01000', 'time': '1.056', 'fwd_time': '11.286'}
INFO - {'layer': 2, 'module': 'self_attn.q_proj', 'loss': '1.45657', 'damp': '0.01000', 'time': '1.056', 'fwd_time': '11.286'}
INFO - {'layer': 2, 'module': 'self_attn.o_proj', 'loss': '0.01072', 'damp': '0.01000', 'time': '1.169', 'fwd_time': '8.864'}
INFO - {'layer': 2, 'module': 'mlp.up_proj', 'loss': '13.38118', 'damp': '0.01000', 'time': '1.654', 'fwd_time': '10.117'}
INFO - {'layer': 2, 'module': 'mlp.gate_proj', 'loss': '33.63935', 'damp': '0.01000', 'time': '1.622', 'fwd_time': '10.117'}
INFO - {'layer': 2, 'module': 'mlp.down_proj', 'loss': '0.05232', 'damp': '0.01000', 'time': '6.951', 'fwd_time': '38.891'}
 Quantizing mlp.down_proj in layer 2 of 31 |█████-----------------------------------| 0:04:52 / 0:38:56 [4/32] 12.5%
INFO - {'layer': 3, 'module': 'self_attn.k_proj', 'loss': '0.97453', 'damp': '0.01000', 'time': '1.170', 'fwd_time': '11.299'}
INFO - {'layer': 3, 'module': 'self_attn.v_proj', 'loss': '0.18419', 'damp': '0.01000', 'time': '1.061', 'fwd_time': '11.299'}
INFO - {'layer': 3, 'module': 'self_attn.q_proj', 'loss': '2.11738', 'damp': '0.01000', 'time': '1.151', 'fwd_time': '11.299'}
INFO - {'layer': 3, 'module': 'self_attn.o_proj', 'loss': '0.04165', 'damp': '0.01000', 'time': '1.153', 'fwd_time': '8.890'}
INFO - {'layer': 3, 'module': 'mlp.up_proj', 'loss': '49.95827', 'damp': '0.01000', 'time': '1.650', 'fwd_time': '10.110'}
INFO - {'layer': 3, 'module': 'mlp.gate_proj', 'loss': '90.88012', 'damp': '0.01000', 'time': '1.481', 'fwd_time': '10.110'}
Losses sum item: nan
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 4
      1 from gptqmodel import BACKEND
      3 # increase `batch_size` to match gpu/vram specs to speed up quantization
----> 4 model.quantize(
      5     calib_dataset,
      6     batch_size=8,
      7     backend = BACKEND.TORCH,
      8     calibration_dataset_concat_size=2048
      9 )
     11 model.model.config._name_or_path = model_id
     12 model.save(f"./models/{model_id.split('/')[-1]}")

File [/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py:116], in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File [/opt/quant/lib/python3.11/site-packages/gptqmodel/models/base.py:800], in BaseGPTQModel.quantize(self, calibration_dataset, calibration_dataset_register, calibration_dataset_concat_size, batch_size, calibration_enable_gpu_cache, tokenizer, logger_board, backend, buffered_fwd, auto_gc, task_name, project_name)
    796     static_groups = self.quantize_config.dynamic_get(layer_name, "static_groups", static_groups)
    799 # logger.info(f"Quantizing module START: {name}, {gptq[name].shape()}")
--> 800 scale, zero, g_idx, duration, avg_loss, damp_percent = gptq[name].quantize(
    801     percdamp=damp_percent,
    802     group_size=group_size,
    803     actorder=desc_act,
    804     static_groups=static_groups,
    805 )
    806 if task is not None:
    807     task.get_logger().report_scalar(
    808         title='Quantization Loss',
    809         series=f'layer_{module_index}_loss',
    810         value=avg_loss,
    811         iteration=name_index,
    812     )

File [/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py:116], in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File [/opt/quant/lib/python3.11/site-packages/gptqmodel/quantization/gptq.py:285], in GPTQ.quantize(self, blocksize, percdamp, damp_auto_increment, group_size, actorder, static_groups)
    283 if math.isnan(avg_loss):
    284     print("Losses sum item:", torch.sum(Losses).item())
--> 285     raise ValueError("Quantization failed due to NaN loss")
    287 group_size = group_size if group_size != -1 else self.columns
    289 if static_groups and actorder:

ValueError: Quantization failed due to NaN loss

Apr 02 '25 10:04 it-dainb

@it-dainb Try setting calibration_dataset_concat_size to 0.

I also see abnormal loss values at layer 3; they should not be this high. Check your model's tokenizer and chat template config, if any.
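To see where the loss first starts to climb, the per-module log lines can be scanned automatically. The following is only a minimal sketch, assuming log lines in the "INFO - {...}" form shown in the issue above. The helper name find_abnormal_losses and the threshold of 10.0 are hypothetical choices for illustration, not part of GPTQModel, and the threshold is not an official cutoff.

```python
import ast
import math


def find_abnormal_losses(log_text, threshold=10.0):
    """Return (layer, module, loss) for log entries whose loss is
    non-finite or above `threshold` (an arbitrary, assumed cutoff)."""
    flagged = []
    for line in log_text.splitlines():
        start, end = line.find("{"), line.rfind("}")
        if start == -1 or end < start:
            continue  # not a per-module log line
        try:
            record = ast.literal_eval(line[start:end + 1])
        except (ValueError, SyntaxError):
            continue  # braces present, but not a Python dict literal
        if not isinstance(record, dict) or "loss" not in record:
            continue
        loss = float(record["loss"])  # losses are logged as strings
        if not math.isfinite(loss) or loss > threshold:
            flagged.append((record.get("layer"), record.get("module"), loss))
    return flagged


# Three lines copied from the layer 3 output in the issue.
sample_log = """\
INFO - {'layer': 3, 'module': 'self_attn.o_proj', 'loss': '0.04165', 'damp': '0.01000', 'time': '1.153', 'fwd_time': '8.890'}
INFO - {'layer': 3, 'module': 'mlp.up_proj', 'loss': '49.95827', 'damp': '0.01000', 'time': '1.650', 'fwd_time': '10.110'}
INFO - {'layer': 3, 'module': 'mlp.gate_proj', 'loss': '90.88012', 'damp': '0.01000', 'time': '1.481', 'fwd_time': '10.110'}
"""

for layer, module, loss in find_abnormal_losses(sample_log):
    print(f"layer {layer} {module}: loss={loss}")
```

Run against the full log, this would also flag mlp.up_proj and mlp.gate_proj in layer 2 and mlp.gate_proj in layer 1, which suggests the problem builds up over several layers before the NaN appears.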

Apr 02 '25 10:04 Qubitium

@it-dainb Did you resolve this error?

Apr 10 '25 16:04 Qubitium

@it-dainb Did you resolve this error?

I have the same problem, and the loss is far too large.

INFO --------------------------------------------------------------------------------------------------------------------------------------
INFO | process | layer | module           | loss                              | samples | damp    | time  | fwd_time | max_vram           |
INFO --------------------------------------------------------------------------------------------------------------------------------------
INFO | gptq    | 1     | self_attn.k_proj | 0.0000000035                      | 13011   | 0.05000 | 2.369 | 14.549   | 1190.92MB, 39.10MB |
INFO --------------------------------------------------------------------------------------------------------------------------------------
INFO | gptq    | 1     | self_attn.v_proj | 4564293510100748337152.0000000000 | 13011   | 0.05000 | 2.381 | 14.549   | 1190.90MB, 31.11MB |
INFO --------------------------------------------------------------------------------------------------------------------------------------
INFO | gptq    | 1     | self_attn.q_proj | 1.1023012079                      | 13011   | 0.05000 | 2.424 | 14.549   | 1192.90MB, 31.11MB |
INFO --------------------------------------------------------------------------------------------------------------------------------------
WARN Quantization: Module self_attn.o_proj -> Current damp_percent = 0.05000 is too low, auto-incrementing by 0.01000
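For comparing runs, it may help to turn this table-style output into records so the loss column can be inspected numerically. This is a rough sketch, assuming the column layout of the log above. The helper name parse_loss_table and the 1e6 cutoff in the usage lines are hypothetical and not part of GPTQModel. Splitting on the INFO/WARN prefixes lets it work whether each row is on its own line or the log was pasted as one long line.

```python
import re


def parse_loss_table(log_text):
    """Parse '| process | layer | module | loss | ...' rows into dicts,
    converting the loss column to float."""
    header = None
    rows = []
    # Split on log-level prefixes so flattened (single-line) logs also work.
    for segment in re.split(r"\b(?:INFO|WARN)\b", log_text):
        if "|" not in segment:
            continue  # separator lines and free-text warnings
        cells = [cell.strip() for cell in segment.strip().strip("|").split("|")]
        if cells and cells[0] == "process":
            header = cells  # remember column names from the header row
        elif header is not None and len(cells) == len(header):
            row = dict(zip(header, cells))
            row["loss"] = float(row["loss"])
            rows.append(row)
    return rows


# Rows copied from the log above (separator lines shortened).
sample_log = """\
INFO ----------
INFO | process | layer | module | loss | samples | damp | time | fwd_time | max_vram |
INFO ----------
INFO | gptq | 1 | self_attn.k_proj | 0.0000000035 | 13011 | 0.05000 | 2.369 | 14.549 | 1190.92MB, 39.10MB |
INFO | gptq | 1 | self_attn.v_proj | 4564293510100748337152.0000000000 | 13011 | 0.05000 | 2.381 | 14.549 | 1190.90MB, 31.11MB |
INFO | gptq | 1 | self_attn.q_proj | 1.1023012079 | 13011 | 0.05000 | 2.424 | 14.549 | 1192.90MB, 31.11MB |
WARN Quantization: Module self_attn.o_proj -> Current damp_percent = 0.05000 is too low, auto-incrementing by 0.01000
"""

rows = parse_loss_table(sample_log)
# 1e6 is an arbitrary illustrative cutoff, not a library default.
for row in rows:
    if row["loss"] > 1e6:
        print(f"layer {row['layer']} {row['module']}: loss={row['loss']:.3e}")
```

On the rows above, only self_attn.v_proj in layer 1 exceeds the cutoff, with a loss on the order of 1e21.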

Jun 25 '25 05:06 Juntongkuki