[BUG] ValueError: Quantization: Failed due to NaN loss
Describe the bug
Quantization fails with a ValueError due to NaN loss when quantizing an 8B model, while the same process succeeds for a 1B model.
GPU Info
nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:F5:00.0 Off | N/A |
| 0% 31C P8 17W / 350W | 13005MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
To Reproduce
from gptqmodel import QuantizeConfig, GPTQModel
from gptqmodel.quantization import FORMAT
from gptqmodel import BACKEND
import torch
from datasets import load_dataset
model_id = "sail/Sailor2-8B"
quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    format=FORMAT.GPTQ,
)
model = GPTQModel.load(
    model_id,
    torch_dtype=torch.bfloat16,
    quantize_config=quant_config,
    trust_remote_code=True,
    device_map="auto"
)
def get_calib(tokenizer, nsamples, seqlen):
    # keep only samples whose raw text is at least `seqlen` characters long
    traindata = load_dataset("itdainb/calibrate_vn", split="train").filter(
        lambda x: len(x["text"]) >= seqlen)
    # tokenize the first `nsamples` examples (no truncation applied here)
    return [tokenizer(example["text"]) for example in traindata.select(range(nsamples))]
calib_dataset = get_calib(model.tokenizer, 512, 2048)
model.quantize(
    calib_dataset,
    batch_size=8,
    backend=BACKEND.TRITON,
    calibration_dataset_concat_size=2048
)
model.model.config._name_or_path = model_id
model.save(f"./models/{model_id.split('/')[-1]}")
Model/Datasets
Model: sail/Sailor2-8B
Dataset: itdainb/calibrate_vn
Additional context
|█---------------------------------------| 0:00:00 / 0:00:00 [1/32] 3.1%
INFO - {'layer': 0, 'module': 'self_attn.k_proj', 'loss': '0.21572', 'damp': '0.01000', 'time': '1.411', 'fwd_time': '13.106'}
INFO - {'layer': 0, 'module': 'self_attn.v_proj', 'loss': '0.04058', 'damp': '0.01000', 'time': '1.784', 'fwd_time': '13.106'}
INFO - {'layer': 0, 'module': 'self_attn.q_proj', 'loss': '0.52667', 'damp': '0.01000', 'time': '1.198', 'fwd_time': '13.106'}
INFO - {'layer': 0, 'module': 'self_attn.o_proj', 'loss': '0.03305', 'damp': '0.01000', 'time': '1.175', 'fwd_time': '11.385'}
INFO - {'layer': 0, 'module': 'mlp.up_proj', 'loss': '0.80327', 'damp': '0.01000', 'time': '1.664', 'fwd_time': '12.352'}
INFO - {'layer': 0, 'module': 'mlp.gate_proj', 'loss': '1.07150', 'damp': '0.01000', 'time': '1.518', 'fwd_time': '12.352'}
INFO - {'layer': 0, 'module': 'mlp.down_proj', 'loss': '0.00920', 'damp': '0.01000', 'time': '6.979', 'fwd_time': '40.708'}
Quantizing mlp.down_proj in layer 0 of 31 |██--------------------------------------| 0:01:45 / 0:28:00 [2/32] 6.2%
INFO - {'layer': 1, 'module': 'self_attn.k_proj', 'loss': '0.10450', 'damp': '0.01000', 'time': '1.169', 'fwd_time': '11.278'}
INFO - {'layer': 1, 'module': 'self_attn.v_proj', 'loss': '0.02623', 'damp': '0.01000', 'time': '0.995', 'fwd_time': '11.278'}
INFO - {'layer': 1, 'module': 'self_attn.q_proj', 'loss': '0.38751', 'damp': '0.01000', 'time': '1.066', 'fwd_time': '11.278'}
INFO - {'layer': 1, 'module': 'self_attn.o_proj', 'loss': '0.00495', 'damp': '0.01000', 'time': '1.196', 'fwd_time': '8.891'}
INFO - {'layer': 1, 'module': 'mlp.up_proj', 'loss': '9.03198', 'damp': '0.01000', 'time': '1.743', 'fwd_time': '10.079'}
INFO - {'layer': 1, 'module': 'mlp.gate_proj', 'loss': '21.71189', 'damp': '0.01000', 'time': '1.502', 'fwd_time': '10.079'}
INFO - {'layer': 1, 'module': 'mlp.down_proj', 'loss': '0.01888', 'damp': '0.01000', 'time': '6.962', 'fwd_time': '38.927'}
Quantizing mlp.down_proj in layer 1 of 31 |███-------------------------------------| 0:03:19 / 0:35:22 [3/32] 9.4%
INFO - {'layer': 2, 'module': 'self_attn.k_proj', 'loss': '0.74365', 'damp': '0.01000', 'time': '1.170', 'fwd_time': '11.286'}
INFO - {'layer': 2, 'module': 'self_attn.v_proj', 'loss': '0.10587', 'damp': '0.01000', 'time': '1.056', 'fwd_time': '11.286'}
INFO - {'layer': 2, 'module': 'self_attn.q_proj', 'loss': '1.45657', 'damp': '0.01000', 'time': '1.056', 'fwd_time': '11.286'}
INFO - {'layer': 2, 'module': 'self_attn.o_proj', 'loss': '0.01072', 'damp': '0.01000', 'time': '1.169', 'fwd_time': '8.864'}
INFO - {'layer': 2, 'module': 'mlp.up_proj', 'loss': '13.38118', 'damp': '0.01000', 'time': '1.654', 'fwd_time': '10.117'}
INFO - {'layer': 2, 'module': 'mlp.gate_proj', 'loss': '33.63935', 'damp': '0.01000', 'time': '1.622', 'fwd_time': '10.117'}
INFO - {'layer': 2, 'module': 'mlp.down_proj', 'loss': '0.05232', 'damp': '0.01000', 'time': '6.951', 'fwd_time': '38.891'}
Quantizing mlp.down_proj in layer 2 of 31 |█████-----------------------------------| 0:04:52 / 0:38:56 [4/32] 12.5%
INFO - {'layer': 3, 'module': 'self_attn.k_proj', 'loss': '0.97453', 'damp': '0.01000', 'time': '1.170', 'fwd_time': '11.299'}
INFO - {'layer': 3, 'module': 'self_attn.v_proj', 'loss': '0.18419', 'damp': '0.01000', 'time': '1.061', 'fwd_time': '11.299'}
INFO - {'layer': 3, 'module': 'self_attn.q_proj', 'loss': '2.11738', 'damp': '0.01000', 'time': '1.151', 'fwd_time': '11.299'}
INFO - {'layer': 3, 'module': 'self_attn.o_proj', 'loss': '0.04165', 'damp': '0.01000', 'time': '1.153', 'fwd_time': '8.890'}
INFO - {'layer': 3, 'module': 'mlp.up_proj', 'loss': '49.95827', 'damp': '0.01000', 'time': '1.650', 'fwd_time': '10.110'}
INFO - {'layer': 3, 'module': 'mlp.gate_proj', 'loss': '90.88012', 'damp': '0.01000', 'time': '1.481', 'fwd_time': '10.110'}
Losses sum item: nan
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[5], line 4
1 from gptqmodel import BACKEND
3 # increase `batch_size` to match gpu/vram specs to speed up quantization
----> 4 model.quantize(
5 calib_dataset,
6 batch_size=8,
7 backend = BACKEND.TORCH,
8 calibration_dataset_concat_size=2048
9 )
11 model.model.config._name_or_path = model_id
12 model.save(f"./models/{model_id.split('/')[-1]}")
File [/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py:116], in context_decorator.<locals>.decorate_context(*args, **kwargs)
113 @functools.wraps(func)
114 def decorate_context(*args, **kwargs):
115 with ctx_factory():
--> 116 return func(*args, **kwargs)
File [/opt/quant/lib/python3.11/site-packages/gptqmodel/models/base.py:800], in BaseGPTQModel.quantize(self, calibration_dataset, calibration_dataset_register, calibration_dataset_concat_size, batch_size, calibration_enable_gpu_cache, tokenizer, logger_board, backend, buffered_fwd, auto_gc, task_name, project_name)
796 static_groups = self.quantize_config.dynamic_get(layer_name, "static_groups", static_groups)
799 # logger.info(f"Quantizing module START: {name}, {gptq[name].shape()}")
--> 800 scale, zero, g_idx, duration, avg_loss, damp_percent = gptq[name].quantize(
801 percdamp=damp_percent,
802 group_size=group_size,
803 actorder=desc_act,
804 static_groups=static_groups,
805 )
806 if task is not None:
807 task.get_logger().report_scalar(
808 title='Quantization Loss',
809 series=f'layer_{module_index}_loss',
810 value=avg_loss,
811 iteration=name_index,
812 )
File [/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py:116], in context_decorator.<locals>.decorate_context(*args, **kwargs)
113 @functools.wraps(func)
114 def decorate_context(*args, **kwargs):
115 with ctx_factory():
--> 116 return func(*args, **kwargs)
File [/opt/quant/lib/python3.11/site-packages/gptqmodel/quantization/gptq.py:285], in GPTQ.quantize(self, blocksize, percdamp, damp_auto_increment, group_size, actorder, static_groups)
283 if math.isnan(avg_loss):
284 print("Losses sum item:", torch.sum(Losses).item())
--> 285 raise ValueError("Quantization failed due to NaN loss")
287 group_size = group_size if group_size != -1 else self.columns
289 if static_groups and actorder:
ValueError: Quantization failed due to NaN loss
@it-dainb Try setting calibration_dataset_concat_size to 0
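For reference, a minimal sketch of that suggestion applied to the reproduction above (same quantize call, only calibration_dataset_concat_size changed):

# sketch of the suggested change: keep the reproduction's quantize call but
# disable concatenation of calibration samples
model.quantize(
    calib_dataset,
    batch_size=8,
    backend=BACKEND.TRITON,
    calibration_dataset_concat_size=0,
)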
I also see abnormal loss values at layer 3; they should not be this high. Check your model's tokenizer and chat template configuration, if any.
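A quick way to sanity-check the tokenizer and chat template mentioned above; this is only a sketch, assuming a standard Hugging Face tokenizer and the calib_dataset built in the reproduction:

# sketch: inspect the tokenizer and chat template used for calibration
# (assumes a standard Hugging Face tokenizer; chat_template may be None)
tok = model.tokenizer
print(type(tok).__name__, "vocab size:", tok.vocab_size)
print("chat_template:", getattr(tok, "chat_template", None))

# round-trip one calibration sample to spot obviously broken tokenization
sample = calib_dataset[0]
print(tok.decode(sample["input_ids"][:64]))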
@it-dainb Did you resolve this error?
I also have the same problem, and the loss is far too large:
INFO | process | layer | module           | loss                              | samples | damp    | time  | fwd_time | max_vram           |
INFO |---------|-------|------------------|-----------------------------------|---------|---------|-------|----------|--------------------|
INFO | gptq    | 1     | self_attn.k_proj | 0.0000000035                      | 13011   | 0.05000 | 2.369 | 14.549   | 1190.92MB, 39.10MB |
INFO | gptq    | 1     | self_attn.v_proj | 4564293510100748337152.0000000000 | 13011   | 0.05000 | 2.381 | 14.549   | 1190.90MB, 31.11MB |
INFO | gptq    | 1     | self_attn.q_proj | 1.1023012079                      | 13011   | 0.05000 | 2.424 | 14.549   | 1192.90MB, 31.11MB |
WARN Quantization: Module self_attn.o_proj -> Current damp_percent = 0.05000 is too low, auto-incrementing by 0.01000