[BUG] [CPU Memory OOM] DeepSeek R1 gets OS oom-killed when packing model.layers
From my dmesg output, it is evident that the GPTQ Python process (PID 1179327) was killed by the kernel due to the system running out of memory (Out of Memory, OOM).
[659992.292163] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/default/dbfe5de8b17117d4ce1260b30ec75b84a6fb13e34205aa8e361b6e086648f779,task=python3,pid=1179327,uid=0
[659992.293749] Out of memory: Killed process 1179327 (python3) total-vm:2189763248kB, anon-rss:2081193264kB, file-rss:430540kB, shmem-rss:17288kB, UID:0 pgtables:4122204kB oom_score_adj:-998
[659992.468190] systemd[1]: [email protected]: Succeeded.
[659992.468649] systemd[1]: rdma-ndd.service: Main process exited, code=killed, status=9/KILL
[659992.478226] systemd[1]: rdma-ndd.service: Failed with result 'signal'.
[659992.487228] systemd[1]: [email protected]: Succeeded.
[659992.487563] systemd[1]: AssistDaemon.service: Main process exited, code=killed, status=9/KILL
[659992.497382] systemd[1]: AssistDaemon.service: Failed with result 'signal'.
[659992.506469] systemd[1]: pingmesh-lingjun-agent.service: Failed with result 'signal'.
[659992.516165] systemd[1]: systemd-logind.service: Service has no hold-off time (RestartSec=0), scheduling restart.
[659992.516548] systemd[1]: systemd-logind.service: Scheduled restart job, restart counter is at 7.
[659992.516556] systemd[1]: systemd-journald.service: Service has no hold-off time (RestartSec=0), scheduling restart.
[660026.497156] oom_reaper: reaped process 1179327 (python3), now anon-rss:0kB, file-rss:79868kB, shmem-rss:17288kB
GPU Info
NVIDIA H20
Software Info
Show output of:
pip show gptqmodel torch transformers accelerate triton
Name: gptqmodel
Version: 2.0.0.dev0
Summary: A LLM quantization package with user-friendly apis. Based on GPTQ algorithm.
Home-page: https://github.com/ModelCloud/GPTQModel
Author: ModelCloud
Author-email: [email protected]
License: Apache 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: accelerate, datasets, device-smi, hf_transfer, huggingface_hub, lm-eval, numpy, packaging, pillow, protobuf, safetensors, threadpoolctl, tokenicer, torch, transformers
Required-by:
---
Name: torch
Version: 2.5.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, setuptools, sympy, triton, typing-extensions
Required-by: accelerate, compressed-tensors, flashinfer-python, gptqmodel, lm_eval, outlines, peft, torchaudio, torchvision, vllm, xformers, xgrammar
---
Name: transformers
Version: 4.49.0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: compressed-tensors, gptqmodel, lm_eval, peft, tokenicer, vllm, xgrammar
---
Name: accelerate
Version: 1.4.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: [email protected]
License: Apache
Location: /usr/local/lib/python3.12/dist-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: gptqmodel, lm_eval, peft
---
Name: triton
Version: 3.1.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/triton-lang/triton/
Author: Philippe Tillet
Author-email: [email protected]
License:
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock
Required-by: torch
Model
DeepSeek-R1-BF16 from huggingface
To Reproduce
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(bits=8, group_size=128, desc_act=False)
model = GPTQModel.load(model_path, quant_config, device_map='auto', device="cuda",
                       trust_remote_code=True, low_cpu_mem_usage=True)
model.quantize(calibration_dataset, calibration_dataset_concat_size=1024, buffered_fwd=True, batch_size=2)
Expected behavior
Quantization and packing of all model.layers complete without the process being oom-killed.
Additional context
When I asked GPT-4o, its reply was as follows:
Possible reasons:
- Memory leak: Your Python program might have a memory leak, causing it to continuously consume memory during processing without releasing objects that are no longer needed.
- Handling large amounts of data: The program may be loading or processing a large amount of data; if the memory requirement exceeds the system's available memory, it will trigger an OOM (Out of Memory) kill.
- Concurrent operations: If you are dealing with multiple processes or threads, this can increase memory requirements.
Summary
OOM (Out of Memory) issues that lead to process termination due to insufficient memory are common problems, especially in tasks involving large datasets. It is recommended that you start by checking your code and optimizing memory usage, looking for potential memory leaks or improving the way memory is utilized.
@ShiningMaker We need the following:
- How much VRAM do you have?
- How much CPU RAM do you have?
- How much CPU swap do you have?
For DeepSeek V3/R1 BF16, you should have 1.5TB of CPU memory to avoid OOM.
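For context, a rough back-of-envelope estimate of why that much host memory is needed just to hold the BF16 weights; the 671B parameter count is DeepSeek-R1's published size, while the 20% headroom factor for activations and packing buffers is purely an assumption:

```python
# Back-of-envelope host memory estimate for DeepSeek-R1 in BF16.
params = 671e9              # published total parameter count
bytes_per_param = 2         # BF16 = 2 bytes per parameter
weights_tib = params * bytes_per_param / 1024**4
print(f"BF16 weights alone: ~{weights_tib:.2f} TiB")                   # ~1.22 TiB

headroom = 0.2              # assumed extra for activations / packing buffers, not measured
print(f"With ~20% headroom: ~{weights_tib * (1 + headroom):.2f} TiB")  # ~1.46 TiB
```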
@ShiningMaker I noticed the OOM process had 2TB of memory which includes mmap/disk memory. What is the max cpu memory in your vm or computer instance? 2TB should be more than enough even for DeepSeek R1.
[659992.293749] Out of memory: Killed process 1179327 (python3) total-vm:2189763248kB, anon-rss:2081193264kB, file-rss:430540kB, shmem-rss:17288kB, UID:0 pgtables:4122204kB oom_score_adj:-998
I restarted GPTQ to perform int8 quantization. While quantizing layer 11/60, I checked memory usage with free -h. My understanding is that 2TB of memory should be sufficient for R1's requirements. However, during the packing process there may also be an int8 copy of the model held on the CPU. The OOM occurred while I was packing layer 8/60.
total used free shared buff/cache available
Mem: 2.0Ti 1.6Ti 12Gi 24Mi 402Gi 403Gi
Swap: 0B 0B 0B
Are you saying that while packing, another, non-quant-related INT8 R1 model was loaded into memory by accident? Is that correct? That's bad news.
Also, I see that there are 400GB of buffer cache (disk cache). You can try to free those by calling the Linux CLI free before and during the packing code to see if you can release more memory. Linux does not release buffers as eagerly as macOS or other systems.
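For what it's worth, here is a minimal sketch of how one could watch host memory from inside the quantization script and (with root) ask the kernel to drop clean page cache between layers. Note that the free CLI only reports usage; actually releasing buff/cache goes through /proc/sys/vm/drop_caches. psutil and drop_caches are generic Linux facilities, not GPTQModel APIs:

```python
import psutil

GIB = 1024 ** 3

def log_host_memory(tag: str) -> None:
    """Print used / cached / available host memory, roughly like `free -h`."""
    vm = psutil.virtual_memory()
    print(f"[{tag}] used={vm.used / GIB:.0f}GiB "
          f"cached={getattr(vm, 'cached', 0) / GIB:.0f}GiB "
          f"available={vm.available / GIB:.0f}GiB")

def drop_page_cache() -> None:
    """Ask the kernel to drop clean page cache (and dentries/inodes); needs root."""
    try:
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")
    except PermissionError:
        print("need root to write /proc/sys/vm/drop_caches")

log_host_memory("before packing")
```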
No, no, no, what I mean is that the int8 model generated during the packing process coexists in memory with the original bf16 model (is this situation possible?). As for releasing memory, would explicitly calling gc.collect() in the packing code of gptqmodel work?
@ShiningMaker You can try calling torch_empty_cache from our utils; it will release both GPU and CPU memory and run gc at the same time.
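For anyone trying this, a minimal sketch of the suggestion above. The import path gptqmodel.utils.torch is an assumption on my part and may differ between versions:

```python
import gc
import torch
# Import path is an assumption; check where torch_empty_cache lives in your gptqmodel version.
from gptqmodel.utils.torch import torch_empty_cache

def release_memory() -> None:
    # torch_empty_cache is said to run gc and clear both GPU and CPU caches;
    # the explicit calls below are only a belt-and-braces fallback.
    torch_empty_cache()
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

release_memory()  # e.g. call between layers during quantization/packing
```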
@Qubitium I also encountered the same problem. I believe that after packing a layer, deleting the corresponding FP32 fake quantized weights and releasing CPU memory could help when packing large models like DeepSeek-V3. However, I'm not sure how to achieve this.
I encountered the same OOM problem when converting the DeepSeek-R1-BF16 weights to 4-bit using this script on an A800 (40GB) machine:
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
model_id = "/DeepSeek-R1-BF16"
quant_path = "/DeepSeek-R1-4bit"
calibration_dataset = load_dataset(
"allenai/c4", data_files="en/c4-train.00001-of-01024.json.gz",
split="train"
).select(range(1024))["text"]
quant_config = QuantizeConfig(bits=4, group_size=128) # quantization config
model = GPTQModel.load(model_id, quant_config) # load model
model.quantize(calibration_dataset, batch_size=2) # quantize
model.save(quant_path) # save model
error log:
INFO Kernel: loaded -> `[]`
INFO Kernel: Auto-selection: adding candidate `ExllamaQuantLinear`
Quantizing layer 0 of 60 [0 of 60] █-------------------------------------------------| 0:00:00 / 0:00:00 [1/61] 1.6%
Traceback (most recent call last):
File "/root/GPTQModel4.py", line 18, in <module>
model.quantize(calibration_dataset, batch_size=8)
File "/root/miniconda3/envs/gptq/lib/python3.12/site-packages/gptqmodel/models/base.py", line 445, in quantize
return module_looper.loop(
^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/gptq/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/gptq/lib/python3.12/site-packages/gptqmodel/looper/module_looper.py", line 317, in loop
module(*layer_input) if is_lm_head_module else module(*layer_input,
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/gptq/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/gptq/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/DeepSeek-R1-bf16/modeling_deepseek.py", line 1203, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/gptq/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/gptq/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/DeepSeek-R1-bf16/modeling_deepseek.py", line 817, in forward
torch.matmul(query_states, key_states.transpose(2, 3)) * self.softmax_scale
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 31.58 GiB. GPU 0 has a total capacity of 39.56 GiB of which 29.04 GiB is free. Including non-PyTorch memory, this process has 10.52 GiB memory in use. Of the allocated memory 9.86 GiB is allocated by PyTorch, and 172.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
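Note that this failure is a GPU OOM in the attention matmul of layer 0, not the host OOM from the original report. A hedged sketch of the usual mitigations, following the hint in the error message itself; the calibration size and batch size below are illustrative values, not recommendations:

```python
import os
# Set before torch initializes its CUDA allocator, per the hint in the error message.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "/DeepSeek-R1-BF16"
quant_path = "/DeepSeek-R1-4bit"

# Fewer calibration samples and batch_size=1 shrink the attention activations
# that triggered the 31.58 GiB allocation (values are illustrative).
calibration_dataset = load_dataset(
    "allenai/c4", data_files="en/c4-train.00001-of-01024.json.gz", split="train"
).select(range(256))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_dataset, batch_size=1)
model.save(quant_path)
```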
@Qubitium Hello, I used the dynamic method to leave the last 10 model.layers unquantized, and with that there are no more out-of-memory (OOM) issues. Now I want to build on this and generate a model where only the last 5 model.layers are left unquantized. Can I continue quantization from the previous (partially quantized) model, or do I have to quantize again from the first model.layer? If continuing is possible, what quantization parameters should I use?
I look forward to your reply! By the way, I want to try quantization using qqq. Is there a basic example of using GPTQModel to quantize a qqq-w4a8 model?
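For reference, the dynamic exclusion mentioned above might look roughly like this. The `-:` skip-prefix syntax and the regex are my assumptions from memory of the GPTQModel dynamic-override feature, and DeepSeek-R1 is assumed to have 61 decoder layers (indices 0-60); please check the official docs before relying on it:

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Hypothetical dynamic rule: skip quantization of the last 5 decoder layers (56-60).
# The "-:" prefix (exclude matching modules) and the regex are assumptions -- verify
# against the GPTQModel documentation for your version.
quant_config = QuantizeConfig(
    bits=8,
    group_size=128,
    desc_act=False,
    dynamic={r"-:model\.layers\.(5[6-9]|60)\..*": {}},
)
```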
@ShiningMaker Continuing, or restarting, quantization from a specific layer is technically doable but not implemented. It would be a great feature/PR to have.
For a quant restart to happen, whether due to restarting with a different config for later layers or due to OOM, we need to:
- save every module's captured hooked input. Calibration data is fed to the embedding and passed top-down to each layer/module as normal input/output. We need to save this to a local file, per module (a rough sketch of this capture step follows after this list).
- on model load, reload the "partial/full quantized progress" and restart/start at a given layer index.
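A rough sketch of what the capture step could look like; the helper name and file layout are hypothetical, this is not an implemented GPTQModel API:

```python
import os
import torch
import torch.nn as nn

def attach_input_capture(layer: nn.Module, layer_idx: int, out_dir: str):
    """Record every tensor input that reaches `layer` so a later run could
    restart quantization from this layer. Hypothetical helper, not GPTQModel API."""
    os.makedirs(out_dir, exist_ok=True)
    captured = []

    def pre_hook(module, args, kwargs):
        captured.append([a.detach().cpu() for a in args if torch.is_tensor(a)])
        return None  # do not modify the inputs

    handle = layer.register_forward_pre_hook(pre_hook, with_kwargs=True)

    def flush():
        torch.save(captured, os.path.join(out_dir, f"layer_{layer_idx}_inputs.pt"))
        handle.remove()

    return flush
```

After running the calibration forwards through a layer, calling the returned flush() would persist that layer's inputs so a later run could begin from its index instead of layer 0.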
Also, QQQ is experimental right now. Test it on a small model for now. The quality is currently not great vs GPTQ; not sure if that is a regression in how we implemented it.
@Qubitium same issue, any solution? I am using H20 and the DRAM is above 1.5TB.