[BUG] [CPU Memory OOM] DeepSeek R1 gets OS oom-killed when packing model.layers
From my dmesg output, it is evident that the GPTQ Python process (PID 1179327) was killed by the kernel due to the system running out of memory (Out of Memory, OOM).
[659992.292163] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/default/dbfe5de8b17117d4ce1260b30ec75b84a6fb13e34205aa8e361b6e086648f779,task=python3,pid=1179327,uid=0
[659992.293749] Out of memory: Killed process 1179327 (python3) total-vm:2189763248kB, anon-rss:2081193264kB, file-rss:430540kB, shmem-rss:17288kB, UID:0 pgtables:4122204kB oom_score_adj:-998
[659992.468190] systemd[1]: [email protected]: Succeeded.
[659992.468649] systemd[1]: rdma-ndd.service: Main process exited, code=killed, status=9/KILL
[659992.478226] systemd[1]: rdma-ndd.service: Failed with result 'signal'.
[659992.487228] systemd[1]: [email protected]: Succeeded.
[659992.487563] systemd[1]: AssistDaemon.service: Main process exited, code=killed, status=9/KILL
[659992.497382] systemd[1]: AssistDaemon.service: Failed with result 'signal'.
[659992.506469] systemd[1]: pingmesh-lingjun-agent.service: Failed with result 'signal'.
[659992.516165] systemd[1]: systemd-logind.service: Service has no hold-off time (RestartSec=0), scheduling restart.
[659992.516548] systemd[1]: systemd-logind.service: Scheduled restart job, restart counter is at 7.
[659992.516556] systemd[1]: systemd-journald.service: Service has no hold-off time (RestartSec=0), scheduling restart.
[660026.497156] oom_reaper: reaped process 1179327 (python3), now anon-rss:0kB, file-rss:79868kB, shmem-rss:17288kB
GPU Info
NVIDIA H20
Software Info
Show output of:
pip show gptqmodel torch transformers accelerate triton
Name: gptqmodel
Version: 2.0.0.dev0
Summary: A LLM quantization package with user-friendly apis. Based on GPTQ algorithm.
Home-page: https://github.com/ModelCloud/GPTQModel
Author: ModelCloud
Author-email: [email protected]
License: Apache 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: accelerate, datasets, device-smi, hf_transfer, huggingface_hub, lm-eval, numpy, packaging, pillow, protobuf, safetensors, threadpoolctl, tokenicer, torch, transformers
Required-by:
---
Name: torch
Version: 2.5.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, setuptools, sympy, triton, typing-extensions
Required-by: accelerate, compressed-tensors, flashinfer-python, gptqmodel, lm_eval, outlines, peft, torchaudio, torchvision, vllm, xformers, xgrammar
---
Name: transformers
Version: 4.49.0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: compressed-tensors, gptqmodel, lm_eval, peft, tokenicer, vllm, xgrammar
---
Name: accelerate
Version: 1.4.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: [email protected]
License: Apache
Location: /usr/local/lib/python3.12/dist-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: gptqmodel, lm_eval, peft
---
Name: triton
Version: 3.1.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/triton-lang/triton/
Author: Philippe Tillet
Author-email: [email protected]
License:
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock
Required-by: torch
Model
DeepSeek-R1-BF16 from huggingface
To Reproduce
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(bits=8, group_size=128, desc_act=False)
model = GPTQModel.load(model_path, quant_config, device_map='auto', device="cuda",
                       trust_remote_code=True, low_cpu_mem_usage=True)
model.quantize(calibration_dataset, calibration_dataset_concat_size=1024, buffered_fwd=True, batch_size=2)
Expected behavior
Quantization and packing of all model.layers complete without the process being oom-killed.
Additional context
When I asked GPT-4o, its reply was as follows:
Possible reasons:
- Memory leak: Your Python program might have a memory leak, causing it to continuously consume memory during processing without releasing objects that are no longer needed.
- Handling large amounts of data: The program may be loading or processing a large amount of data; if the memory requirement exceeds the system's available memory, it will trigger an OOM (Out of Memory) kill.
- Concurrent operations: If you are dealing with multiple processes or threads, this can increase memory requirements.
Summary
OOM (Out of Memory) issues that lead to process termination due to insufficient memory are common problems, especially in tasks involving large datasets. It is recommended that you start by checking your code and optimizing memory usage, looking for potential memory leaks or improving the way memory is utilized.
@ShiningMaker We need the following:
- How much VRAM do you have?
- How much CPU RAM do you have?
- How much CPU swap do you have?
For DeepSeek V3/R1 BF16, you should have 1.5TB of CPU memory to avoid OOM.
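For context, a rough back-of-envelope estimate of why that much host memory is needed just to hold the BF16 weights; the 671B parameter count is DeepSeek-R1's published size, while the 20% headroom factor for activations and packing buffers is purely an assumption:

```python
# Back-of-envelope host memory estimate for DeepSeek-R1 in BF16.
params = 671e9              # published total parameter count
bytes_per_param = 2         # BF16 = 2 bytes per parameter
weights_tib = params * bytes_per_param / 1024**4
print(f"BF16 weights alone: ~{weights_tib:.2f} TiB")                   # ~1.22 TiB

headroom = 0.2              # assumed extra for activations / packing buffers, not measured
print(f"With ~20% headroom: ~{weights_tib * (1 + headroom):.2f} TiB")  # ~1.46 TiB
```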
@ShiningMaker I noticed the OOM process had 2TB of memory which includes mmap/disk memory. What is the max cpu memory in your vm or computer instance? 2TB should be more than enough even for DeepSeek R1.
[659992.293749] Out of memory: Killed process 1179327 (python3) total-vm:2189763248kB, anon-rss:2081193264kB, file-rss:430540kB, shmem-rss:17288kB, UID:0 pgtables:4122204kB oom_score_adj:-998
I restarted GPTQ to perform int8 quantization. While quantizing layer 11/60, I checked memory usage with free -h. My understanding is that 2TB of memory should be sufficient for R1's requirements. However, during the packing process there may also be an int8 copy of the model held on the CPU. The OOM occurred while I was packing layer 8/60.
total used free shared buff/cache available
Mem: 2.0Ti 1.6Ti 12Gi 24Mi 402Gi 403Gi
Swap: 0B 0B 0B
Are you saying that while packing, another, non-quant-related INT8 R1 model was loaded into memory by accident? Is that correct? That's bad news.
Also, I see that there are 400GB of buffer cache (disk cache). You can try to free those by calling the Linux CLI free before and during the packing code to see if you can release more memory. Linux does not release buffers as eagerly as macOS or other systems.
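For what it's worth, here is a minimal sketch of how one could watch host memory from inside the quantization script and (with root) ask the kernel to drop clean page cache between layers. Note that the free CLI only reports usage; actually releasing buff/cache goes through /proc/sys/vm/drop_caches. psutil and drop_caches are generic Linux facilities, not GPTQModel APIs:

```python
import psutil

GIB = 1024 ** 3

def log_host_memory(tag: str) -> None:
    """Print used / cached / available host memory, roughly like `free -h`."""
    vm = psutil.virtual_memory()
    print(f"[{tag}] used={vm.used / GIB:.0f}GiB "
          f"cached={getattr(vm, 'cached', 0) / GIB:.0f}GiB "
          f"available={vm.available / GIB:.0f}GiB")

def drop_page_cache() -> None:
    """Ask the kernel to drop clean page cache (and dentries/inodes); needs root."""
    try:
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")
    except PermissionError:
        print("need root to write /proc/sys/vm/drop_caches")

log_host_memory("before packing")
```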
No, no, no, what I mean is that the int8 model generated during the packing process coexists in memory with the original bf16 model (is this situation possible?). As for releasing memory, would explicitly calling gc.collect() in the packing code of gptqmodel work?
@ShiningMaker You can try calling torch_empty_cache from our utils; it will release both GPU and CPU memory and run gc at the same time.
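For anyone trying this, a minimal sketch of the suggestion above. The import path gptqmodel.utils.torch is an assumption on my part and may differ between versions:

```python
import gc
import torch
# Import path is an assumption; check where torch_empty_cache lives in your gptqmodel version.
from gptqmodel.utils.torch import torch_empty_cache

def release_memory() -> None:
    # torch_empty_cache is said to run gc and clear both GPU and CPU caches;
    # the explicit calls below are only a belt-and-braces fallback.
    torch_empty_cache()
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

release_memory()  # e.g. call between layers during quantization/packing
```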
@Qubitium I also encountered the same problem. I believe that after packing a layer, deleting the corresponding FP32 fake quantized weights and releasing CPU memory could help when packing large models like DeepSeek-V3. However, I'm not sure how to achieve this.
I encountered the same OOM problem when converting the DeepSeek-R1-BF16 weights to 4-bit using this script on an A800 (40GB) machine:
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
model_id = "/DeepSeek-R1-BF16"
quant_path = "/DeepSeek-R1-4bit"
calibration_dataset = load_dataset(
"allenai/c4", data_files="en/c4-train.00001-of-01024.json.gz",
split="train"
).select(range(1024))["text"]
quant_config = QuantizeConfig(bits=4, group_size=128) # quantization config
model = GPTQModel.load(model_id, quant_config) # load model
model.quantize(calibration_dataset, batch_size=2) # quantize
model.save(quant_path) # save model
error log:
INFO Kernel: loaded -> `[]`
INFO Kernel: Auto-selection: adding candidate `ExllamaQuantLinear`
Quantizing layer 0 of 60 [0 of 60] █-------------------------------------------------| 0:00:00 / 0:00:00 [1/61] 1.6%
Traceback (most recent call last):
File "/root/GPTQModel4.py", line 18, in <module>
model.quantize(calibration_dataset, batch_size=8)
File "/root/miniconda3/envs/gptq/lib/python3.12/site-packages/gptqmodel/models/base.py", line 445, in quantize
return module_looper.loop(
^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/gptq/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/gptq/lib/python3.12/site-packages/gptqmodel/looper/module_looper.py", line 317, in loop
module(*layer_input) if is_lm_head_module else module(*layer_input,
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/gptq/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/gptq/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/DeepSeek-R1-bf16/modeling_deepseek.py", line 1203, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/gptq/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/gptq/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/DeepSeek-R1-bf16/modeling_deepseek.py", line 817, in forward
torch.matmul(query_states, key_states.transpose(2, 3)) * self.softmax_scale
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 31.58 GiB. GPU 0 has a total capacity of 39.56 GiB of which 29.04 GiB is free. Including non-PyTorch memory, this process has 10.52 GiB memory in use. Of the allocated memory 9.86 GiB is allocated by PyTorch, and 172.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
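Note that this failure is a GPU OOM in the attention matmul of layer 0, not the host OOM from the original report. A hedged sketch of the usual mitigations, following the hint in the error message itself; the calibration size and batch size below are illustrative values, not recommendations:

```python
import os
# Set before torch initializes its CUDA allocator, per the hint in the error message.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "/DeepSeek-R1-BF16"
quant_path = "/DeepSeek-R1-4bit"

# Fewer calibration samples and batch_size=1 shrink the attention activations
# that triggered the 31.58 GiB allocation (values are illustrative).
calibration_dataset = load_dataset(
    "allenai/c4", data_files="en/c4-train.00001-of-01024.json.gz", split="train"
).select(range(256))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_dataset, batch_size=1)
model.save(quant_path)
```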
@Qubitium Hello, I used the dynamic method to leave the last 10 model.layers unquantized, and with that there are no more out-of-memory (OOM) issues. Now I want to build on this and generate a model where only the last 5 model.layers are left unquantized. Can I continue quantization from the previous (partially quantized) model, or do I have to quantize again from the first model.layer? If continuing is possible, what quantization parameters should I use?
I look forward to your reply! By the way, I want to try quantization using qqq. Is there a basic example of using GPTQModel to quantize a qqq-w4a8 model?
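For reference, the dynamic exclusion mentioned above might look roughly like this. The `-:` skip-prefix syntax and the regex are my assumptions from memory of the GPTQModel dynamic-override feature, and DeepSeek-R1 is assumed to have 61 decoder layers (indices 0-60); please check the official docs before relying on it:

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Hypothetical dynamic rule: skip quantization of the last 5 decoder layers (56-60).
# The "-:" prefix (exclude matching modules) and the regex are assumptions -- verify
# against the GPTQModel documentation for your version.
quant_config = QuantizeConfig(
    bits=8,
    group_size=128,
    desc_act=False,
    dynamic={r"-:model\.layers\.(5[6-9]|60)\..*": {}},
)
```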
@ShiningMaker Continuing, or restarting, quantization from a specific layer is technically doable but not implemented. It would be a great feature/PR to have.
For a quant restart to happen, whether due to restarting with a different config for later layers or due to OOM, we need to:
- save every module's captured hooked input. Calibration data is fed to the embedding and passed top-down to each layer/module as normal input/output. We need to save this to a local file, per module (a rough sketch of this capture step follows after this list).
- on model load, reload the "partial/full quantized progress" and restart/start at a given layer index.
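A rough sketch of what the capture step could look like; the helper name and file layout are hypothetical, this is not an implemented GPTQModel API:

```python
import os
import torch
import torch.nn as nn

def attach_input_capture(layer: nn.Module, layer_idx: int, out_dir: str):
    """Record every tensor input that reaches `layer` so a later run could
    restart quantization from this layer. Hypothetical helper, not GPTQModel API."""
    os.makedirs(out_dir, exist_ok=True)
    captured = []

    def pre_hook(module, args, kwargs):
        captured.append([a.detach().cpu() for a in args if torch.is_tensor(a)])
        return None  # do not modify the inputs

    handle = layer.register_forward_pre_hook(pre_hook, with_kwargs=True)

    def flush():
        torch.save(captured, os.path.join(out_dir, f"layer_{layer_idx}_inputs.pt"))
        handle.remove()

    return flush
```

After running the calibration forwards through a layer, calling the returned flush() would persist that layer's inputs so a later run could begin from its index instead of layer 0.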
Also, QQQ is experimental right now. Test it on a small model for now. The quality is currently not great vs GPTQ; not sure if that is a regression in how we implemented it.
@Qubitium same issue, any solution? I am using H20 and the DRAM is above 1.5TB.