ktransformers

Fix OOM when converting GPU weights

Open · ovowei opened this pull request 1 month ago · 3 comments

Fixes #1635

ovowei avatar Nov 24 '25 12:11 ovowei

Summary of Changes

Hello @ovowei, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request directly addresses Out-Of-Memory (OOM) issues that can occur during the GPU weight conversion process, particularly when performing GPTQ quantization. It introduces robust memory management features, allowing users to define specific GPU and CPU memory limits, and provides clear instructions for optimizing these settings. The changes also ensure that the model loading process avoids unsupported disk offloading, thereby improving the stability and reliability of the weight conversion script.

Highlights

  • Memory Management for GPU Weight Conversion: Introduces new command-line arguments, --max_gpu_memory and --max_cpu_memory, to the convert_gpu_weights.py script, enabling users to precisely control memory allocation during GPTQ quantization and mitigate Out-Of-Memory (OOM) errors (see the sketch after this list).
  • Enhanced Documentation: Adds a comprehensive 'Memory Management' section to the README.md, providing clear guidance on using the new memory parameters, recommended settings for various GPU VRAM configurations, and practical troubleshooting tips for OOM issues.
  • Prevention of Disk Offloading: Implements explicit checks and configurations within convert_gpu_weights.py to detect and prevent unsupported disk offloading during device map inference and model loading, ensuring that models are loaded entirely into available GPU/CPU memory as required by llmcompressor.
  • Improved Argument Parsing Readability: Refactors the argument parsing in convert_gpu_weights.py to enhance readability and provides more descriptive help messages for various parameters, improving the user experience.
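
For concreteness, here is a minimal sketch of how flags like --max_gpu_memory and --max_cpu_memory can be turned into an accelerate memory budget and how disk offload can be rejected up front. This is not the actual convert_gpu_weights.py code: the parse_memory_args/build_device_map helpers, the default budget values, and the use of accelerate's infer_auto_device_map are illustrative assumptions.

```python
import argparse

import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM


def parse_memory_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="GPTQ weight conversion")
    parser.add_argument("--max_gpu_memory", type=str, default="20GiB",
                        help="Per-GPU memory budget, e.g. '20GiB'")
    parser.add_argument("--max_cpu_memory", type=str, default="700GiB",
                        help="CPU RAM budget for offloaded layers, e.g. '700GiB'")
    return parser.parse_args()


def build_device_map(model_id: str, args: argparse.Namespace) -> dict:
    # Budget every visible GPU plus CPU RAM; nothing is allowed to go to disk.
    max_memory = {i: args.max_gpu_memory for i in range(torch.cuda.device_count())}
    max_memory["cpu"] = args.max_cpu_memory

    # Build an empty (meta-device) model so device placement can be planned
    # without materializing the weights.
    config = AutoConfig.from_pretrained(model_id)
    with init_empty_weights():
        empty_model = AutoModelForCausalLM.from_config(config)

    device_map = infer_auto_device_map(empty_model, max_memory=max_memory)

    # llmcompressor cannot quantize disk-offloaded weights, so fail fast
    # instead of spilling to disk (or OOM-ing) later in the pipeline.
    if "disk" in device_map.values():
        raise RuntimeError(
            "Model does not fit in the given GPU/CPU budgets; "
            "raise --max_cpu_memory or run on a machine with more RAM."
        )
    return device_map
```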

gemini-code-assist[bot] avatar Nov 24 '25 12:11 gemini-code-assist[bot]

See the issue. The bug still remains.

KMSorSMS avatar Nov 25 '25 03:11 KMSorSMS

@CodeZ-Hao @KMSorSMS I’ve updated the script and verified on my machine (1 TB DRAM + an L20 GPU) that both --force_cpu on and --force_cpu off successfully complete the GLM-4.6 quantization. Could you try again with this PR and see if the issue persists?

ovowei avatar Nov 28 '25 08:11 ovowei

@ovowei I tested on my machine: with --max_gpu_memory 12GB I still hit CUDA out of memory, and with --force_cpu it reports insufficient memory. Could this be because my machine only has 384 GB of RAM? My configuration: single-socket Intel(R) Xeon(R) Platinum 8461V + 3090 24G + 384 GB RAM, CUDA 12.6.

CodeZ-Hao avatar Dec 01 '25 02:12 CodeZ-Hao

@ovowei I tested on my machine: with --max_gpu_memory 12GB I still hit CUDA out of memory, and with --force_cpu it reports insufficient memory. Could this be because my machine only has 384 GB of RAM? My configuration: single-socket Intel(R) Xeon(R) Platinum 8461V + 3090 24G + 384 GB RAM, CUDA 12.6.

I think so. We're considering adding a resume operation.

KMSorSMS avatar Dec 01 '25 03:12 KMSorSMS

@CodeZ-Hao @KMSorSMS The memory requirement comes from needing to hold the entire model in CPU RAM during quantization. For GLM-4.6, the full-precision model has about 357 B parameters, so at bf16 (2 bytes per parameter) you need roughly 357 × 2 ≈ 714 GB of available system memory to run the quantization pipeline successfully.
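
As a quick sanity check on that estimate (the 357 B parameter count is the figure quoted above):

```python
# bf16 stores each parameter in 2 bytes, so the weights alone need ~714 GB of
# CPU RAM; calibration activations and quantization buffers add more on top.
params = 357e9          # approximate GLM-4.6 parameter count
bytes_per_param = 2     # bf16
weights_gb = params * bytes_per_param / 1e9
print(f"bf16 weights alone: ~{weights_gb:.0f} GB")  # ~714 GB
```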

This script is based on llmcompressor — we simply call the quantization interfaces it provides. For more details on the underlying workflow, please refer to the official guide: https://docs.vllm.ai/projects/llm-compressor/en/latest/getting-started/compress
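
For reference, the core of that workflow looks roughly like the following. This is a sketch based on the linked guide, not the exact code in this repository; import paths and argument names can vary between llm-compressor releases, and the model id, calibration dataset, and output directory are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "zai-org/GLM-4.6"     # placeholder; point this at the local checkpoint
SAVE_DIR = "GLM-4.6-W4A16-GPTQ"  # placeholder output directory

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize all Linear layers to 4-bit weights, keeping lm_head in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",       # small calibration set used in the official examples
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```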

I think features like resume or disk offloading would need to be supported natively by llm-compressor. Because of this, resume support is not on our roadmap right now.

However, we will upload a pre-quantized GLM-4.6 GPTQ model to HuggingFace/ModelScope soon.

ovowei avatar Dec 01 '25 03:12 ovowei

OK, thanks @ovowei

CodeZ-Hao avatar Dec 01 '25 06:12 CodeZ-Hao