transformers

gpt-oss-20b will not load with MXFP4 quantization even though the installed Triton version satisfies the requirement.

nickeisenberg opened this issue 1 month ago • 0 comments

System Info

$ hf env

Copy-and-paste the text below in your GitHub issue.

- huggingface_hub version: 0.36.0
- Platform: Linux-4.18.0-553.83.1.1toss.t4.x86_64-x86_64-with-glibc2.28
- Python version: 3.12.11
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: /g/g11/eisenbnt/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers: store
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.9.1
- Jinja2: 3.1.6
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: N/A
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 2.3.5
- pydantic: N/A
- aiohttp: 3.13.2
- hf_xet: 1.2.0
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /g/g11/eisenbnt/.cache/huggingface/hub
- HF_ASSETS_CACHE: /g/g11/eisenbnt/.cache/huggingface/assets
- HF_TOKEN_PATH: /g/g11/eisenbnt/.cache/huggingface/token
- HF_STORED_TOKENS_PATH: /g/g11/eisenbnt/.cache/huggingface/stored_tokens
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_DISABLE_XET: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
((dev) ) eisenbnt@matrix9:~
$ nvidia-smi
Mon Dec  8 17:00:32 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:4C:00.0 Off |                    0 |
| N/A   35C    P0             70W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
((dev) ) eisenbnt@matrix9:~
$
((dev) ) eisenbnt@matrix9:~
$ pip show triton
Name: triton
Version: 3.5.1
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/triton-lang/triton/
Author: Philippe Tillet
Author-email: [email protected]
License:
Location: /usr/workspace/eisenbnt/.venvman/envs/3.12/dev/lib64/python3.12/site-packages
Requires:
Required-by: torch
((dev) ) eisenbnt@matrix9:~
$
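
As a sanity check, the same facts can be confirmed from Python (a minimal sketch added for reference; the values match the transcripts above):

import triton
import torch

# Triton 3.5.1 per `pip show triton`, which satisfies the >= 3.4.0 CUDA requirement
print(triton.__version__)
# The H100 is visible per nvidia-smi
print(torch.cuda.is_available())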

Who can help?

@ArthurZucker @Cyrilvallez

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

>>> from transformers.models.gpt_oss.modeling_gpt_oss import GptOssForCausalLM
>>> def get_gpt_oss(device):
...     model = GptOssForCausalLM.from_pretrained(
...         "openai/gpt-oss-20b",
...     )
...     return model.to(device)
...
>>> model = get_gpt_oss("cuda:0")
MXFP4 quantization requires Triton and kernels installed: CUDA requires Triton >= 3.4.0, XPU requires Triton
>= 3.5.0, we will default to dequantizing the model to bf16
Loading checkpoint shards: 100%|███████████████████████████████████████████████| 3/3 [00:19<00:00,  6.56s/it]
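
The warning names both Triton and the kernels package, and Triton 3.5.1 is clearly present, so a plausible culprit is a missing kernels install. A quick diagnostic sketch (the package name is taken from the warning text, not confirmed here):

import importlib.util

# Triton is installed (pip show above reports 3.5.1)
print(importlib.util.find_spec("triton") is not None)
# If this prints False, the MXFP4 path falls back to bf16 even with a new-enough Triton
print(importlib.util.find_spec("kernels") is not None)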

Expected behavior

I expected the model to load with MXFP4 quantization, since the installed Triton 3.5.1 satisfies the stated >= 3.4.0 CUDA requirement; instead it is dequantized to bf16. Is there anything special I need to do to get this working? Thank you!
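
For reference, a possible fix sketch, assuming the missing piece is the kernels package and a transformers version that ships Mxfp4Config (both assumptions, not verified against this environment):

$ pip install kernels

from transformers import AutoModelForCausalLM, Mxfp4Config

# Explicitly request MXFP4 instead of accepting the bf16 dequantization fallback
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    quantization_config=Mxfp4Config(),
    device_map="cuda:0",
)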

nickeisenberg · Dec 09 '25