
[Feature][Quantization] MXFP4 support for MOE models

Open fxmarty-amd opened this pull request 6 months ago • 13 comments

This PR follows https://github.com/vllm-project/vllm/pull/16943 and adds support for loading MoE models that use MXFP4 weights, with dynamic per-group MXFP4 quantization of activations.

We have not yet released such models publicly, but expect to do so soon.

At the moment, execution on MI300 uses a simulated scheme: weights are dequantized on the fly, and activations go through a quantize-dequantize (QDQ) round trip, both implemented with HIP kernels.
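For reference, a minimal sketch of such a per-group MXFP4 quantize-dequantize (QDQ) is shown below, in plain PyTorch rather than the actual HIP kernels. It assumes the OCP MX layout (groups of 32 values, a power-of-two E8M0 shared scale per group, FP4 E2M1 elements) and uses deliberately naive rounding; the helper name is hypothetical.

```python
import torch

# Representable magnitudes of the FP4 E2M1 element format.
_FP4_E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def mxfp4_qdq(x: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Quantize-dequantize `x` per group of `group_size` values (numel must
    be divisible by group_size in this simplified sketch)."""
    orig_shape = x.shape
    g = x.reshape(-1, group_size).float()

    # E8M0-style shared scale: a power of two chosen so the group's absolute
    # maximum maps near the top of the E2M1 range (max magnitude 6 = 1.5 * 2^2).
    amax = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = torch.exp2(torch.floor(torch.log2(amax)) - 2)

    # Quantize: scale down, then snap each value to the nearest E2M1 grid point.
    scaled = (g / scale).clamp(-6.0, 6.0)
    grid = _FP4_E2M1_GRID.to(scaled.device)
    idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    q = grid[idx] * scaled.sign()

    # Dequantize back ("QDQ"), returning a tensor in the original dtype.
    return (q * scale).reshape(orig_shape).to(x.dtype)


x = torch.randn(4, 64, dtype=torch.bfloat16)
print((x - mxfp4_qdq(x)).abs().max())  # error introduced by the QDQ round trip
```

In the PR itself, this kind of QDQ is applied to activations at runtime, while the MXFP4 weights are dequantized on the fly before the GEMM.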

Left to do:

  • [x] Add test.
  • [x] Add documentation.
  • [ ] Implement the code path for real mxfp4 * mxfp4 GEMM (possibly in another PR)
  • [x] Validate sensible eval results for DeepSeek R1, Llama 4, and Llama 405B

fxmarty-amd avatar May 09 '25 07:05 fxmarty-amd

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar May 09 '25 07:05 github-actions[bot]

Can you merge from main to fix pre-commit?

DarkLight1337 avatar May 09 '25 08:05 DarkLight1337

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @fxmarty-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar May 13 '25 11:05 mergify[bot]

@mgoin ptal when you get a chance, thanks!

BowenBao avatar May 13 '25 21:05 BowenBao

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @fxmarty-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar May 14 '25 10:05 mergify[bot]

@DarkLight1337 @mgoin WDYT? We expect to add a CDNA4-native execution path that runs the GEMMs in mxfp4 in a follow-up PR (implementing https://github.com/fxmarty-amd/vllm/blob/d47af2366b6f73cf55b50299fd534f2210c3c90f/vllm/model_executor/layers/quantization/quark/schemes/quark_w4a4_mxfp4.py#L100-L101).
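To make the planned split concrete, here is a rough dispatch sketch; the function name, arguments, and flag below are illustrative assumptions, not the actual code in quark_w4a4_mxfp4.py.

```python
import torch


def mxfp4_linear(x: torch.Tensor, w_dequantized: torch.Tensor,
                 use_native_mxfp4_gemm: bool = False) -> torch.Tensor:
    """Hypothetical dispatch between the current emulated path and a
    future native MXFP4 x MXFP4 GEMM path (e.g. on CDNA4)."""
    if use_native_mxfp4_gemm:
        # Follow-up PR: call a hardware MXFP4 GEMM kernel directly on the
        # packed weights instead of dequantizing them.
        raise NotImplementedError("native MXFP4 GEMM path not available yet")
    # This PR: weights are dequantized on the fly (w_dequantized stands in
    # for that result) and activations are QDQ'd before a regular GEMM.
    return x @ w_dequantized.t()
```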

fxmarty-amd avatar May 19 '25 14:05 fxmarty-amd

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @fxmarty-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar May 23 '25 09:05 mergify[bot]

Does this sound good, @mgoin? Tests are passing on https://github.com/vllm-project/vllm/pull/17888/commits/efe7c3ccb84fbe8239c2b22bbf4669abba0fb2bc

fxmarty-amd avatar May 27 '25 14:05 fxmarty-amd

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @fxmarty-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Jun 03 '25 18:06 mergify[bot]

Hi @mgoin, I reran pytest tests/quantization/test_quark.py -s -vvvvv on this branch with main merged today, and it looks good. Let me know if this PR needs any more updates :)

AMD has started releasing OCP MXFP4 models publicly: https://huggingface.co/collections/amd/quark-quantized-mxfp4-models-68068f8c965d9267a996616d
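For anyone who wants to try those checkpoints, a usage sketch with vLLM's offline API is below. The model id is a placeholder for one of the checkpoints in the linked collection, and I am assuming the Quark/MXFP4 quantization config is picked up from the checkpoint automatically.

```python
from vllm import LLM, SamplingParams

# Placeholder id: substitute any MXFP4 checkpoint from the collection above.
llm = LLM(model="amd/<quark-mxfp4-model>")
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What is MXFP4 quantization?"], params)
print(outputs[0].outputs[0].text)
```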

fxmarty-amd avatar Jun 16 '25 11:06 fxmarty-amd

Thanks for the ping, I'll take a look

mgoin avatar Jun 16 '25 22:06 mgoin

It looks like the kernel MoE test failure is related to this PR:

[2025-06-16T12:09:58Z] FAILED kernels/moe/test_moe.py::test_fused_moe[False-dtype1-4-6-64-1024-2048-131072] - TypeError: modular_triton_fused_moe() missing 1 required positional argument: 'use_mxfp4_w4a4'
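A self-contained illustration of this failure mode (not the real vLLM signature): the PR adds a new required parameter, so existing call sites in the kernel tests either need to pass the new flag or the parameter needs a default value.

```python
# Demo only: parameter names other than use_mxfp4_w4a4 are invented here and
# do not reflect modular_triton_fused_moe's actual signature.
def modular_fused_moe_demo(use_fp8_w8a8: bool, use_mxfp4_w4a4: bool) -> dict:
    return {"fp8_w8a8": use_fp8_w8a8, "mxfp4_w4a4": use_mxfp4_w4a4}


try:
    modular_fused_moe_demo(False)  # old-style call, missing the new argument
except TypeError as exc:
    print(exc)  # missing 1 required positional argument: 'use_mxfp4_w4a4'

modular_fused_moe_demo(False, use_mxfp4_w4a4=False)  # updated call site
```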

mgoin avatar Jun 16 '25 22:06 mgoin

I think the CI looks good, apart from some bitsandbytes test failures that seem unrelated:

[2025-06-17T16:53:16Z] FAILED quantization/test_bitsandbytes.py::test_load_4bit_bnb_model[facebook/opt-125m-quantize opt model inflight] - AssertionError
[2025-06-17T16:53:16Z] FAILED quantization/test_bitsandbytes.py::test_load_4bit_bnb_model[mistralai/Mistral-7B-Instruct-v0.3-quantize inflight model with both HF and Mistral format weights] - AssertionError
[2025-06-17T16:53:16Z] FAILED quantization/test_bitsandbytes.py::test_load_pre_quant_4bit_bnb_model[PrunaAI/Einstein-v6.1-Llama3-8B-bnb-4bit-smashed-read pre-quantized 4-bit FP4 model] - AssertionError
[2025-06-17T16:53:16Z] FAILED quantization/test_bitsandbytes.py::test_load_pre_quant_4bit_bnb_model[poedator/opt-125m-bnb-4bit-read pre-quantized 4-bit NF4 opt model] - AssertionError
[2025-06-17T16:53:16Z] FAILED quantization/test_bitsandbytes.py::test_load_8bit_bnb_model[meta-llama/Llama-Guard-3-8B-INT8-read pre-quantized llama 8-bit model] - AssertionError
[2025-06-17T16:53:16Z] FAILED quantization/test_bitsandbytes.py::test_load_8bit_bnb_model[yec019/fbopt-350m-8bit-read pre-quantized 8-bit opt model] - AssertionError
[2025-06-17T16:53:16Z] FAILED quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight] - AssertionError

fxmarty-amd avatar Jun 18 '25 09:06 fxmarty-amd

Hi @bnellnm, I'll fix the conflicts and run the tests again, then ping you for review :)

fxmarty-amd avatar Jun 27 '25 08:06 fxmarty-amd

Hi @bnellnm, I reran the tests locally and they look good. Concerning the CI:

We have

[2025-06-27T14:03:59Z] FAILED quantization/test_fp8.py::test_scaled_fp8_quant[dtype0] - AssertionError
[2025-06-27T14:03:59Z] FAILED quantization/test_fp8.py::test_scaled_fp8_quant[dtype1] - AssertionError

failing, but these were already failing on https://github.com/vllm-project/vllm/pull/20045 from a few days ago (see https://buildkite.com/vllm/ci/builds/22676#0197a8d1-7fc7-4a19-bb6d-c1664f589dc9), so this does not look related. I am not sure which PR broke these tests; one would need to bisect.

We also have

[2025-06-27T13:51:13Z] FAILED models/test_initialization.py::test_can_initialize[MiniMaxText01ForCausalLM] - pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
[2025-06-27T13:51:13Z]   Value error, The checkpoint you are trying to load has model type `minimax` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

which does not look related, and I was not getting this a week ago.

We also have some bitsandbytes tests failing, which I don't think are related, but I could investigate:

[2025-06-27T14:03:59Z] FAILED quantization/test_bitsandbytes.py::test_load_4bit_bnb_model[facebook/opt-125m-quantize opt model inflight] - AssertionError
[2025-06-27T14:03:59Z] FAILED quantization/test_bitsandbytes.py::test_load_4bit_bnb_model[mistralai/Mistral-7B-Instruct-v0.3-quantize inflight model with both HF and Mistral format weights] - AssertionError
[2025-06-27T14:03:59Z] FAILED quantization/test_bitsandbytes.py::test_load_pre_quant_4bit_bnb_model[PrunaAI/Einstein-v6.1-Llama3-8B-bnb-4bit-smashed-read pre-quantized 4-bit FP4 model] - AssertionError
[2025-06-27T14:03:59Z] FAILED quantization/test_bitsandbytes.py::test_load_pre_quant_4bit_bnb_model[poedator/opt-125m-bnb-4bit-read pre-quantized 4-bit NF4 opt model] - AssertionError
[2025-06-27T14:03:59Z] FAILED quantization/test_bitsandbytes.py::test_load_8bit_bnb_model[meta-llama/Llama-Guard-3-8B-INT8-read pre-quantized llama 8-bit model] - AssertionError
[2025-06-27T14:03:59Z] FAILED quantization/test_bitsandbytes.py::test_load_8bit_bnb_model[yec019/fbopt-350m-8bit-read pre-quantized 8-bit opt model] - AssertionError
[2025-06-27T14:03:59Z] FAILED quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight] - AssertionError

fxmarty-amd avatar Jun 27 '25 15:06 fxmarty-amd

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @fxmarty-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Jul 03 '25 06:07 mergify[bot]

Hi @bnellnm, I addressed your comments and also made this compatible with the recent Dynamo/Inductor changes in vLLM, guarding the MXFP4 dequantization & QDQ in custom ops.
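For context, the guarding idea is roughly what the generic torch.library sketch below shows: the emulated QDQ is registered as a custom op so that Dynamo/Inductor treat it as a single opaque call instead of tracing through the emulation. This is an assumption-level illustration (requires a recent PyTorch with torch.library.custom_op), not the vLLM custom-op helpers actually used in the PR; the op name and bodies are placeholders.

```python
import torch


@torch.library.custom_op("mxfp4_demo::qdq_activation", mutates_args=())
def qdq_activation(x: torch.Tensor) -> torch.Tensor:
    # Placeholder body standing in for the HIP quantize-dequantize kernel.
    return x.clone()


@qdq_activation.register_fake
def _(x: torch.Tensor) -> torch.Tensor:
    # Shape/dtype propagation only, so torch.compile can plan around the op.
    return torch.empty_like(x)


@torch.compile
def moe_activation(x: torch.Tensor) -> torch.Tensor:
    # The compiled graph sees qdq_activation as one opaque node.
    return qdq_activation(torch.nn.functional.silu(x))
```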

Let me know if this looks good!

fxmarty-amd avatar Jul 08 '25 10:07 fxmarty-amd

@bnellnm concerning the CI, the failing tests are the same bitsandbytes tests that were already failing a few weeks ago, so I think they are unrelated:

[2025-07-08T14:45:22Z] FAILED quantization/test_bitsandbytes.py::test_load_4bit_bnb_model[facebook/opt-125m-quantize opt model inflight] - AssertionError
[2025-07-08T14:45:22Z] FAILED quantization/test_bitsandbytes.py::test_load_4bit_bnb_model[mistralai/Mistral-7B-Instruct-v0.3-quantize inflight model with both HF and Mistral format weights] - AssertionError
[2025-07-08T14:45:22Z] FAILED quantization/test_bitsandbytes.py::test_load_pre_quant_4bit_bnb_model[PrunaAI/Einstein-v6.1-Llama3-8B-bnb-4bit-smashed-read pre-quantized 4-bit FP4 model] - AssertionError
[2025-07-08T14:45:22Z] FAILED quantization/test_bitsandbytes.py::test_load_pre_quant_4bit_bnb_model[poedator/opt-125m-bnb-4bit-read pre-quantized 4-bit NF4 opt model] - AssertionError
[2025-07-08T14:45:22Z] FAILED quantization/test_bitsandbytes.py::test_load_8bit_bnb_model[meta-llama/Llama-Guard-3-8B-INT8-read pre-quantized llama 8-bit model] - AssertionError
[2025-07-08T14:45:22Z] FAILED quantization/test_bitsandbytes.py::test_load_8bit_bnb_model[yec019/fbopt-350m-8bit-read pre-quantized 8-bit opt model] - AssertionError
[2025-07-08T14:45:22Z] FAILED quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model[half-intfloat/e5-mistral-7b-instruct-quantize embedding model inflight] - AssertionError

fxmarty-amd avatar Jul 08 '25 14:07 fxmarty-amd

Thanks, I'll take a look now. Bill is OOO for a bit.

mgoin avatar Jul 09 '25 00:07 mgoin

@mgoin I reran the tests in test_quark.py and kernels/moe/test_mxfp4_moe.py; they look good.

fxmarty-amd avatar Jul 09 '25 13:07 fxmarty-amd