[Model] Add PLaMo2

Open Alnusjaponica opened this issue 8 months ago • 2 comments

This PR adds support for PLaMo2, specifically PLaMo2 1B and PLaMo2 8B. PLaMo2 is a hybrid Mamba2 architecture featuring sliding-window attention.

In this PR, we have implemented the inference architecture for the PLaMo2 model.

Please note that some functionality, such as PP/TP, quantization, and LoRA, is not yet supported in this PR. We plan to address these features in follow-up PRs.
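
For reference, a minimal usage sketch once this PR lands. The checkpoint name `pfnet/plamo-2-1b` and the explicit bfloat16 dtype are assumptions on my part (see the dtype discussion further down in this thread):

```python
from vllm import LLM, SamplingParams

# Minimal usage sketch for the model added in this PR. The checkpoint name
# and the bfloat16 dtype are assumptions; trust_remote_code is needed for
# the custom PLaMo2 code on the Hub.
llm = LLM(
    model="pfnet/plamo-2-1b",
    dtype="bfloat16",
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["The capital of Japan is"], params)
print(outputs[0].outputs[0].text)
```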

Alnusjaponica avatar Mar 06 '25 03:03 Alnusjaponica

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Mar 06 '25 03:03 github-actions[bot]

This is my first time submitting a PR to this repository, and I just joined the Slack. I believe I do not have permission to unblock additional CI runs, so I would appreciate it if you could add me to the Buildkite org. If I have missed anything or if there are areas for improvement, please let me know.

Alnusjaponica avatar Mar 06 '25 03:03 Alnusjaponica

@tlrmchlsmth can you do a quick pass to check that the implementation of this model fits our architecture? No need to test for correctness since they are the model vendor.

DarkLight1337 avatar Mar 19 '25 15:03 DarkLight1337

Thank you for your review. I have corrected the parts that could be fixed immediately, and I will address the other comments within a few days.

Alnusjaponica avatar Mar 20 '25 03:03 Alnusjaponica

I'm not getting any output generated from the following simple script (on an H100). Could you take a look?

Somehow float16 is used by default in vLLM, but our model does not support it. Could you specify dtype=bfloat16 or dtype=float32? We would appreciate it if you could also tell us how to set the default dtype in the vLLM implementation.

Alnusjaponica avatar Mar 25 '25 05:03 Alnusjaponica

I'm not getting any output generated from the following simple script (on an H100). Could you take a look?

Somehow float16 is used by default in vLLM, but our model does not support it. Could you specify dtype=bfloat16 or dtype=float32? We would appreciate it if you could also tell us how to set the default dtype in the vLLM implementation.

Setting dtype to bf16 works for me.

We used to do the following for Gemma models; that could be an option in this case: https://github.com/vllm-project/vllm/blob/8b3e94a3575a0f36fe22b55d2a584b411e3b83f1/vllm/config.py#L2536-L2541
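
For context, the idea behind that removed snippet, and how it could apply here, sketched as a self-contained helper. This is illustrative only, not the actual vLLM code; the function name and the plamo2 branch are my additions:

```python
import torch

# Rough reconstruction of the removed Gemma 2 special case in vLLM's dtype
# resolution (not verbatim): float32 checkpoints default to bfloat16 for
# gemma2 rather than the usual float16 downcast. A plamo2 branch could
# follow the same pattern.
def resolve_auto_dtype(model_type: str, config_dtype: torch.dtype) -> torch.dtype:
    if config_dtype == torch.float32:
        if model_type in ("gemma2", "plamo2"):
            return torch.bfloat16
        # Common practice: use float16 for float32 checkpoints.
        return torch.float16
    return config_dtype
```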

@DarkLight1337 do you have any better ideas? (Also do you know why that snippet got removed in #14858?)

@Alnusjaponica do you know why fp16 doesn't work? Likely we are over or underflowing somewhere if this is happening, which may be possible to avoid. In any case we should log a warning if someone tries to run this model with fp16 dtype

tlrmchlsmth avatar Mar 31 '25 14:03 tlrmchlsmth

@DarkLight1337 do you have any better ideas? (Also do you know why that snippet got removed in https://github.com/vllm-project/vllm/pull/14858?)

The proper way to do it would be to update the config.json on HF side with a torch_dtype field. After setting that field, the "tensor type" on the HF model card should be shown as BF16, and using AutoModel.from_pretrained(..., torch_dtype="auto") should resolve to the correct dtype. That piece of code was removed because vLLM can now correctly read the dtype from the HF config.
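
Concretely, something like the following (the checkpoint name is an assumption, and I'm using AutoModelForCausalLM here for a causal LM). Once config.json contains `"torch_dtype": "bfloat16"`, "auto" resolves to bfloat16:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Assumes config.json on the Hub has been updated with:
#   "torch_dtype": "bfloat16"
# The checkpoint name below is an assumption.
config = AutoConfig.from_pretrained("pfnet/plamo-2-1b", trust_remote_code=True)
print(config.torch_dtype)  # should show torch.bfloat16 after the config update

model = AutoModelForCausalLM.from_pretrained(
    "pfnet/plamo-2-1b",
    torch_dtype="auto",  # resolves to the torch_dtype field from config.json
    trust_remote_code=True,
)
print(model.dtype)
```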

DarkLight1337 avatar Mar 31 '25 14:03 DarkLight1337

@DarkLight1337 do you have any better ideas? (Also do you know why that snippet got removed in #14858?)

The proper way to do it would be to update the config.json on HF side with a torch_dtype field. After setting that field, the "tensor type" on the HF model card should be shown as BF16, and using AutoModel.from_pretrained(..., torch_dtype="auto") should resolve to the correct dtype. That piece of code was removed because vLLM can now correctly read the dtype from the HF config.

I think the actual weights are stored as fp32 though - is that a problem?

tlrmchlsmth avatar Mar 31 '25 16:03 tlrmchlsmth

I think that using AutoModel.from_pretrained without setting torch_dtype should load the weights in the original dtype; you can try it.
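
For example, a quick check along these lines (checkpoint name assumed):

```python
from transformers import AutoModelForCausalLM

# Load without specifying torch_dtype and inspect what the parameters end up as.
model = AutoModelForCausalLM.from_pretrained(
    "pfnet/plamo-2-1b", trust_remote_code=True)
print(next(model.parameters()).dtype)
```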

DarkLight1337 avatar Mar 31 '25 16:03 DarkLight1337

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @Alnusjaponica.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Apr 01 '25 09:04 mergify[bot]

Thanks for your suggestions. It seems the weights' dtype is downcast here: https://github.com/vllm-project/vllm/blob/0a298ea4181a9b324afdb24876560b7e98e69bba/vllm/config.py#L2649 So I have decided to edit vllm/config.py to use bfloat16 by default.

@Alnusjaponica do you know why fp16 doesn't work? Likely we are over or underflowing somewhere if this is happening, which may be possible to avoid. In any case we should log a warning if someone tries to run this model with fp16 dtype

We think you're right; it seems numerically unstable in the attention layer. As you mentioned, it may be possible to avoid this issue, but the implementation cost might not be worth it. For now, I have added a warning log when both the plamo2 model and float16 are specified.
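
For reference, the check is conceptually along these lines. This is an illustrative sketch, not the exact code added in this PR; the helper name is hypothetical:

```python
import torch
from vllm.logger import init_logger

logger = init_logger(__name__)

# Illustrative sketch of the float16 warning described above.
def warn_if_plamo2_float16(model_type: str, dtype: torch.dtype) -> None:
    if model_type == "plamo2" and dtype == torch.float16:
        logger.warning(
            "PLaMo2 is numerically unstable in float16 (the attention layer "
            "may over/underflow); please use bfloat16 or float32 instead.")
```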

I also noticed that this PR is affected by https://github.com/vllm-project/vllm/issues/15238 after I merged the latest changes from the main branch, so I need to specify VLLM_ATTENTION_BACKEND=XFORMERS.
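
For anyone reproducing this, one way to pin the backend from a script (the model name is an assumption; the environment variable must be set before the engine is created):

```python
import os

# Work around the issue referenced above by forcing the xFormers attention
# backend before constructing the engine.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM

llm = LLM(model="pfnet/plamo-2-1b", dtype="bfloat16", trust_remote_code=True)
```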

Alnusjaponica avatar Apr 01 '25 11:04 Alnusjaponica

I also noticed that this PR is affected by #15238 after I merged the latest changes from the main branch, so I need to specify VLLM_ATTENTION_BACKEND=XFORMERS.

Does this still happen even after recompiling?

We think you're right, and it seems numerically unstable in the attention layer. As you mentioned, it may be possible to avoid this issue, but the implementation cost might not be worth it. For now, I have added a warning log when both the plamo2 model and float16 are specified.

That sounds good to me.

tlrmchlsmth avatar Apr 01 '25 20:04 tlrmchlsmth

Does this still happen even after recompiling?

It was resolved after I recompiled. Thanks a lot.

We're going to fix PlamoConfig.model_type in modeling_plamo.py to "plamo2". This change will resolve the warning:

You are using a model of type plamo2 to instantiate a model of type plamo. This is not supported for all configurations of models and can yield errors.

raised from here: https://github.com/huggingface/transformers/blob/6f5dc9c82efd347bcc1941da64739d269e741771/src/transformers/configuration_utils.py#L564-L569

We're also going to implement PlamoConfig.layers_block_type and, hopefully, rename all the "Plamo"-prefixed objects to "Plamo2"-prefixed ones.
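
Roughly, the HF-side change we have in mind looks like this. It is an illustrative sketch only; the constructor arguments and the hybrid layer schedule shown are placeholders, not the actual modeling_plamo.py code:

```python
from transformers import PretrainedConfig

class Plamo2Config(PretrainedConfig):
    # Matching model_type avoids the "model of type plamo2 to instantiate a
    # model of type plamo" warning from transformers.
    model_type = "plamo2"

    def __init__(self, num_hidden_layers: int = 32, mamba_step: int = 2, **kwargs):
        self.num_hidden_layers = num_hidden_layers
        self.mamba_step = mamba_step
        super().__init__(**kwargs)

    @property
    def layers_block_type(self) -> list[str]:
        # Placeholder hybrid schedule: every `mamba_step`-th layer is
        # attention, the rest are mamba. The real PLaMo2 schedule may differ.
        return [
            "attention" if (i + 1) % self.mamba_step == 0 else "mamba"
            for i in range(self.num_hidden_layers)
        ]
```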

I'll update this PR after those public model changes. Is there anything else that needs to be fixed in this PR?

Alnusjaponica avatar Apr 02 '25 11:04 Alnusjaponica

@Alnusjaponica thanks, let me know when those changes are in!

I'll update this PR after those public model changes. Is there anything else that needs to be fixed in this PR?

I don't see anything that needs to be fixed, but I will re-emphasize that, if possible, it would be better to try to use MambaMixer2. I understand that this may not make sense due to differences in the model definition, but it would make it easier to pick up performance improvements and fixes to the mixer layer.

tlrmchlsmth avatar Apr 02 '25 15:04 tlrmchlsmth

@DarkLight1337 Thanks for your review. I've updated the tests. @tlrmchlsmth Now that the public model has been updated and I've updated this PR accordingly, could you take another look?

Alnusjaponica avatar Apr 07 '25 11:04 Alnusjaponica

Thanks for your approval! I found that some tests fail after I merged the main branch, so I am trying to fix them. I would appreciate it if you could take another look once the tests pass.

Alnusjaponica avatar Apr 15 '25 05:04 Alnusjaponica

It seems that unrelated parts of the tests are still failing. I'll wait for the main branch to be fixed.

Alnusjaponica avatar Apr 15 '25 07:04 Alnusjaponica

Can you update the model tests in https://github.com/vllm-project/vllm/blob/70e7ed841d1ab498b51e842a43157432cb752a97/.buildkite/test-pipeline.yaml#L396 to run your model separately from the rest of the models? There is a known memory-leak issue for Llama 4 in those tests, so perhaps that's also the case for your model.

DarkLight1337 avatar Apr 15 '25 08:04 DarkLight1337

Sure. It passed in my environment, so I thought it was not relevant either. Let me first see if the other tests pass with the latest main branch.

Alnusjaponica avatar Apr 15 '25 12:04 Alnusjaponica

@DarkLight1337 Separating plamo2 makes the Basic Model Test pass, so that might be the case for our models too. Is there anything I can do to fix this?

Alnusjaponica avatar Apr 16 '25 02:04 Alnusjaponica

The problem doesn't appear in V1, so let's just merge this.

DarkLight1337 avatar Apr 16 '25 02:04 DarkLight1337

Thanks for the approval! Next, I will submit a follow-up PR to add support for TP/PP, etc.

Alnusjaponica avatar Apr 16 '25 04:04 Alnusjaponica