[Model] Add PLaMo2
This PR adds support for PLaMo2, specifically PLaMo2 1B and PLaMo2 8B. PLaMo2 is a hybrid Mamba2 architecture featuring sliding-window attention.
In this PR, we have implemented the inference architecture for the PLaMo2 model.
Please note that some functionality, such as PP/TP, quantization, and LoRA, is not yet supported in this PR. We plan to address these features in follow-up PRs.
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
This is my first time submitting a PR to this repository, and I just joined the Slack. I believe I do not have permission to unblock additional CIs, so I would appreciate it if you could add me to the Buildkite org. If I have missed anything or if there are areas for improvement, please do let me know.
@tlrmchlsmth can you do a quick pass to check that the implementation of this model fits our architecture? No need to test for correctness, since they are the model vendor.
Thank you for your review. I have corrected the parts that could be fixed immediately, and I will address the other comments within a few days.
I'm not getting any output generated from the following simple script (on an H100). Could you take a look?
Somehow float16 is used by default in vLLM, but our model does not support it. Could you specify dtype=bfloat16 or dtype=float32? We would appreciate it if you could also tell us how to set the default dtype in the vLLM implementation.
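For reference, a minimal offline-inference sketch that passes the dtype explicitly so vLLM does not fall back to float16. The checkpoint name pfnet/plamo-2-1b, the prompt, and the use of trust_remote_code are placeholders/assumptions; adjust them to the model you are testing:

```python
from vllm import LLM, SamplingParams

# Minimal sketch: set dtype explicitly, since PLaMo2 does not support float16.
llm = LLM(
    model="pfnet/plamo-2-1b",   # illustrative checkpoint name
    dtype="bfloat16",           # or "float32"
    trust_remote_code=True,     # may be needed for the HF-side config/modeling code
)

outputs = llm.generate(
    ["The capital of Japan is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```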
I'm not getting any output generated from the following simple script (on an H100). Could you take a look?
Somehow float16 is used by default in vLLM, but our model does not support it. Could you specify dtype=bfloat16 or dtype=float32? We would appreciate it if you could also tell us how to set the default dtype in the vLLM implementation.
Setting dtype to bf16 works for me.
We used to do the following for gemma models. That could be an option in this case:
https://github.com/vllm-project/vllm/blob/8b3e94a3575a0f36fe22b55d2a584b411e3b83f1/vllm/config.py#L2536-L2541
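For context, the removed logic at that permalink was roughly of the following shape. This is a paraphrased sketch from memory, not the exact code at that commit; the variable names approximate the surrounding dtype-resolution helper in vllm/config.py:

```python
# Paraphrased sketch of the old Gemma 2 special case in the dtype-resolution code.
if config.model_type == "gemma2" and config_dtype == torch.float32:
    logger.info(
        "For Gemma 2, we downcast float32 to bfloat16 instead of float16 "
        "by default. Please specify `dtype` if you want to use float16.")
    torch_dtype = torch.bfloat16
```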
@DarkLight1337 do you have any better ideas? (Also do you know why that snippet got removed in #14858?)
@Alnusjaponica do you know why fp16 doesn't work? Likely we are over or underflowing somewhere if this is happening, which may be possible to avoid. In any case we should log a warning if someone tries to run this model with fp16 dtype
@DarkLight1337 do you have any better ideas? (Also do you know why that snippet got removed in https://github.com/vllm-project/vllm/pull/14858?)
The proper way to do it would be to update the config.json on HF side with a torch_dtype field. After setting that field, the "tensor type" on the HF model card should be shown as BF16, and using AutoModel.from_pretrained(..., torch_dtype="auto") should resolve to the correct dtype. That piece of code was removed because vLLM can now correctly read the dtype from the HF config.
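As an illustration of that resolution path, once config.json carries a "torch_dtype": "bfloat16" entry, the behavior can be checked like this. The model id is a placeholder, and AutoModelForCausalLM / trust_remote_code are assumptions for a causal-LM checkpoint with remote code:

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Illustrative check; the model id is a placeholder for the PLaMo2 checkpoint.
config = AutoConfig.from_pretrained("pfnet/plamo-2-1b", trust_remote_code=True)
print(config.torch_dtype)  # expected: torch.bfloat16 once config.json is updated

model = AutoModelForCausalLM.from_pretrained(
    "pfnet/plamo-2-1b",
    torch_dtype="auto",        # resolves to the torch_dtype recorded in config.json
    trust_remote_code=True,
)
print(model.dtype)
```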
@DarkLight1337 do you have any better ideas? (Also do you know why that snippet got removed in #14858?)
The proper way to do it would be to update the config.json on HF side with a torch_dtype field. After setting that field, the "tensor type" on the HF model card should be shown as BF16, and using AutoModel.from_pretrained(..., torch_dtype="auto") should resolve to the correct dtype. That piece of code was removed because vLLM can now correctly read the dtype from the HF config.
I think the actual weights are stored as fp32 though - is that a problem?
I think that using AutoModel.from_pretrained without setting torch_dtype should load the weights in the original dtype, you can try it.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @Alnusjaponica.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Thanks for your suggestions. It seems the weights' dtype is downcast here:
https://github.com/vllm-project/vllm/blob/0a298ea4181a9b324afdb24876560b7e98e69bba/vllm/config.py#L2649
So I have decided to edit vllm/config.py to use bfloat16 by default.
@Alnusjaponica do you know why fp16 doesn't work? Likely we are over or underflowing somewhere if this is happening, which may be possible to avoid. In any case we should log a warning if someone tries to run this model with fp16 dtype
We think you're right; it seems to be numerically unstable in the attention layer. As you mentioned, it may be possible to avoid this issue, but the implementation cost might not be worth it. For now, I have added a warning log for when both the plamo2 model and float16 are specified.
I also noticed that this PR is affected by https://github.com/vllm-project/vllm/issues/15238 after I merge the latest changes from the main branch, so I need to specify VLLM_ATTENTION_BACKEND=XFORMERS.
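A minimal sketch of applying that workaround from Python, assuming the environment variable is set before vLLM initializes its attention backend (i.e. before the LLM is created); the model name is again a placeholder:

```python
import os

# Force the xFormers attention backend as a workaround for the issue above.
# This must be set before vLLM selects its attention backend.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM  # imported after setting the env var

llm = LLM(model="pfnet/plamo-2-1b", dtype="bfloat16", trust_remote_code=True)
```

Equivalently, from the shell: VLLM_ATTENTION_BACKEND=XFORMERS python your_script.py.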
I also noticed that this PR is affected by #15238 after I merge the latest changes from the main branch, so I need to specify VLLM_ATTENTION_BACKEND=XFORMERS.
Does this still happen even after recompiling?
We think you're right; it seems to be numerically unstable in the attention layer. As you mentioned, it may be possible to avoid this issue, but the implementation cost might not be worth it. For now, I have added a warning log for when both the plamo2 model and float16 are specified.
That sounds good to me.
Does this still happen even after recompiling?
It resolved after I recompiled. Thanks a lot.
We're going to change PlamoConfig.model_type in modeling_plamo.py to "plamo2". This change will resolve the warning
You are using a model of type plamo2 to instantiate a model of type plamo. This is not supported for all configurations of models and can yield errors.
raised from here: https://github.com/huggingface/transformers/blob/6f5dc9c82efd347bcc1941da64739d269e741771/src/transformers/configuration_utils.py#L564-L569
We're also going to implement PlamoConfig.layers_block_type and, hopefully, rename all the "Plamo"-prefixed objects to "Plamo2"-prefixed ones.
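A rough sketch of what those config-side changes could look like. The field names and the hybrid-layout rule below are illustrative assumptions; the published modeling_plamo.py is authoritative:

```python
from transformers import PretrainedConfig


class Plamo2Config(PretrainedConfig):
    """Illustrative sketch only; the real PLaMo2 config may differ."""

    # Setting model_type to "plamo2" avoids the
    # "model of type plamo2 to instantiate a model of type plamo" warning.
    model_type = "plamo2"

    def __init__(self, num_hidden_layers: int = 32, mamba_step: int = 2, **kwargs):
        self.num_hidden_layers = num_hidden_layers
        self.mamba_step = mamba_step
        super().__init__(**kwargs)

    @property
    def layers_block_type(self) -> list[str]:
        # Assumed hybrid layout: every `mamba_step`-th layer uses attention,
        # the rest use Mamba. The actual PLaMo2 rule may be different.
        return [
            "attention" if (i + 1) % self.mamba_step == 0 else "mamba"
            for i in range(self.num_hidden_layers)
        ]
```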
I'll update this PR after those public model changes. Is there anything else that needs to be fixed in this PR?
@Alnusjaponica thanks, let me know when those changes are in!
I'll update this PR after those public model changes. Is there anything else that needs to be fixed in this PR?
I don't see anything that needs to be fixed, but I will reemphasize that if possible it would be better to try to use MambaMixer2. I understand that this may not make sense due to differences in the model definition, but it would make it easier to pick up performance improvements and fixes to the mixer layer.
@DarkLight1337 Thanks for your review. I've updated the tests. @tlrmchlsmth Now that the public model is updated and I’ve also updated this PR according to the new changes, could you take another look?
Thanks for your approval! I found that some tests fail after I merged the main branch, so I am trying to fix them. I would appreciate it if you could take another look once the tests pass.
It seems some unrelated tests are still failing. I'll wait for the main branch to be fixed.
Can you update the model tests in https://github.com/vllm-project/vllm/blob/70e7ed841d1ab498b51e842a43157432cb752a97/.buildkite/test-pipeline.yaml#L396 to run your model separately from the rest of the models? There is a known memory leak issue for Llama 4 in those tests, so perhaps that's also the case for your model.
Sure. It passed in my environment, so I thought it was not relevant either. Let me first check whether the other tests pass with the latest main branch.
@DarkLight1337
Separating plamo2 makes the Basic Model Test pass, so that might be the case for our models as well. Is there anything I can do to fix this?
The problem doesn't appear in V1, so let's just merge this.
Thanks for the approval! Next, I will submit a follow-up PR to add support for TP/PP, etc.