[Models] Add remaining model PP support
Add all other possible models for PP
FIX #7684
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fastcheck build on the Buildkite UI.
Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).
To run full CI, you can do one of these:
- Comment `/ready` on the PR
- Add the `ready` label to the PR
- Enable auto-merge.
🚀
Thanks for the great effort! Please lmk when it is ready for review.
I think you can already start taking a look @youkaichao, together with @ywang96. The major thing here is to ensure all the models still work after this change.
Hey @ywang96 can you take a quick look at vision model stuff?
@ywang96 Should be rebased for you - it's untested though!
@ywang96 is busy this week so I'll be taking over the review. I'll add each model to the PP test and test them locally.
Sorry for the delay!
@andoorve I'm running the tests locally and found a few failing models so I'm debugging them now. Meanwhile, can you take a look at #9000? It simplifies the process of adding new PP models.
Update: #9000 is now part of this PR as well. We should merge #9000 first though to reduce the diffs.
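For anyone following along, here is a minimal, self-contained sketch of the kind of layer partitioning a PP-enabled model has to do; the helper name and the splitting rule below are purely illustrative stand-ins, not the actual utilities from vLLM or #9000:

```python
# Illustrative only: splitting transformer layers across pipeline stages.
# `pp_layer_range` is a hypothetical helper, not vLLM's actual API.
def pp_layer_range(num_layers: int, pp_rank: int, pp_size: int) -> tuple[int, int]:
    """Return the [start, end) slice of layers owned by pipeline stage `pp_rank`."""
    base = num_layers // pp_size
    remainder = num_layers % pp_size
    # The first `remainder` stages each take one extra layer.
    start = pp_rank * base + min(pp_rank, remainder)
    end = start + base + (1 if pp_rank < remainder else 0)
    return start, end

# A 32-layer model on 4 stages: [(0, 8), (8, 16), (16, 24), (24, 32)]
print([pp_layer_range(32, r, 4) for r in range(4)])
```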
The vast majority of models are confirmed to work with PP now. My machine doesn't have enough memory to test the following models though:
- Arctic (479B model)
- DBRX (132B model)
@ywang96 are you able to help test these models? You can adjust `tp_base` and/or `pp_base` in the tests.
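To make the `tp_base`/`pp_base` suggestion concrete, here is a rough, self-contained sketch of what those knobs control in the PP test matrix; the class and function names below are illustrative stand-ins, not the actual helpers in tests/distributed/test_pipeline_parallel.py:

```python
# Stand-in names for illustration only; the real per-model settings live in
# tests/distributed/test_pipeline_parallel.py.
from dataclasses import dataclass

@dataclass
class ParallelSetup:
    tp_size: int  # tensor-parallel degree within each pipeline stage
    pp_size: int  # number of pipeline stages

def parallel_setups(tp_base: int, pp_base: int) -> list[ParallelSetup]:
    # Bumping tp_base gives each shard of a very large model (Arctic, DBRX)
    # more GPUs so its weights fit in memory; bumping pp_base adds stages.
    return [
        ParallelSetup(tp_base, pp_base),
        ParallelSetup(tp_base, 2 * pp_base),
        ParallelSetup(2 * tp_base, pp_base),
    ]

# Each setup needs tp_size * pp_size GPUs in total:
for s in parallel_setups(tp_base=2, pp_base=2):
    print(f"TP={s.tp_size} PP={s.pp_size} -> {s.tp_size * s.pp_size} GPUs")
```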
The following models support PP but cannot be tested:
- Phi-3.5-MoE (cannot be used in online serving; I get the error in #8553)
- XVERSE (the model cannot be used at all in the current version of transformers/vLLM because of an outdated tokenizer config)
The following models still don't support PP:
- Jamba (I am not sure how this would work with the inner state of the model)
- Encoder-decoder models (not implemented in the model runner)
  - BART
  - Llama 3.2 Vision
@DarkLight1337 Thank you for your help! For #9000, can we pull all the changes into this PR and push it together?
This PR already includes all the changes from that PR.
I see you mentioned this:
> We should merge https://github.com/vllm-project/vllm/pull/9000 first though to reduce the diffs.
Are the diffs resolved in this PR? I.e. can we merge this directly and delete #9000?
Yes.
Need to ungate Pixtral for testing: https://huggingface.co/mistralai/Pixtral-12B-2409
@youkaichao @DarkLight1337 Possible to add this model to the HF account of the testing token?
added by @simon-mo
It looks like Pixtral causes an OOM.
Hmm, how is that possible? Pixtral is only 12B and we have 4x L4 GPUs.
Also tested DBRX and Arctic locally and they passed. I didn't add them to the test configs because I had to modify the configs to get them to work.
Decoder-only LM models test failures:
[2024-10-03T21:16:08Z] FAILED models/decoder_only/language/test_big_models.py::test_models[32-half-EleutherAI/gpt-j-6b] - KeyError: 'transformer.h.9.ln_1.weight'
[2024-10-03T21:16:08Z] FAILED models/decoder_only/language/test_big_models.py::test_model_print[half-EleutherAI/gpt-j-6b] - KeyError: 'transformer.h.9.ln_1.weight'
[2024-10-03T21:16:08Z] FAILED models/decoder_only/language/test_models.py::test_models[96-float-facebook/opt-125m] - RuntimeError: value cannot be converted to type at::Half without overflow
Fixed errors 2 and 3; error 1 passed when testing locally.
Looks like the fix to Pixtral worked.
@youkaichao @DarkLight1337 tests pass fully
Thanks for your hard work as well!
@andoorve @DarkLight1337 thanks for the great work!!!
@youkaichao - Is this change now available in version 0.6.2? I have a requirement to load LLaMA 3.2 90B vision model across four GPUs spread across two nodes using pipeline parallel.
Not yet, but you can always use the latest wheel from the main branch. See https://docs.vllm.ai/en/latest/getting_started/installation.html#install-the-latest-code
Can you please confirm whether the latest wheel supports PP for Llama 3.2 11B Instruct?
Llama-3.2-Vision specifically isn't PP-supported because PP isn't supported in general for encoder-decoder models yet.
@DarkLight1337 - Is PP supported for the Databricks DBRX model (databricks/dbrx-instruct)?
Yes. Please check out the Supported Models page for more details.
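For reference, here is a minimal sketch of how PP is specified when loading DBRX through the Python API, assuming a single node with 8 GPUs and enough memory for the weights; depending on your vLLM version, online serving (`vllm serve databricks/dbrx-instruct --tensor-parallel-size 4 --pipeline-parallel-size 2`) may be the better-tested path for PP:

```python
# Rough sketch: DBRX split into 2 pipeline stages, each stage sharded across
# 4 GPUs (8 GPUs total). Adjust the sizes to match your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="databricks/dbrx-instruct",
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```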