[Models] Add remaining model PP support
Add all other possible models for PP
FIX #7684
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fastcheck build on the Buildkite UI.
Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).
To run full CI, you can do one of these:
- Comment `/ready` on the PR
- Add the `ready` label to the PR
- Enable auto-merge.
🚀
Thanks for the great effort! Please lmk when it is ready for review.
I think you can already start taking a look @youkaichao, together with @ywang96. The major thing here is to ensure all the models still work after this change.
Hey @ywang96 can you take a quick look at vision model stuff?
@ywang96 Should be rebased for you - it's untested though!
@ywang96 is busy this week so I'll be taking over the review. I'll add each model to the PP test and test them locally.
Sorry for the delay!
@andoorve I'm running the tests locally and found a few failing models so I'm debugging them now. Meanwhile, can you take a look at #9000? It simplifies the process of adding new PP models.
Update: #9000 is now part of this PR as well. We should merge #9000 first though to reduce the diffs.
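For anyone following along, here is a minimal, self-contained sketch of the kind of layer partitioning a PP-enabled model has to do; the helper name and the splitting rule below are purely illustrative stand-ins, not the actual utilities from vLLM or #9000:

```python
# Illustrative only: splitting transformer layers across pipeline stages.
# `pp_layer_range` is a hypothetical helper, not vLLM's actual API.
def pp_layer_range(num_layers: int, pp_rank: int, pp_size: int) -> tuple[int, int]:
    """Return the [start, end) slice of layers owned by pipeline stage `pp_rank`."""
    base = num_layers // pp_size
    remainder = num_layers % pp_size
    # The first `remainder` stages each take one extra layer.
    start = pp_rank * base + min(pp_rank, remainder)
    end = start + base + (1 if pp_rank < remainder else 0)
    return start, end

# A 32-layer model on 4 stages: [(0, 8), (8, 16), (16, 24), (24, 32)]
print([pp_layer_range(32, r, 4) for r in range(4)])
```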
The vast majority of models are confirmed to work with PP now. My machine doesn't have enough memory to test the following models though:
- Arctic (479B model)
- DBRX (132B model)
@ywang96 are you able to help test these models? You can adjust `tp_base` and/or `pp_base` in the tests.
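To make the `tp_base`/`pp_base` suggestion concrete, here is a rough, self-contained sketch of what those knobs control in the PP test matrix; the class and function names below are illustrative stand-ins, not the actual helpers in tests/distributed/test_pipeline_parallel.py:

```python
# Stand-in names for illustration only; the real per-model settings live in
# tests/distributed/test_pipeline_parallel.py.
from dataclasses import dataclass

@dataclass
class ParallelSetup:
    tp_size: int  # tensor-parallel degree within each pipeline stage
    pp_size: int  # number of pipeline stages

def parallel_setups(tp_base: int, pp_base: int) -> list[ParallelSetup]:
    # Bumping tp_base gives each shard of a very large model (Arctic, DBRX)
    # more GPUs so its weights fit in memory; bumping pp_base adds stages.
    return [
        ParallelSetup(tp_base, pp_base),
        ParallelSetup(tp_base, 2 * pp_base),
        ParallelSetup(2 * tp_base, pp_base),
    ]

# Each setup needs tp_size * pp_size GPUs in total:
for s in parallel_setups(tp_base=2, pp_base=2):
    print(f"TP={s.tp_size} PP={s.pp_size} -> {s.tp_size * s.pp_size} GPUs")
```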
The following models support PP but cannot be tested:
- Phi-3.5-MoE (cannot be used in online serving; I get the error in #8553)
- XVERSE (the model cannot be used at all in the current version of transformers/vLLM because of an outdated tokenizer config)
The following models still don't support PP:
- Jamba (I am not sure how this would work with the inner state of the model)
- Encoder-decoder models (not implemented in the model runner)
  - BART
  - Llama 3.2 Vision
@DarkLight1337 Thank you for your help! For #9000, can we pull all the changes into this PR and push it together?
This PR already includes all the changes from that PR.
I see you mentioned this:
> We should merge https://github.com/vllm-project/vllm/pull/9000 first though to reduce the diffs.
Are the diffs resolved in this PR? I.e. can we merge this directly and delete #9000?
Yes.
Need to ungate Pixtral for testing: https://huggingface.co/mistralai/Pixtral-12B-2409
@youkaichao @DarkLight1337 Possible to add this model to the HF account of the testing token?
added by @simon-mo
It looks like Pixtral causes an OOM.
Hmm, how is that possible? Pixtral is only 12B and we have 4x L4 GPUs.
Also tested DBRX and Arctic locally and they passed. I didn't add them to the test configs because I had to modify the configs to get them to work.
Decoder-only LM models test failures:
[2024-10-03T21:16:08Z] FAILED models/decoder_only/language/test_big_models.py::test_models[32-half-EleutherAI/gpt-j-6b] - KeyError: 'transformer.h.9.ln_1.weight'
[2024-10-03T21:16:08Z] FAILED models/decoder_only/language/test_big_models.py::test_model_print[half-EleutherAI/gpt-j-6b] - KeyError: 'transformer.h.9.ln_1.weight'
[2024-10-03T21:16:08Z] FAILED models/decoder_only/language/test_models.py::test_models[96-float-facebook/opt-125m] - RuntimeError: value cannot be converted to type at::Half without overflow
Fixed errors 2 and 3; error 1 passed when testing locally.
Looks like the fix to Pixtral worked.
@youkaichao @DarkLight1337 tests pass fully
Thanks for your hard work as well!
@andoorve @DarkLight1337 thanks for the great work!!!
@youkaichao - Is this change now available in version 0.6.2? I have a requirement to load LLaMA 3.2 90B vision model across four GPUs spread across two nodes using pipeline parallel.
Not yet, but you can always use the latest wheel from the main branch. See https://docs.vllm.ai/en/latest/getting_started/installation.html#install-the-latest-code
Can you please confirm whether the latest wheel supports PP for Llama 3.2 11B Instruct?
Llama-3.2-Vision specifically isn't PP-supported because PP isn't supported in general for encoder-decoder models yet.
@DarkLight1337 - Is PP supported for the Databricks DBRX model (databricks/dbrx-instruct)?
Yes. Please check out the Supported Models page for more details.
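For reference, here is a minimal sketch of how PP is specified when loading DBRX through the Python API, assuming a single node with 8 GPUs and enough memory for the weights; depending on your vLLM version, online serving (`vllm serve databricks/dbrx-instruct --tensor-parallel-size 4 --pipeline-parallel-size 2`) may be the better-tested path for PP:

```python
# Rough sketch: DBRX split into 2 pipeline stages, each stage sharded across
# 4 GPUs (8 GPUs total). Adjust the sizes to match your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="databricks/dbrx-instruct",
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```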