
[Models] Add remaining model PP support

Open · andoorve opened this issue 1 year ago • 20 comments

Add PP support for all remaining models that can support it

FIX #7684
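
For readers new to what "PP support" means at the model level, the sketch below shows the core idea, assuming a plain decoder-only stack: each pipeline rank owns a contiguous slice of layers, the first rank embeds tokens, and the last rank produces logits. It is a framework-agnostic illustration and does not use vLLM's actual model interfaces.

```python
# Framework-agnostic sketch of pipeline-parallel model partitioning
# (illustrative only; vLLM's real interfaces for this differ).
import torch
import torch.nn as nn

def layer_range(num_layers: int, pp_rank: int, pp_size: int) -> tuple[int, int]:
    """Contiguous [start, end) slice of layers owned by one pipeline rank."""
    per_rank = (num_layers + pp_size - 1) // pp_size
    start = pp_rank * per_rank
    return start, min(start + per_rank, num_layers)

class PipelineStage(nn.Module):
    def __init__(self, num_layers: int, hidden: int, vocab: int, pp_rank: int, pp_size: int):
        super().__init__()
        self.is_first = pp_rank == 0
        self.is_last = pp_rank == pp_size - 1
        start, end = layer_range(num_layers, pp_rank, pp_size)
        # Only the first rank holds the embedding; only the last rank holds the LM head.
        self.embed = nn.Embedding(vocab, hidden) if self.is_first else None
        self.layers = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(start, end))
        self.lm_head = nn.Linear(hidden, vocab) if self.is_last else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: token ids on the first rank, hidden states received from the previous rank otherwise.
        h = self.embed(x) if self.is_first else x
        for layer in self.layers:
            h = layer(h)
        # Non-final ranks return hidden states to be sent on to the next pipeline rank.
        return self.lm_head(h) if self.is_last else h
```

Adding PP support to a concrete model is essentially expressing this split with the project's own helpers; the per-model changes in this PR do that for the remaining architectures.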

andoorve avatar Aug 05 '24 18:08 andoorve

👋 Hi! Thank you for contributing to the vLLM project. Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add the ready label to the PR
  • Enable auto-merge

🚀

github-actions[bot] avatar Aug 05 '24 18:08 github-actions[bot]

Thanks for the great effort! Please lmk when it is ready for review.

youkaichao avatar Aug 05 '24 22:08 youkaichao

I think you can already start taking a look @youkaichao with @ywang96. I think the major thing here is to ensure all the models work after this change.

andoorve avatar Aug 05 '24 23:08 andoorve

Hey @ywang96, can you take a quick look at the vision model changes?

andoorve avatar Sep 04 '24 00:09 andoorve

@ywang96 Should be rebased for you - it's untested though!

andoorve avatar Sep 24 '24 06:09 andoorve

@ywang96 is busy this week, so I'll be taking over the review. I'll add each model to the PP tests and test them locally.

Sorry for the delay!

DarkLight1337 avatar Sep 30 '24 15:09 DarkLight1337

@andoorve I'm running the tests locally and found a few failing models so I'm debugging them now. Meanwhile, can you take a look at #9000? It simplifies the process of adding new PP models.

Update: #9000 is now part of this PR as well. We should merge #9000 first though to reduce the diffs.

DarkLight1337 avatar Oct 03 '24 05:10 DarkLight1337

The vast majority of models are confirmed to work with PP now. My machine doesn't have enough memory to test the following models though:

  • Arctic (479B model)
  • DBRX (132B model)

@ywang96 are you able to help test these models? You can adjust tp_base and/or pp_base in the tests.
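
For reference, here is a minimal sketch of what adjusting those knobs looks like, assuming a per-model settings table along the lines of the PP test file (the class and dictionary names below are illustrative stand-ins, not vLLM's actual test helpers):

```python
# Illustrative stand-in for the per-model PP test settings; the real definitions live
# in tests/distributed/test_pipeline_parallel.py and may be named differently.
from dataclasses import dataclass

@dataclass(frozen=True)
class PPTestSettings:
    tp_base: int = 1  # GPUs per pipeline stage in the smallest tested configuration
    pp_base: int = 2  # number of pipeline stages in the smallest tested configuration

# Hypothetical entries: raise tp_base so a large checkpoint fits across more GPUs.
MODEL_SETTINGS = {
    "databricks/dbrx-instruct": PPTestSettings(tp_base=4),
    "Snowflake/snowflake-arctic-instruct": PPTestSettings(tp_base=8),
}
```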

The following models support PP but could not be tested:

  • Phi-3.5-MoE (cannot be used in online serving; I hit the error reported in #8553)
  • XVERSE (the model cannot be used with the current version of transformers/vLLM at all because of an outdated tokenizer config)

The following models still don't support PP:

  • Jamba (I am not sure how this would work with the model's internal state)
  • Encoder-decoder models (not implemented in the model runner)
    • BART
    • Llama 3.2 Vision

DarkLight1337 avatar Oct 03 '24 13:10 DarkLight1337

@DarkLight1337 Thank you for your help! For #9000, can we pull all the changes into this PR and push it together?

andoorve avatar Oct 03 '24 19:10 andoorve

@DarkLight1337 Thank you for your help! For #9000, can we pull all the changes into this PR and push it together?

This PR already includes all the changes from that PR.

DarkLight1337 avatar Oct 03 '24 19:10 DarkLight1337

I see you mentioned this:

We should merge https://github.com/vllm-project/vllm/pull/9000 first though to reduce the diffs.

Are the diffs resolved in this PR? I.e. can we merge this directly and delete #9000?

andoorve avatar Oct 03 '24 19:10 andoorve

I see you mentioned this:

We should merge #9000 first though to reduce the diffs.

Are the diffs resolved in this PR? I.e. can we merge this directly and delete #9000?

Yes.

DarkLight1337 avatar Oct 03 '24 19:10 DarkLight1337

Need to ungate Pixtral for testing: https://huggingface.co/mistralai/Pixtral-12B-2409

andoorve avatar Oct 03 '24 19:10 andoorve

@youkaichao @DarkLight1337 Possible to add this model to the HF account of the testing token?

andoorve avatar Oct 03 '24 19:10 andoorve

@youkaichao @DarkLight1337 Possible to add this model to the HF account of the testing token?

Added by @simon-mo.

youkaichao avatar Oct 03 '24 20:10 youkaichao

It looks like Pixtral causes an OOM.

youkaichao avatar Oct 03 '24 21:10 youkaichao

Hmm, how is that possible? Pixtral is only 12B and we have 4 L4s.

andoorve avatar Oct 03 '24 21:10 andoorve

Also tested DBRX and Arctic locally and they passed. I didn't add them to the test configs because I had to modify the configs to get them to run.

andoorve avatar Oct 03 '24 21:10 andoorve

Decoder-only LM model test failures:

[2024-10-03T21:16:08Z] FAILED models/decoder_only/language/test_big_models.py::test_models[32-half-EleutherAI/gpt-j-6b] - KeyError: 'transformer.h.9.ln_1.weight'
[2024-10-03T21:16:08Z] FAILED models/decoder_only/language/test_big_models.py::test_model_print[half-EleutherAI/gpt-j-6b] - KeyError: 'transformer.h.9.ln_1.weight'
[2024-10-03T21:16:08Z] FAILED models/decoder_only/language/test_models.py::test_models[96-float-facebook/opt-125m] - RuntimeError: value cannot be converted to type at::Half without overflow

andoorve avatar Oct 03 '24 22:10 andoorve

Fixed errors 2 and 3; error 1 passed when I tested locally.

andoorve avatar Oct 04 '24 00:10 andoorve

Looks like the fix for Pixtral worked.

andoorve avatar Oct 04 '24 02:10 andoorve

@youkaichao @DarkLight1337 the tests now pass fully.

andoorve avatar Oct 04 '24 02:10 andoorve

Thanks for your hard work as well!

DarkLight1337 avatar Oct 04 '24 02:10 DarkLight1337

@andoorve @DarkLight1337 thanks for the great work!!!

youkaichao avatar Oct 04 '24 04:10 youkaichao

@youkaichao - Is this change now available in version 0.6.2? I have a requirement to load LLaMA 3.2 90B vision model across four GPUs spread across two nodes using pipeline parallel.

sekh77 avatar Oct 10 '24 21:10 sekh77

@youkaichao - Is this change now available in version 0.6.2? I have a requirement to load LLaMA 3.2 90B vision model across four GPUs spread across two nodes using pipeline parallel.

not yet, but you can always use the latest wheel from the main branch. see https://docs.vllm.ai/en/latest/getting_started/installation.html#install-the-latest-code
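
For context, here is a minimal sketch of requesting pipeline parallelism through the Python API once a wheel built from main is installed; the model name and parallel sizes are examples only, and note that Llama 3.2 Vision itself is still excluded, as discussed below:

```python
# Illustrative sketch only: pipeline parallelism via vLLM's offline Python API.
# Model name and parallel sizes are examples; a multi-node run also needs a Ray
# cluster spanning both machines, and online serving with
# `vllm serve ... --pipeline-parallel-size 2` is the equivalent path.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # a decoder-only model with PP support
    tensor_parallel_size=2,                     # GPUs per pipeline stage
    pipeline_parallel_size=2,                   # pipeline stages, e.g. one per node
    distributed_executor_backend="ray",         # needed when stages span nodes
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```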

youkaichao avatar Oct 10 '24 22:10 youkaichao

@youkaichao - Is this change now available in version 0.6.2? I have a requirement to load LLaMA 3.2 90B vision model across four GPUs spread across two nodes using pipeline parallel.

not yet, but you can always use the latest wheel from the main branch. see https://docs.vllm.ai/en/latest/getting_started/installation.html#install-the-latest-code

Can you please confirm if the latest wheel supports PP for llama 3.2 11B Instruct?

eByteTheDust avatar Oct 12 '24 15:10 eByteTheDust

@youkaichao - Is this change now available in version 0.6.2? I have a requirement to load LLaMA 3.2 90B vision model across four GPUs spread across two nodes using pipeline parallel.

not yet, but you can always use the latest wheel from the main branch. see https://docs.vllm.ai/en/latest/getting_started/installation.html#install-the-latest-code

Can you please confirm if the latest wheel supports PP for llama 3.2 11B Instruct?

Llama-3.2-Vision specifically isn't PP-supported because PP isn't supported in general for encoder-decoder models yet.

DarkLight1337 avatar Oct 12 '24 15:10 DarkLight1337

@DarkLight1337 - Is PP supported for the Databricks DBRX model (databricks/dbrx-instruct)?

sekh77 avatar Oct 13 '24 03:10 sekh77

Yes. Please check out the Supported Models page for more details.

DarkLight1337 avatar Oct 13 '24 03:10 DarkLight1337