Roberta
This is a Draft PR based on PR https://github.com/vllm-project/vllm/pull/9056 to test Roberta embedding models.
python -m vllm.entrypoints.openai.api_server --model /path/to/roberta --served-model-name roberta
To test with the embeddings API:
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"input": "Your text string goes here",
"model": "roberta"
}'
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only trigger fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can do one of these:
- Add the ready label to the PR
- Enable auto-merge.
🚀
Model implementation looks good (though a bit hacky). Can you add tests for the model?
Also, remember to update the Supported Models page.
It would also be nice to consider adding the SequenceClassification variant of the model (to solve #8022)
Great! What's the ETA on this?
It would also be nice to consider adding the SequenceClassification variant of the model (to solve https://github.com/vllm-project/vllm/issues/8022)
@DarkLight1337 , this would be great yes. I was thinking that we could use your chat embedding API to format sentence pair separated by a separator token as input to sentence classification models. The only problem would be the token type tensor that also has to be passed as input. But maybe this would be outside of the scope of this issue. Maybe we can add this in another PR just to keep the scope of each PR small.
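For illustration, this is roughly the sentence-pair input a cross-encoder classifier expects (a sketch; bert-base-uncased is only a placeholder model):

from transformers import AutoTokenizer

# Sketch of the sentence-pair encoding discussed above; "bert-base-uncased" is
# just a placeholder model. BERT-style tokenizers emit token_type_ids that mark
# the two segments, which is the extra tensor mentioned above; RoBERTa-style
# tokenizers instead rely on separator tokens alone.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("first sentence", "second sentence", return_tensors="pt")
print(encoded["input_ids"])
print(encoded["token_type_ids"])  # 0 for the first segment, 1 for the second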
Model implementation looks good (though a bit hacky). Can you add tests for the model?
Sure, I'll add the tests. I don't disagree that this is a bit hacky. Should we make the Bert classes more generic so that we can pass the embedding layer class as a parameter?
That would be great. Another way would be to have an abstract _init_embeddings etc. so subclasses can decide how to initialize each submodule.
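Roughly this kind of hook pattern (a sketch; the class and method names are illustrative placeholders, not the actual vLLM code):

import torch.nn as nn

# Sketch of the hook pattern discussed above; the classes are stand-ins,
# not the actual vLLM implementation.

class BertLikeEmbedding(nn.Module):
    """Stand-in for a BERT-style embedding layer."""
    def __init__(self, vocab_size: int = 30522, hidden_size: int = 768):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)

class RobertaLikeEmbedding(BertLikeEmbedding):
    """Stand-in for a RoBERTa-style embedding layer (positions offset by padding_idx + 1)."""

class BertLikeModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Subclasses override this hook to choose their embedding implementation;
        # the rest of the model is shared.
        self.embeddings = self._init_embeddings()

    def _init_embeddings(self) -> nn.Module:
        return BertLikeEmbedding()

class RobertaLikeModel(BertLikeModel):
    def _init_embeddings(self) -> nn.Module:
        return RobertaLikeEmbedding()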
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @maxdebayser.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Please fix the formatting errors.
Also we have to merge from main to fix the CI failures.
@DarkLight1337, I realized that with Roberta models the position_ids start at padding_idx + 1 (see here and here)
I've added a line of code to increment all position_ids by that amount. Without it, the score I get on the STS12 task from the MTEB benchmark for intfloat/multilingual-e5-large is 0.53, which is way off. With the change I get 0.80, which is correct.
In my tests, all the position_ids in vLLM for the embedding use case start at 0 and end at the sequence length minus 1, and there are no padding tokens because the input tensors are 1-dimensional without padding. For example:
input_ids=tensor([ 0, 1284, 70, 1821, 6275, 111, 21455, 6, 4,
70, 21640, 31486, 111, 70, 15437, 509, 10, 63819,
6, 5, 2, 0, 2367, 83, 1286, 6, 4,
70, 16648, 111, 29700, 621, 959, 133888, 47, 137447,
2363, 102880, 111, 77021, 47, 16839, 27289, 6, 5,
2], device='cuda:0')
position_ids=tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24], device='cuda:0')
Is there a scenario in which padding tokens could be present? (Except for the case in which the user inserts a padding token in the input text.)
I've added a line of code to increment all position_ids by that amount. Without it, the score I get on the STS12 task from the MTEB benchmark for intfloat/multilingual-e5-large is 0.53, which is way off. With the change I get 0.80, which is correct.
Hmm, we may need to add a correctness test that compares against HF then.
Is there a scenario in which padding tokens could be present? (Except for the case in which the user inserts a padding token in the input text.)
Don't think so, since vLLM encodes each prompt separately. Just to be sure, you can add an assertion statement so we know if our assumption is false.
It would also be nice to consider adding the SequenceClassification variant of the model (to solve #8022)
Are you still considering adding this to the PR? If not, I could make an attempt.
@DarkLight1337 , I've added an assert on the position_ids.
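Roughly along these lines (a sketch, not the exact code in the PR; padding_idx is 1 in the standard RoBERTa config, and the function name and exact check are illustrative):

import torch

# Sketch only, not the exact code from the PR. RoBERTa reserves positions
# 0..padding_idx for padding, so real token positions start at padding_idx + 1
# (padding_idx is 1 in the standard RoBERTa config).
def adjust_roberta_positions(position_ids: torch.Tensor,
                             padding_idx: int = 1) -> torch.Tensor:
    # vLLM passes flattened, unpadded prompts whose positions restart at 0 for
    # every prompt (see the example tensors above); assert that assumption holds.
    prev_plus_one = torch.cat([position_ids.new_zeros(1), position_ids[:-1] + 1])
    assert bool(((position_ids == 0) | (position_ids == prev_plus_one)).all()), \
        "expected contiguous position_ids that restart at 0 for each prompt"
    return position_ids + padding_idx + 1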
Hmm, we may need to add a correctness test that compares against HF then.
Yes, but it would have to be sentence-transformers. In the transformers library, the pooled output is obtained by running the last hidden states through the pooler layer. But in sentence-transformers, this output is discarded and the hidden states are pooled using MEAN, CLS, etc., like we do, and then normalized. Would it be OK to add this dependency? If yes, can we do this in another PR?
The existing tests for text-only embedding models already use sentence-transformers, so it should be pretty straightforward to add this model to the list.
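A comparison could look roughly like this (a sketch; the tolerance is a placeholder, and vllm_embeds stands for the vectors returned by the /v1/embeddings endpoint shown at the top of this PR):

import numpy as np
from sentence_transformers import SentenceTransformer

# Sketch of a correctness check against sentence-transformers. The tolerance is
# a placeholder, and vllm_embeds stands for vectors obtained from the
# /v1/embeddings endpoint shown earlier.
texts = ["Your text string goes here"]
st_model = SentenceTransformer("intfloat/multilingual-e5-large")
st_embeds = st_model.encode(texts, normalize_embeddings=True)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# With normalized embeddings on both sides, cosine similarity should be close to 1.0:
# assert cosine(st_embeds[0], vllm_embeds[0]) > 0.99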
The discussion on SequenceClassification models continues here in this other PR: https://github.com/vllm-project/vllm/pull/10400