Roberta
This is a Draft PR based on PR https://github.com/vllm-project/vllm/pull/9056 to test Roberta embedding models.
python -m vllm.entrypoints.openai.api_server --model /path/to/roberta --served-model-name roberta
To test with the embeddings API:
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"input": "Your text string goes here",
"model": "roberta"
}'
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only trigger fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can do one of these:
- Add the ready label to the PR
- Enable auto-merge.
🚀
Model implementation looks good (though a bit hacky). Can you add tests for the model?
Also, remember to update the Supported Models page.
It would also be nice to consider adding the SequenceClassification variant of the model (to solve #8022)
Great! What's the ETA on this?
It would also be nice to consider adding the SequenceClassification variant of the model (to solve https://github.com/vllm-project/vllm/issues/8022)
@DarkLight1337 , this would be great yes. I was thinking that we could use your chat embedding API to format sentence pair separated by a separator token as input to sentence classification models. The only problem would be the token type tensor that also has to be passed as input. But maybe this would be outside of the scope of this issue. Maybe we can add this in another PR just to keep the scope of each PR small.
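For illustration, this is roughly the sentence-pair input a cross-encoder classifier expects (a sketch; bert-base-uncased is only a placeholder model):

from transformers import AutoTokenizer

# Sketch of the sentence-pair encoding discussed above; "bert-base-uncased" is
# just a placeholder model. BERT-style tokenizers emit token_type_ids that mark
# the two segments, which is the extra tensor mentioned above; RoBERTa-style
# tokenizers instead rely on separator tokens alone.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("first sentence", "second sentence", return_tensors="pt")
print(encoded["input_ids"])
print(encoded["token_type_ids"])  # 0 for the first segment, 1 for the second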
Model implementation looks good (though a bit hacky). Can you add tests for the model?
Sure, I'll add the tests. I don't disagree that this is a bit hacky. Should we make the Bert classes more generic so that we can pass the embedding layer class as a parameter?
That would be great. Another way would be to have an abstract _init_embeddings etc. so subclasses can decide how to initialize each submodule.
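Roughly this kind of hook pattern (a sketch; the class and method names are illustrative placeholders, not the actual vLLM code):

import torch.nn as nn

# Sketch of the hook pattern discussed above; the classes are stand-ins,
# not the actual vLLM implementation.

class BertLikeEmbedding(nn.Module):
    """Stand-in for a BERT-style embedding layer."""
    def __init__(self, vocab_size: int = 30522, hidden_size: int = 768):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)

class RobertaLikeEmbedding(BertLikeEmbedding):
    """Stand-in for a RoBERTa-style embedding layer (positions offset by padding_idx + 1)."""

class BertLikeModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Subclasses override this hook to choose their embedding implementation;
        # the rest of the model is shared.
        self.embeddings = self._init_embeddings()

    def _init_embeddings(self) -> nn.Module:
        return BertLikeEmbedding()

class RobertaLikeModel(BertLikeModel):
    def _init_embeddings(self) -> nn.Module:
        return RobertaLikeEmbedding()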
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @maxdebayser.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Please fix the formatting errors.
Also we have to merge from main to fix the CI failures.
@DarkLight1337, I realized that with Roberta models the position_ids start at padding_idx + 1 (see here and here)
I've added a line of code to increment all position_ids by that amount. Without it, the score I get on the STS12 task from the MTEB benchmark for intfloat/multilingual-e5-large is 0.53, which is way off. With the change I get 0.80, which is correct.
In my tests, all the position_ids in vLLM for the embedding use case start at 0 and end at the sequence length minus 1, and there are no padding tokens because the input tensors are 1-dimensional without padding. For example:
input_ids=tensor([ 0, 1284, 70, 1821, 6275, 111, 21455, 6, 4,
70, 21640, 31486, 111, 70, 15437, 509, 10, 63819,
6, 5, 2, 0, 2367, 83, 1286, 6, 4,
70, 16648, 111, 29700, 621, 959, 133888, 47, 137447,
2363, 102880, 111, 77021, 47, 16839, 27289, 6, 5,
2], device='cuda:0')
position_ids=tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24], device='cuda:0')
Is there a scenario in which padding tokens could be present? (Except for the case in which the user inserts a padding token in the input text.)
I've added a line of code to increment all position_ids by that amount. Without it, the score I get on the STS12 task from the MTEB benchmark for intfloat/multilingual-e5-large is 0.53, which is way off. With the change I get 0.80, which is correct.
Hmm, we may need to add a correctness test that compares against HF then.
Is there a scenario in which padding tokens could be present? (Except for the case in which the user inserts a padding token in the input text.)
Don't think so, since vLLM encodes each prompt separately. Just to be sure, you can add an assertion statement so we know if our assumption is false.
It would also be nice to consider adding the SequenceClassification variant of the model (to solve #8022)
Are you still considering adding this to the PR? If not, I could make an attempt.
@DarkLight1337 , I've added an assert on the position_ids.
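Roughly along these lines (a sketch, not the exact code in the PR; padding_idx is 1 in the standard RoBERTa config, and the function name and exact check are illustrative):

import torch

# Sketch only, not the exact code from the PR. RoBERTa reserves positions
# 0..padding_idx for padding, so real token positions start at padding_idx + 1
# (padding_idx is 1 in the standard RoBERTa config).
def adjust_roberta_positions(position_ids: torch.Tensor,
                             padding_idx: int = 1) -> torch.Tensor:
    # vLLM passes flattened, unpadded prompts whose positions restart at 0 for
    # every prompt (see the example tensors above); assert that assumption holds.
    prev_plus_one = torch.cat([position_ids.new_zeros(1), position_ids[:-1] + 1])
    assert bool(((position_ids == 0) | (position_ids == prev_plus_one)).all()), \
        "expected contiguous position_ids that restart at 0 for each prompt"
    return position_ids + padding_idx + 1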
Hmm, we may need to add a correctness test that compares against HF then.
Yes, but it would have to be sentence-transformers. In the transformers library, the pooled output is obtained by running the last hidden states through the pooler layer. But in sentence-transformers, this output is discarded and the hidden states are pooled using MEAN, CLS, etc., like we do, and then normalized. Would it be OK to add this dependency? If yes, can we do this in another PR?
The existing tests for text-only embedding models already use sentence-transformers, so it should be pretty straightforward to add this model to the list.
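A comparison could look roughly like this (a sketch; the tolerance is a placeholder, and vllm_embeds stands for the vectors returned by the /v1/embeddings endpoint shown at the top of this PR):

import numpy as np
from sentence_transformers import SentenceTransformer

# Sketch of a correctness check against sentence-transformers. The tolerance is
# a placeholder, and vllm_embeds stands for vectors obtained from the
# /v1/embeddings endpoint shown earlier.
texts = ["Your text string goes here"]
st_model = SentenceTransformer("intfloat/multilingual-e5-large")
st_embeds = st_model.encode(texts, normalize_embeddings=True)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# With normalized embeddings on both sides, cosine similarity should be close to 1.0:
# assert cosine(st_embeds[0], vllm_embeds[0]) > 0.99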
The discussion on SequenceClassification models continues here in this other PR: https://github.com/vllm-project/vllm/pull/10400