[CI] Add mteb testing to test the accuracy of the embedding model
Summary:
- Add mteb STS12 testing to test the accuracy of the embedding model (tests/entrypoints/openai/correctness/test_mteb.py)
- Test snowflake_arctic_embed using mteb STS12 (tests/models/embedding/language/test_snowflake_arctic_embed.py)
- Snowflake/snowflake-arctic-embed-m-long
If seqlen > max_trained_positions is detected during inference, the model automatically switches from NomicBertRotaryEmbedding to NomicBertDynamicNTKRotaryEmbedding.
https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long/blob/main/modeling_hf_nomic_bert.py#L639
https://huggingface.co/nomic-ai/nomic-bert-2048/blob/main/modeling_hf_nomic_bert.py#L1413
```python
if seqlen > self.max_position_embeddings:
    base = self.base * (
        (self.rotary_scaling_factor * seqlen / self.max_position_embeddings)
        - (self.rotary_scaling_factor - 1)
    ) ** (self.dim / (self.dim - 2))
    inv_freq = self._compute_inv_freq(base=base, device=device)
    self.register_buffer("inv_freq", inv_freq, persistent=False)
```
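For a sense of scale, here is a small worked example of that rescaling with illustrative values (base, dim, rotary_scaling_factor, and seqlen are made up; the real values come from the checkpoint config):

```python
# Illustrative values only; not taken from the actual model config.
base = 10000.0
dim = 64
rotary_scaling_factor = 2.0
max_position_embeddings = 2048  # max_trained_positions
seqlen = 4096

# Same formula as the snippet above: the RoPE base is rescaled once the
# sequence length exceeds the trained context window.
scaled_base = base * (
    (rotary_scaling_factor * seqlen / max_position_embeddings)
    - (rotary_scaling_factor - 1)
) ** (dim / (dim - 2))
print(scaled_base)  # ~31083, i.e. the base roughly triples past 2048 tokens
```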
This implicit switch might lead to hard-to-detect bugs.
We ignore config.rotary_scaling_factor so that, for datasets shorter than max_trained_positions (2048), the results stay consistent with SentenceTransformer.
Context extension instead uses vLLM-style rope_theta and rope_scaling.
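For context, a minimal sketch of what vLLM-style rope overrides might look like when loading the model; the hf_overrides values here (rope_theta, rope_scaling keys, and factor) are purely illustrative and not the settings used in this PR:

```python
from vllm import LLM

# Hypothetical sketch: the task name, override keys, and values are
# assumptions and may differ between vLLM versions and checkpoints.
llm = LLM(
    model="Snowflake/snowflake-arctic-embed-m-long",
    task="embed",
    hf_overrides={
        "rope_theta": 10000,                                      # illustrative
        "rope_scaling": {"rope_type": "dynamic", "factor": 2.0},  # illustrative
    },
)
outputs = llm.embed(["example sentence for context extension"])
```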
Task selection
Although mteb v2 significantly speeds up the testing process, it still requires several hours to complete all tests.
Here we choose a small test task from mteb: STS12.
- Running time on a 4090 is approximately 26.95s.
- The score on this test set is strongly numerically stable (<1e-4) with respect to minor variations in the model implementation and tensor data types.
- The score differences between different models are quite noticeable (>1e-3).
This makes mteb STS12 a great embedding model test set.
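For reference, a minimal sketch of scoring a baseline on STS12 with the mteb library (assuming the mteb 1.x API; the CI test wraps vLLM embeddings in a similar way):

```python
import mteb
from sentence_transformers import SentenceTransformer

# Score a SentenceTransformer baseline on STS12; the result layout
# (scores["test"][0]["main_score"]) assumes the mteb 1.x API.
model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m")
tasks = mteb.get_tasks(tasks=["STS12"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, verbosity=0, output_folder=None)
print(results[0].scores["test"][0]["main_score"])
```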
Numerical stability
float16:
- vLLM main score: 0.787317856459396
- SentenceTransformer main score: 0.7873427091972599
- Difference: 2.4852737863900742e-05

bfloat16:
- vLLM main score: 0.7873663350672234
- SentenceTransformer main score: 0.7873427091972599
- Difference: -2.3625869963517232e-05

The difference is very subtle (<1e-4), at least on this test set.
Ten rounds:

| dtype | Difference | std |
|---|---|---|
| float32 | 2.2034500413159464e-07 | 2.2674275951409703e-06 |
| float16 | -1.2960828871366736e-05 | 6.329514177900761e-06 |
| bfloat16 | -0.0001093438180336248 | 3.559915712887334e-05 |
The results of ten iterations suggest that converting float32 weights to float16 yields better results than bfloat16 (vLLM defaults to converting float32 to float16).
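Judging by the single-run numbers above, the Difference and std columns here and below appear to be the mean and standard deviation of (st_main_score - vLLM main score) over the repeated runs. A hypothetical helper to compute them:

```python
import statistics

def summarize(vllm_scores: list[float], st_main_score: float) -> tuple[float, float]:
    """Mean and standard deviation of (st_main_score - vllm_score) over runs."""
    diffs = [st_main_score - s for s in vllm_scores]
    return statistics.mean(diffs), statistics.stdev(diffs)

# Made-up scores from three runs, for illustration only.
mean_diff, std_diff = summarize([0.787310, 0.787325, 0.787331], 0.7873427091972599)
```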
More about numerical stability
Most models exhibit excellent numerical stability:
| model name | st_main_score | Difference | std |
|---|---|---|---|
| BAAI/bge-m3 | 0.7873424632849964 | -4.014647728589615e-06 | 1.266416587059263e-05 |
| BAAI/bge-base-en-v1.5 | 0.7802846624612514 | 1.1294266950234721e-05 | 6.865350381034025e-06 |
| Snowflake/snowflake-arctic-embed-xs | 0.7149276890717804 | 1.5002530987628937e-05 | 5.132361246049283e-06 |
| Snowflake/snowflake-arctic-embed-s | 0.7408120447186094 | 1.2957674633273797e-05 | 5.364178900440517e-06 |
| Snowflake/snowflake-arctic-embed-m | 0.6467522411844727 | -3.727433978584216e-06 | 8.904071772230203e-06 |
| Snowflake/snowflake-arctic-embed-l | 0.6362746289758823 | 9.515755331335196e-05 | 2.023830079795977e-05 |
| Snowflake/snowflake-arctic-embed-m-long | 0.6811445157066163 | 3.396798716037708e-05 | 1.224356222837439e-05 |
| Snowflake/snowflake-arctic-embed-m-v1.5 | 0.6490882209298032 | 1.8871733633019083e-05 | 6.591107037250243e-06 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.7122583106737259 | 1.074976228776503e-05 | 1.3400689624215418e-05 |
| Snowflake/snowflake-arctic-embed-m-v2.0 | 0.7066229164460937 | 1.5418442692483048e-05 | 9.792523972420118e-06 |
| Alibaba-NLP/gte-Qwen2-1.5B-instruct | 0.7280529229028553 | 5.124313459714536e-05 | 1.6385524234026275e-05 |
Slightly numerically unstable models
- intfloat/multilingual-e5-small shows a significant drop when using fp16, so fp32 needs to be used (see the fp32 sketch after this list).

  | precision | st_main_score | Difference | std |
  |---|---|---|---|
  | fp16 | 0.7805425596252846 | -0.2749311085815237 | 0.006216913108536066 |
  | fp32 | 0.7805425596252846 | -1.6403316041024851e-06 | 7.53539269543218e-06 |

- intfloat/multilingual-e5-large-instruct shows a significant drop when using fp16, so fp32 needs to be used.

  | configuration | st_main_score | Difference | std |
  |---|---|---|---|
  | pooling_type="MEAN" + fp16 (default) | 0.8224491209469045 | -0.28623335791513993 | 0.007169234312147499 |
  | pooling_type="MEAN" + fp32 | 0.8224491209469045 | -2.3497119421289625e-06 | 7.898194995699927e-06 |

- jinaai/jina-embeddings-v3 shows a slight drop when using fp16.

  | precision | st_main_score | Difference | std |
  |---|---|---|---|
  | fp16 | 0.7834129787836271 | -0.0709833671361465 | 0.004834963031278825 |
  | fp32 | 0.8243646209061513 | -3.119267999662778e-05 | 6.651161140301139e-06 |
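Where fp32 is needed, a minimal sketch of forcing it in vLLM (the task name and the embed() convenience method are assumptions about the public API and may differ slightly between versions):

```python
from vllm import LLM

# Hypothetical sketch: run a numerically sensitive embedding model in fp32
# instead of the default float16 downcast.
llm = LLM(
    model="intfloat/multilingual-e5-small",
    task="embed",
    dtype="float32",
)
outputs = llm.embed(["query: how much protein should a female eat"])
```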
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.
🚀
Thanks for adding this, can you fix pre-commit?
@DarkLight1337
Eventually, delete benchmarks/eval/test_mteb.py or change it into a more general testing script.
cc @mgoin @comaniac do you think we can incorporate this eval script into our existing scripts? Or would it be better to keep them separate?
Hmm I'm not sure we want to have benchmark/evals. For correctness checking in the CI, we should be able to just test 2-3 cases to keep the stability.
@DarkLight1337
I tested more models, and most of them showed strong numerical stability ( <1e-4 ), even better than I imagined.
The score differences between different models are also quite noticeable ( >1e-3).
This makes mteb STS12 a great embedding model test set.
Thanks for your patience!
Thanks for reviewing.
By carefully studying the code, I indeed learned a lot of weird stuff.
@DarkLight1337
What should I do to make the merge go more smoothly?
The failing doc build seems related to this PR, maybe because you changed the dependencies.
Need to manually merge #16859 first.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @noooop.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Doc build still fails
It seems that awscli isn't used anywhere in vLLM. Let me open another PR to remove it first
Opened #17532
Is pulling a model from s3 dependent on awscli?
I think the issue is not with awscli; I need to set up a local environment for the docs and check it out.
Is pulling a model from s3 dependent on awscli?
No, we use boto3 for that
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @noooop.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Mainly, mteb depends on "datasets>=2.19.0,<3.0.0", which is too old.
To avoid dependency conflicts, maybe you can just pip install mteb for the entrypoints tests only.
wtf
```text
Uninstalling mteb-1.38.3:
  Would remove:
    /share/anaconda3/bin/mteb
    /share/anaconda3/lib/python3.12/site-packages/docs/*
    /share/anaconda3/lib/python3.12/site-packages/mteb-1.38.3.dist-info/*
    /share/anaconda3/lib/python3.12/site-packages/mteb/*
```
mteb installed a top-level Python package named docs, which caused:
```text
Traceback
=========
  File "/share/PycharmProjects/noooop_vllm/docs/source/conf.py", line 137, in setup
    from docs.source.generate_examples import generate_examples
ModuleNotFoundError: No module named 'docs.source'
```
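A quick way to check where a stray top-level docs package would be imported from (standard library only, nothing vLLM-specific is assumed):

```python
import importlib.util

# If the spec points into site-packages rather than the repository checkout,
# the "docs" package installed alongside mteb is shadowing the local docs/
# directory used by the Sphinx build.
print(importlib.util.find_spec("docs"))
```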
Wouldn't this also remove the "correct" docs installation?
Normally there shouldn't be a package called docs; otherwise it would cause many problems.
I first tried to see if it could pass the tests. Very sorry for using a lot of CI resources.
Let's try the CI again
Let's try the CI again
QvQ
Need to install mteb for model tests as well.
@DarkLight1337
https://buildkite.com/vllm/ci/builds/19198
Last check: are the errors for the V1 Test and Speculative decoding tests unrelated to this PR?
Yeah they are unrelated
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @noooop.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork