Support `embedders` setting and other vector/hybrid search related configuration
Pull Request
Related issue
Fixes https://github.com/meilisearch/meilisearch-rust/issues/541 Fixes https://github.com/meilisearch/meilisearch-rust/issues/612 Fixes https://github.com/meilisearch/meilisearch-rust/issues/621 Fixes https://github.com/meilisearch/meilisearch-rust/issues/646
What does this PR do?
-
Adds the required settings
with_embeddersdoes use the same "API" (not usingimpl AsReffor items passed) aswith_synonyms, as this is the closest existing- given
set_embeddershas not been implemented upstream (at least when I try toPATCHthe object, it does not work) - only
{get,reset}_embedderssettings have been implemented. Said implementation goes with the work done in https://github.com/meilisearch/meilisearch-python/pull/924
-
adds the
hybridfield to search via the vector search to add an end-to-end test of this feature with thehuggingfaceconfiguration.userProvidedseens more brittle, but we may want change to this instead usinguserProvidedinstead would mean (at the cost of hardcoding stuff) => lower cpu effort => no higher timeout necceeseary => aligning withmeilisearch/meilisearchto only have a test foruserProvided)
TODO:
- [x] find a combination of semantic search model + configuration that does not fail the assumptions (see search testcase) spectacularly
PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?
Thank you so much for contributing to Meilisearch!
Summary by CodeRabbit
-
New Features
- Added support for hybrid semantic search, allowing users to combine keyword and semantic search with customizable parameters.
- Introduced the ability to provide custom embedding vectors in search queries and retrieve vectors in search results.
- Added comprehensive configuration options for semantic search embedders, supporting multiple providers (HuggingFace, OpenAI, Ollama, REST, and user-provided).
- Enabled management of embedders through new settings and API methods, including fetching, setting, and resetting embedder configurations.
-
Documentation
- Added detailed usage examples and documentation for new semantic search and embedder configuration features.
-
Tests
- Introduced new tests to verify hybrid search, vector retrieval, and embedder management functionalities.
Hello @CommanderStorm
Can you rebase your branch? We made changes recently to improve the library
Sorry for the inconvenience, I try to review your PR as soon as possible after the rebase
Can you rebase your branch?
Done ^^
I donβt know if this is required, @curquiza. Could we simply expose the feature as-is instead of hiding it behind a feature flag (that will make it harder to test and use).
Yes, that's ok not to do feature flag π
I did a few tests, and on my side it never took 120s, the quickest execution took 150s but most of the time I was over 200s
Time after being migrated to userProvided seems fine in CI (this is likely the slowest machine running the tests). β
Or we could do that and actually send the payload to meilisearch.
I think keeping the testcases from requiring a active internet connection is better as otherwise the test might be unnessesarily flaky in CI.
Could you mock Meilisearch as we did here
I can add a testcase where I mock the routes. I am unsure if you actually would want to mock this for an experimental feature (= where the api might change => requiring changes)
I have tweaked a bunch with the vectors available and can't get the test_hybrid testcase to work without fully using the same dataset as in tests/search/hybrid.rs.
It seems to always return everything when I set semantic_ratio=1.0...
Is this operating as intended?
I have tried similar 2D vectors as the upstream test and went as far as to use 1D vectors with considerable (1/10/1k/1m) spread (=>something that as far as I understand embeddings should not match)
=> I am missing something. π Could you point me into the right direction?
[!NOTE] I can obviously steal the testcase from
tests/search/hybrid.rs, but for longer-term maintainability I would like this testcase to not be such an oddball compared to the existing testcases in this repo ^^
@CommanderStorm really sorry for this. Can you fix the git conflicts? π
Thanks for the pinging (github does not give me notifications for this) ^^ No need to worry. If this takes a few months that is fine.
Hey @CommanderStorm, since we introduced a new rest embedder, I think we could write better tests by mocking the rest embedder and ensuring it works with meilisearch; here's some example I wrote earlier this week in meilisearch: https://github.com/meilisearch/meilisearch/pull/4755/commits/2141cb3b69f52ca6fcf18d0b89df1d8601a22f85
The changes requested are implemented.
@curquiza could you please approve a workflow run (I think I did check everything, but you never know)
In retrospective, with_retrieve_vectors should/could have been a different PR. If you want, I can move those changes to a different PR. 600 lines is pretty unreviewable. On the other side, most of the lines changed are docs..
I think I have adressed the requested changes ^^ Sorry for the long delay, I was on vacation and then a bit buisy with other work. Hope you understand.
AND we should absolutely remove these two tests
I have only removed the test which disables the vector store, but have kept the test which .set_vector_store(true)'s.
I have also improved a test a bit to also set, but this should be minor ^^
Next time, I will split such PRs earlier ^^
binaryQuantized will have to be in a different PR, as according to the docs this is only avaliable for updating via PATCH, which is currently not supported.
Docs for this feature: https://meilisearch.notion.site/Binary-quantization-usage-v1-11-2a9c9559461a4a9d9fa3e0ea5149ad68
@irevoire sorry to ping you. I think it would be much more maintainable to do this work in different PRs.
Is merging this feature in the experimental stage still something you are interested in, or would this feature need to be stabilised first?
Any update to this PR? It's very useful for AI search
From a features/tests/documentation POV nothing is blocking merging. I suspect that meilisearch noticed that the other integrations are much more used and is allocating reviewer capacity accordingly.
In terms of docs, my last sync with their docs/features was two months ago. If there is drift, feel free to provide a PR to this PR or copy my changes into a new PR. At this point, I have abandoned the idea of AI search due the increased latency (I don't run on fast hardware) being a noticeably worse experience for my use-case.
Since said functionality is stable now, I am leaving this open. If the team wants some change, I think I can do that too, but only if there is a hope of this leading to merging, which at the moment does not look promising.
You can use this via overriding what meilisearch means
- meilisearch-sdk = "1"
+ meilisearch-sdk = { git = "https://github.com/commanderstorm/meilisearch-rust.git", branch = "vector-search-embedder" }
Hi,
I ran into this error when I try to create an index with settings provided in this PR.
Meilisearch: 1.14
Settings::new()
.with_embedders(HashMap::from([(
"default",
Embedder::OpenAI(OpenAIEmbedderSettings {
api_key: "...",
model: Some("text-embedding-3-small".to_string()),
dimensions: Some(1536),
document_template_max_bytes: Some(6000),
..Default::default()
}),
)]))
Error:
Meilisearch invalid_request: unknown: Unknown value `openAI` at `.embedders.default.source`: did you mean
`openAi`? expected one of `openAi`, `huggingFace`, `ollama`, `userProvided`, `rest`, `composite`. https://docs.me
ilisearch.com/errors#invalid_settings_embedders
[!IMPORTANT]
Review skipped
Auto reviews are disabled on base/target branches other than the default branch.
Please check the settings in the CodeRabbit UI or the
.coderabbit.yamlfile in this repository. To trigger a single review, invoke the@coderabbitai reviewcommand.You can disable this status message by setting the
reviews.review_statustofalsein the CodeRabbit configuration file.
Walkthrough
The changes introduce comprehensive support for AI-powered and hybrid semantic search in the SDK. This includes new settings and API methods for configuring multiple embedders with provider-specific options, extended search query capabilities for hybrid and vector-based search, and enhanced test coverage for these features. All updates are additive and maintain existing functionality.
Changes
| File(s) | Change Summary |
|---|---|
src/search.rs |
Added support for hybrid semantic search in the search query API. Introduced HybridSearch struct and extended SearchQuery with hybrid, vector, and retrieve_vectors fields. Added builder methods for these fields. Enhanced tests for vector retrieval and hybrid search, including new vector-related types and test data. |
src/settings.rs |
Introduced comprehensive embedder configuration support through a new Embedder enum and associated structs for HuggingFace, OpenAI, Ollama, REST, and user-provided embedders. Extended Settings with an embedders field and builder method. Added async methods on Index to get and reset embedders. Added integration tests for embedder management and detailed documentation. |
Sequence Diagram(s)
sequenceDiagram
participant Client
participant SDK
participant Meilisearch
Client->>SDK: Configure Settings with Embedders
SDK->>Meilisearch: PATCH /indexes/:uid/settings (with embedders)
Meilisearch-->>SDK: TaskInfo
Client->>SDK: Build SearchQuery (with hybrid/vector/retrieve_vectors)
SDK->>Meilisearch: POST /indexes/:uid/search (with hybrid and/or vector params)
Meilisearch-->>SDK: Search results (optionally with vectors)
SDK-->>Client: Search results
Assessment against linked issues
| Objective | Addressed | Explanation |
|---|---|---|
Support embedders setting in SDK, including builder and JSON structure (#541, #646) |
β | |
| Implement embedder variants and all subfields, including new/changed fields for REST, OpenAI, Ollama, etc. (#612, #646) | β | |
| Add methods to get, set, and reset embedders via SDK methods (#646) | β | |
Add hybrid search support: hybrid param with embedder (mandatory), semanticRatio, vector, retrieveVectors (#621, #646) |
β | |
Remove deprecated REST embedder fields, add new ones (request, response, headers), update tests (#612) |
β |
Poem
In the warren of search, where queries hop,
Embedders and vectors now leap to the top!
Hybrid and semantic, all settings in tow,
With Ollama and OpenAI, see results grow.
π₯ The SDKβs garden is richer todayβ
Hooray for the code that leads the AI way!
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
πͺ§ Tips
Chat
There are 3 ways to chat with CodeRabbit:
- Review comments: Directly reply to a review comment made by CodeRabbit. Example:
I pushed a fix in commit <commit_id>, please review it.Explain this complex logic.Open a follow-up GitHub issue for this discussion.
- Files and specific lines of code (under the "Files changed" tab): Tag
@coderabbitaiin a new review comment at the desired location with your query. Examples:@coderabbitai explain this code block.@coderabbitai modularize this function.
- PR comments: Tag
@coderabbitaiin a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:@coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.@coderabbitai read src/utils.ts and explain its main purpose.@coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.@coderabbitai help me debug CodeRabbit configuration file.
Support
Need help? Create a ticket on our support page for assistance with any issues or questions.
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.
CodeRabbit Commands (Invoked using PR comments)
@coderabbitai pauseto pause the reviews on a PR.@coderabbitai resumeto resume the paused reviews.@coderabbitai reviewto trigger an incremental review. This is useful when automatic reviews are disabled for the repository.@coderabbitai full reviewto do a full review from scratch and review all the files again.@coderabbitai summaryto regenerate the summary of the PR.@coderabbitai generate docstringsto generate docstrings for this PR.@coderabbitai generate sequence diagramto generate a sequence diagram of the changes in this PR.@coderabbitai resolveresolve all the CodeRabbit review comments.@coderabbitai configurationto show the current CodeRabbit configuration for the repository.@coderabbitai helpto get help.
Other keywords and placeholders
- Add
@coderabbitai ignoreanywhere in the PR description to prevent this PR from being reviewed. - Add
@coderabbitai summaryto generate the high-level summary at a specific location in the PR description. - Add
@coderabbitaianywhere in the PR title to generate the title automatically.
CodeRabbit Configuration File (.coderabbit.yaml)
- You can programmatically configure CodeRabbit by adding a
.coderabbit.yamlfile to the root of your repository. - Please see the configuration documentation for more information.
- If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation:
# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
Documentation and Community
- Visit our Documentation for detailed information on how to use CodeRabbit.
- Join our Discord Community to get help, request features, and share feedback.
- Follow us on X/Twitter for updates and announcements.
@awyl could you retry please? Should be fixed now
@awyl could you retry please? Should be fixed now
Thank you, it works perfectly.
Is there anything remaining for this to be merged and released?
The status I have written above is still accurate
From a features/tests/documentation POV nothing is blocking merging. I suspect that meilisearch noticed that the other integrations are much more used and is allocating reviewer capacity accordingly.
In terms of docs, my last sync with their docs/features was two months ago. If there is drift, feel free to provide a PR to this PR or copy my changes into a new PR. At this point, I have abandoned the idea of AI search due the increased latency (I don't run on fast hardware) being a noticeably worse experience for my use-case.
Since said functionality is stable now, I am leaving this open. If the team wants some change, I think I can do that too, but only if there is a hope of this leading to merging, which at the moment does not look promising.
You can use this via overriding what meilisearch means
- meilisearch-sdk = "1" + meilisearch-sdk = { git = "https://github.com/commanderstorm/meilisearch-rust.git", branch = "vector-search-embedder" }
@CommanderStorm I'm going to fix it myself but for your information it was correct behavior that all test documents were considered with a distance of 0 on your 1-dimensional vectors. Meilisearch uses cosine distance, and since all your coordinates are positive, it's like they all point in the exact same direction and hence have perfect similarity
That makes so much sense, thanks so much for taking this over ππ
That makes so much sense, thanks so much for taking this over ππ
Thank you for creating the PR in the first place! Your changes should land on main shortly