meilisearch-rust icon indicating copy to clipboard operation
meilisearch-rust copied to clipboard

Support `embedders` setting and other vector/hybrid search related configuration

Open CommanderStorm opened this issue 1 year ago β€’ 19 comments

Pull Request

Related issue

Fixes https://github.com/meilisearch/meilisearch-rust/issues/541 Fixes https://github.com/meilisearch/meilisearch-rust/issues/612 Fixes https://github.com/meilisearch/meilisearch-rust/issues/621 Fixes https://github.com/meilisearch/meilisearch-rust/issues/646

What does this PR do?

  • Adds the required settings

    • with_embedders does use the same "API" (not using impl AsRef for items passed) as with_synonyms, as this is the closest existing
    • given set_embedders has not been implemented upstream (at least when I try to PATCH the object, it does not work)
    • only {get,reset}_embedders settings have been implemented. Said implementation goes with the work done in https://github.com/meilisearch/meilisearch-python/pull/924
  • adds the hybrid field to search via the vector search to add an end-to-end test of this feature with the huggingface configuration.

    userProvided seens more brittle, but we may want change to this instead using userProvided instead would mean (at the cost of hardcoding stuff) => lower cpu effort => no higher timeout necceeseary => aligning with meilisearch/meilisearch to only have a test for userProvided)

TODO:

  • [x] find a combination of semantic search model + configuration that does not fail the assumptions (see search testcase) spectacularly

PR checklist

Please check if your PR fulfills the following requirements:

  • [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • [x] Have you read the contributing guidelines?
  • [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

Summary by CodeRabbit

  • New Features

    • Added support for hybrid semantic search, allowing users to combine keyword and semantic search with customizable parameters.
    • Introduced the ability to provide custom embedding vectors in search queries and retrieve vectors in search results.
    • Added comprehensive configuration options for semantic search embedders, supporting multiple providers (HuggingFace, OpenAI, Ollama, REST, and user-provided).
    • Enabled management of embedders through new settings and API methods, including fetching, setting, and resetting embedder configurations.
  • Documentation

    • Added detailed usage examples and documentation for new semantic search and embedder configuration features.
  • Tests

    • Introduced new tests to verify hybrid search, vector retrieval, and embedder management functionalities.

CommanderStorm avatar Mar 02 '24 15:03 CommanderStorm

Hello @CommanderStorm

Can you rebase your branch? We made changes recently to improve the library

Sorry for the inconvenience, I try to review your PR as soon as possible after the rebase

curquiza avatar Apr 15 '24 17:04 curquiza

Can you rebase your branch?

Done ^^

CommanderStorm avatar Apr 16 '24 03:04 CommanderStorm

I don’t know if this is required, @curquiza. Could we simply expose the feature as-is instead of hiding it behind a feature flag (that will make it harder to test and use).

Yes, that's ok not to do feature flag 😊

curquiza avatar Apr 17 '24 07:04 curquiza

I did a few tests, and on my side it never took 120s, the quickest execution took 150s but most of the time I was over 200s

Time after being migrated to userProvided seems fine in CI (this is likely the slowest machine running the tests). βœ…

Or we could do that and actually send the payload to meilisearch.

I think keeping the testcases from requiring a active internet connection is better as otherwise the test might be unnessesarily flaky in CI.

Could you mock Meilisearch as we did here

I can add a testcase where I mock the routes. I am unsure if you actually would want to mock this for an experimental feature (= where the api might change => requiring changes)

I have tweaked a bunch with the vectors available and can't get the test_hybrid testcase to work without fully using the same dataset as in tests/search/hybrid.rs.

It seems to always return everything when I set semantic_ratio=1.0... Is this operating as intended?

I have tried similar 2D vectors as the upstream test and went as far as to use 1D vectors with considerable (1/10/1k/1m) spread (=>something that as far as I understand embeddings should not match)

=> I am missing something. πŸ˜… Could you point me into the right direction?

[!NOTE] I can obviously steal the testcase from tests/search/hybrid.rs, but for longer-term maintainability I would like this testcase to not be such an oddball compared to the existing testcases in this repo ^^

CommanderStorm avatar Apr 17 '24 16:04 CommanderStorm

@CommanderStorm really sorry for this. Can you fix the git conflicts? 😊

curquiza avatar Jul 01 '24 12:07 curquiza

Thanks for the pinging (github does not give me notifications for this) ^^ No need to worry. If this takes a few months that is fine.

CommanderStorm avatar Jul 01 '24 16:07 CommanderStorm

Hey @CommanderStorm, since we introduced a new rest embedder, I think we could write better tests by mocking the rest embedder and ensuring it works with meilisearch; here's some example I wrote earlier this week in meilisearch: https://github.com/meilisearch/meilisearch/pull/4755/commits/2141cb3b69f52ca6fcf18d0b89df1d8601a22f85

irevoire avatar Jul 03 '24 14:07 irevoire

The changes requested are implemented.

@curquiza could you please approve a workflow run (I think I did check everything, but you never know)

In retrospective, with_retrieve_vectors should/could have been a different PR. If you want, I can move those changes to a different PR. 600 lines is pretty unreviewable. On the other side, most of the lines changed are docs..

CommanderStorm avatar Jul 08 '24 13:07 CommanderStorm

I think I have adressed the requested changes ^^ Sorry for the long delay, I was on vacation and then a bit buisy with other work. Hope you understand.

AND we should absolutely remove these two tests

I have only removed the test which disables the vector store, but have kept the test which .set_vector_store(true)'s. I have also improved a test a bit to also set, but this should be minor ^^

Next time, I will split such PRs earlier ^^

CommanderStorm avatar Aug 24 '24 18:08 CommanderStorm

binaryQuantized will have to be in a different PR, as according to the docs this is only avaliable for updating via PATCH, which is currently not supported.

Docs for this feature: https://meilisearch.notion.site/Binary-quantization-usage-v1-11-2a9c9559461a4a9d9fa3e0ea5149ad68

CommanderStorm avatar Nov 17 '24 20:11 CommanderStorm

@irevoire sorry to ping you. I think it would be much more maintainable to do this work in different PRs.

Is merging this feature in the experimental stage still something you are interested in, or would this feature need to be stabilised first?

CommanderStorm avatar Nov 17 '24 20:11 CommanderStorm

Any update to this PR? It's very useful for AI search

Akagi201 avatar Apr 24 '25 11:04 Akagi201

From a features/tests/documentation POV nothing is blocking merging. I suspect that meilisearch noticed that the other integrations are much more used and is allocating reviewer capacity accordingly.

In terms of docs, my last sync with their docs/features was two months ago. If there is drift, feel free to provide a PR to this PR or copy my changes into a new PR. At this point, I have abandoned the idea of AI search due the increased latency (I don't run on fast hardware) being a noticeably worse experience for my use-case.

Since said functionality is stable now, I am leaving this open. If the team wants some change, I think I can do that too, but only if there is a hope of this leading to merging, which at the moment does not look promising.

You can use this via overriding what meilisearch means

- meilisearch-sdk = "1"
+ meilisearch-sdk = { git = "https://github.com/commanderstorm/meilisearch-rust.git", branch = "vector-search-embedder" }

CommanderStorm avatar Apr 24 '25 13:04 CommanderStorm

Hi,

I ran into this error when I try to create an index with settings provided in this PR.

Meilisearch: 1.14

Settings::new()
    .with_embedders(HashMap::from([(
        "default",
        Embedder::OpenAI(OpenAIEmbedderSettings {
            api_key: "...",
            model: Some("text-embedding-3-small".to_string()),
            dimensions: Some(1536),
            document_template_max_bytes: Some(6000),
            ..Default::default()
        }),
    )]))

Error:

Meilisearch invalid_request: unknown: Unknown value `openAI` at `.embedders.default.source`: did you mean 
`openAi`? expected one of `openAi`, `huggingFace`, `ollama`, `userProvided`, `rest`, `composite`. https://docs.me
ilisearch.com/errors#invalid_settings_embedders

awyl avatar May 15 '25 21:05 awyl

[!IMPORTANT]

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The changes introduce comprehensive support for AI-powered and hybrid semantic search in the SDK. This includes new settings and API methods for configuring multiple embedders with provider-specific options, extended search query capabilities for hybrid and vector-based search, and enhanced test coverage for these features. All updates are additive and maintain existing functionality.

Changes

File(s) Change Summary
src/search.rs Added support for hybrid semantic search in the search query API. Introduced HybridSearch struct and extended SearchQuery with hybrid, vector, and retrieve_vectors fields. Added builder methods for these fields. Enhanced tests for vector retrieval and hybrid search, including new vector-related types and test data.
src/settings.rs Introduced comprehensive embedder configuration support through a new Embedder enum and associated structs for HuggingFace, OpenAI, Ollama, REST, and user-provided embedders. Extended Settings with an embedders field and builder method. Added async methods on Index to get and reset embedders. Added integration tests for embedder management and detailed documentation.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant SDK
    participant Meilisearch

    Client->>SDK: Configure Settings with Embedders
    SDK->>Meilisearch: PATCH /indexes/:uid/settings (with embedders)
    Meilisearch-->>SDK: TaskInfo

    Client->>SDK: Build SearchQuery (with hybrid/vector/retrieve_vectors)
    SDK->>Meilisearch: POST /indexes/:uid/search (with hybrid and/or vector params)
    Meilisearch-->>SDK: Search results (optionally with vectors)
    SDK-->>Client: Search results

Assessment against linked issues

Objective Addressed Explanation
Support embedders setting in SDK, including builder and JSON structure (#541, #646) βœ…
Implement embedder variants and all subfields, including new/changed fields for REST, OpenAI, Ollama, etc. (#612, #646) βœ…
Add methods to get, set, and reset embedders via SDK methods (#646) βœ…
Add hybrid search support: hybrid param with embedder (mandatory), semanticRatio, vector, retrieveVectors (#621, #646) βœ…
Remove deprecated REST embedder fields, add new ones (request, response, headers), update tests (#612) βœ…

Poem

In the warren of search, where queries hop,
Embedders and vectors now leap to the top!
Hybrid and semantic, all settings in tow,
With Ollama and OpenAI, see results grow.
πŸ₯• The SDK’s garden is richer todayβ€”
Hooray for the code that leads the AI way!


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❀️ Share
πŸͺ§ Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

coderabbitai[bot] avatar May 19 '25 15:05 coderabbitai[bot]

@awyl could you retry please? Should be fixed now

CommanderStorm avatar May 19 '25 15:05 CommanderStorm

@awyl could you retry please? Should be fixed now

Thank you, it works perfectly.

awyl avatar May 19 '25 20:05 awyl

Is there anything remaining for this to be merged and released?

kylebrock avatar Jun 12 '25 18:06 kylebrock

The status I have written above is still accurate

From a features/tests/documentation POV nothing is blocking merging. I suspect that meilisearch noticed that the other integrations are much more used and is allocating reviewer capacity accordingly.

In terms of docs, my last sync with their docs/features was two months ago. If there is drift, feel free to provide a PR to this PR or copy my changes into a new PR. At this point, I have abandoned the idea of AI search due the increased latency (I don't run on fast hardware) being a noticeably worse experience for my use-case.

Since said functionality is stable now, I am leaving this open. If the team wants some change, I think I can do that too, but only if there is a hope of this leading to merging, which at the moment does not look promising.

You can use this via overriding what meilisearch means

- meilisearch-sdk = "1"
+ meilisearch-sdk = { git = "https://github.com/commanderstorm/meilisearch-rust.git", branch = "vector-search-embedder" }

CommanderStorm avatar Jun 13 '25 02:06 CommanderStorm

@CommanderStorm I'm going to fix it myself but for your information it was correct behavior that all test documents were considered with a distance of 0 on your 1-dimensional vectors. Meilisearch uses cosine distance, and since all your coordinates are positive, it's like they all point in the exact same direction and hence have perfect similarity

Mubelotix avatar Jul 10 '25 14:07 Mubelotix

That makes so much sense, thanks so much for taking this over πŸ™πŸ˜…

CommanderStorm avatar Jul 10 '25 14:07 CommanderStorm

That makes so much sense, thanks so much for taking this over πŸ™πŸ˜…

Thank you for creating the PR in the first place! Your changes should land on main shortly

Mubelotix avatar Jul 10 '25 15:07 Mubelotix