Consider improving the document template in the "Getting started with AI search" guide

Open dureuill opened this issue 7 months ago • 11 comments

The document template in this guide only uses the name from the document, which may result in subpar relevancy.

Furthermore, the link to the dataset appears dead from here: https://www.meilisearch.com/docs/assets/datasets/kitchenware.json

  1. Consider linking the document template best practices from the section that creates the template.
  2. If the dataset allows, consider including parts of the description or other semantically relevant fields for the kitchen material in the document template
  3. Should (2) prove difficult, then consider mentioning that this is a simple template for a sample dataset, and instruct to follow the best practices to optimize relevancy for production data
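
As a rough illustration of suggestion (2), an embedder configuration with a richer template might look like the sketch below. The field names `description` and `material`, the embedder name `kitchenware-openai`, and the model choice are assumptions made for illustration; the dataset's actual fields may differ. The `if` guards are there so that a missing field does not leave an empty placeholder in the rendered text.

```python
import json

# Hypothetical fields: `description` and `material` are assumed for this sketch
# and are not confirmed to exist in the kitchenware dataset.
DOCUMENT_TEMPLATE = (
    "A kitchenware product named '{{ doc.name }}'"
    "{% if doc.material %}, made of {{ doc.material }}{% endif %}"
    "{% if doc.description %}. Description: "
    "{{ doc.description | truncatewords: 30 }}{% endif %}"
)


def build_embedder_settings(api_key: str) -> dict:
    """Build a settings payload intended for PATCH /indexes/{index_uid}/settings."""
    return {
        "embedders": {
            # Embedder name is arbitrary; "kitchenware-openai" is a placeholder.
            "kitchenware-openai": {
                "source": "openAi",
                "model": "text-embedding-3-small",
                "apiKey": api_key,
                "documentTemplate": DOCUMENT_TEMPLATE,
                # Caps the rendered template size so long descriptions stay bounded.
                "documentTemplateMaxBytes": 400,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_embedder_settings("OPENAI_API_KEY"), indent=2))
```

If the dataset turns out to have no useful fields beyond `name`, the same payload with the simple template plus a note pointing to the best practices page would cover suggestion (3).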

Not directly relevant to the document template, but mentioning somewhere in that guide that a rankingScoreThreshold should probably be used with semantic search would also help as it is a common pitfall for users.
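
For reference, a search request that combines semantic search with a score threshold could be sketched as follows. A pure semantic query tends to return its nearest neighbours even when nothing is really relevant, which is why the threshold helps. The value `0.3` and the embedder name are placeholders, not recommendations; a suitable threshold depends on the embedder and the data.

```python
import json


def build_semantic_search(query: str, embedder: str, threshold: float = 0.3) -> dict:
    """Build a body for POST /indexes/{index_uid}/search with a score cutoff."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("rankingScoreThreshold must be between 0.0 and 1.0")
    return {
        "q": query,
        # semanticRatio 1.0 means pure semantic search; lower values mix in keyword search.
        "hybrid": {"embedder": embedder, "semanticRatio": 1.0},
        # Hits scoring below this value are dropped from the results.
        "rankingScoreThreshold": threshold,
        # Returning the score makes it easier to tune the threshold.
        "showRankingScore": True,
    }


if __name__ == "__main__":
    body = build_semantic_search("non-stick pan", "kitchenware-openai")
    print(json.dumps(body, indent=2))
```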

Thank you for your consideration 😃

Context: https://github.com/meilisearch/meilisearch/issues/5608#issuecomment-2938951373

dureuill avatar Jun 04 '25 07:06 dureuill

https://www.meilisearch.com/docs/learn/ai_powered_search/document_template_best_practices#only-include-highly-relevant-information

I think the filters section needs more explanation. I would say genre and release date are highly relevant, and not simply good for filters.

So maybe add an example that includes some filters and other fields. If this were a news story, I might want to add that it's an opinion piece rather than hard news, and publish_date is extremely important.

The Liquid template engine is powerful enough for something more sophisticated, but the examples are so trivial that they don't help a newcomer see how to properly configure the content.

I've seen a reference to metadata, and that also needs to be more fully documented.

In my opinion as a newcomer to AI, Meilisearch is in a unique position because in addition to the full-text search capabilities that already exist, the built-in filters/faceting are enormously powerful.

I think that a more complete example would be able to answer something like:

"Provide a list of comedies (movies) released during the Vietnam war" (genre=comedy && release_year in 1955-1975)

"What objects in this collection are made of metal?" (filter=material in gold|silver|bronze)

It's possible that the correct solution to this is to ask the AI provider to return a meilisearch query (given the index settings). At least, right now that's how I might approach it.

In short, I'd like to see the example include some facet/filtering data, as often that data is very relevant. If genre isn't important to a movie, we can find other examples where there are important fields besides full-text ones.
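
To make the two example questions concrete, here is a minimal sketch of how they could be expressed today as manual filters alongside a hybrid query. The attribute names (`genre`, `release_year`, `material`) come from the examples above and would need to be listed in the index's `filterableAttributes`; the embedder name is a placeholder, and the helpers assume values contain no double quotes.

```python
import json
from typing import Iterable, List


def in_list(attribute: str, values: Iterable[str]) -> str:
    """Render `attribute IN ["a", "b"]`. Assumes values contain no double quotes."""
    quoted = ", ".join(f'"{value}"' for value in values)
    return f"{attribute} IN [{quoted}]"


def in_range(attribute: str, start: int, end: int) -> str:
    """Render an inclusive numeric range using the TO operator."""
    return f"{attribute} {start} TO {end}"


def build_filtered_search(query: str, embedder: str, filters: List[str]) -> dict:
    """Combine a hybrid query with filter clauses joined by AND."""
    return {
        "q": query,
        "hybrid": {"embedder": embedder, "semanticRatio": 0.5},
        "filter": " AND ".join(filters),
    }


if __name__ == "__main__":
    comedies = build_filtered_search(
        "movies set against the backdrop of war",
        "default",
        ['genre = "comedy"', in_range("release_year", 1955, 1975)],
    )
    metal_objects = build_filtered_search(
        "ceremonial objects",
        "default",
        [in_list("material", ["gold", "silver", "bronze"])],
    )
    print(json.dumps([comedies, metal_objects], indent=2))
```

This keeps the structured constraints out of the embedded text and leaves the template free to describe what the document is about.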

tacman avatar Jun 04 '25 12:06 tacman

Thanks for the feedback tacman :+1:

We are currently working on systems that might come to detect and apply filters and genres automatically from the request, but they're not there yet :-)

dureuill avatar Jun 04 '25 12:06 dureuill

What do you think about using https://dummyjson.com/products for sample data?

I'm playing around with an importer now, but can switch to another dataset (e.g. kitchenware.json if we can find it).

My real-world applications are a database of news stories, songs and museum objects, but for a demo of the system they may be too context-specific. dummyjson data is pretty neutral.

tacman avatar Jun 04 '25 12:06 tacman

On a related note, suppose I use twig or some other templating engine to create my document? In that case, is the best practice to create a new field and set the documentTemplate to just that?

foreach ($qb->getMovies() as $movie) {
    $movie->setSummary(
      $this->twig->render("my_summary.txt.twig", ['movie'=>$movie])
    );
}

then set documentTemplate to simply {{ doc.summary }}?

tacman avatar Jun 04 '25 12:06 tacman

Yes, that's an option that some users pick. You might still want to limit the length of doc.summary, either by construction with your template engine, or by applying a Liquid filter in the template:

{{doc.summary|truncatewords:40}}
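
For the "by construction" route, a small helper applied before indexing could look like the sketch below. It is shown in Python for brevity and only loosely mirrors what `truncatewords` does (keep the first N words and append an ellipsis when text was cut); the function name and the 40-word default are illustrative, and the same idea would apply inside a Twig-based pipeline.

```python
def truncate_words(text: str, max_words: int = 40, ellipsis: str = "...") -> str:
    """Keep at most `max_words` whitespace-separated words.

    Runs of whitespace are collapsed to single spaces, and the ellipsis is
    appended only when the text was actually shortened.
    """
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    return " ".join(words[:max_words]) + ellipsis


if __name__ == "__main__":
    summary = "A comedy about a small-town band that accidentally goes on tour " * 10
    print(truncate_words(summary, max_words=12))
```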

dureuill avatar Jun 04 '25 13:06 dureuill

I'm starting to get results that make sense, I'll post a demo once I have a current meilisearch server that supports vectors. Locally it's working great.

So this works great for augmenting a search of a database of documents. How do I use Meilisearch to query the AI provider with the RAG data and get back a narrative instead of a list of hits?

That is, if I've indexed local newspaper articles, and I want to ask "Give me a short summary of the status of broadband access in Rappahannock County".

The documents here are newspaper articles ( https://www.rappnews.com/search/?tncms_csrf_token=9883c1b19b4c209feeb5c1666c1772ad1492527c2fb23d8b0ed2cb58c02af22f.0a6d17723efb329484fc&f=html&t=article%2Ccollection&sd=desc&l=25&nsa=eedition&q=cell+tower)

I can see how to return relevant hits (which is truly awesome).

I'm not sure how I can just say "augment the data from the database in your generative response".

With other systems, this involves breaking the document down into chunks, creating a vector and storing that in the database. In fact, neuron-ai can now use meilisearch as the vector store:

https://docs.neuron-ai.dev/components/vector-store#meilisearch

https://docs.neuron-ai.dev/rag

Gut feeling is that I need two vector-enabled indexes, one for the articles themselves created with meilisearch, the other that uses neuron-ai with meilisearch simply as the vector store, but with other embeddings.

tacman avatar Jun 04 '25 14:06 tacman

Hey @tacman 👋

I am very happy that you were able to set up a Meilisearch instance with embeddings and see the solution's potential. The RAG functionality you are talking about will be released soon (next Tuesday, June 10th) as an experimental feature. This chat completion feature aims to allow users to directly speak to their Meilisearch instance through an underlying LLM (OpenAI-compatible for now) and let it understand, format, and answer the user.

I would love your feedback on this feature once you can use it. Please take a look at this early usage page.

Have a nice day 🌵

Kerollmops avatar Jun 04 '25 16:06 Kerollmops

Be still my beating heart.

tacman avatar Jun 04 '25 16:06 tacman

The timing works out for the meilisearch-php version 2

I can imagine a few new classes corresponding to the openai-compatible endpoints/providers.

RAG + Meilisearch all in one library seems ideal.

tacman avatar Jun 04 '25 16:06 tacman

Thanks for opening the issue, @dureuill, and for the feedback, @tacman—the "Getting started with AI search" is targeted at users who don't have a lot of experience with using Meilisearch and AI/LLM tech, so knowing what's not working for you is super helpful. I'll definitely review this tutorial in the near future.

That said, the discussion surrounding RAGs is out of the original issue's scope. Perhaps we should move the conversation to a dedicated issue in the engine repo?

guimachiavelli avatar Jun 04 '25 17:06 guimachiavelli

good idea. RAG and semantic search are related but not the same.

tacman avatar Jun 04 '25 17:06 tacman