
Advise users about big documents that should be split

Kerollmops opened this issue 4 years ago · 8 comments

It would be great if we could redirect users who run into issues with big documents to a page that explains why and how to split big documents like books.

I already talked to someone about this MeiliSearch limitation in an issue, explaining why they should split their articles by paragraph.

Kerollmops · Apr 19 '20

Closing, as this is mentioned under Known limitations.

maryamsulemani97 · Apr 06 '22

Not sure this covers the whole story. I explained in the linked message that splitting documents (even when they are not too big) improves the relevancy of the engine: splitting by paragraph helps users find more specific parts of a book or a webpage (look at the documentation, which is split by subtitles and paragraphs).

Kerollmops · Apr 06 '22

page that explains why and how to split big documents like books.

This is a guide the core team should provide, reviewed by the docs team of course, but in my opinion it's too technical for a non-expert in relevancy to write on their own. Tell me if I'm wrong.

curquiza · Apr 07 '22

Hello @meilisearch/engine-team 👋 Do we have an ETA for this guide?

maryamsulemani97 · Mar 21 '23

I think there is a communication issue here, since you are expecting us to react first 😇

No ETA at all; we did not get any green light from the docs team. Is this something the docs team would like to finally see in the docs and prioritize?

Let's start from the beginning then 😊

@maryamsulemani97

  • Is it something you think is relevant to be in the documentation?
  • Do you need any clarification from Kero about what he expects?
  • Where do you want the guide to be put in the documentation?

@Kerollmops can you explain a little bit what you would like to see in this guide: how long should it be? What content should it cover? Any information that would help the documentation team make decisions about it.

curquiza · Mar 21 '23

Discussed with @guimachiavelli: we are closing this issue since we did not get any answer for a while and we don't know exactly how to solve it.

@Kerollmops please let us know if you think this issue should still be handled by the documentation team.

If yes, please provide the information they need to do the work:

@Kerollmops can you explain a little bit what you would like to see in this guide: how long should it be? What content should it cover? Any information that would help the documentation team make decisions about it.

curquiza · Oct 31 '23

Hey everybody 👋

I'm sorry for not getting back to you sooner. We should provide a guide explaining the right way to index large documents.

Meilisearch is better at handling paragraph-sized documents. The reason is related to the ranking rules we use: Meilisearch favors documents where the matched words sit close to each other and to the query, and in a long document those words can be buried near the end. Splitting your documents increases the quality of the results because Meilisearch then always matches a small, focused part of the content, never something lost far into the text.
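To make that concrete, here is a hypothetical before/after of the document shapes (the titles, IDs, and field names are illustrative, not a required schema):

```js
// Before: one monolithic document. A phrase the user searches for may sit
// tens of thousands of words in, so proximity-based ranking has little to
// work with, and the whole book is returned as the match.
const wholeBook = {
  id: 'moby-dick',
  content: 'Call me Ishmael. [...the remaining ~200,000 words...]',
};

// After: paragraph-sized documents. Whichever paragraph contains the query
// terms is itself the hit, so matches stay tight and specific.
const paragraphDocs = [
  { id: 'moby-dick-p0001', bookId: 'moby-dick', content: 'Call me Ishmael. [...]' },
  { id: 'moby-dick-p0002', bookId: 'moby-dick', content: '[...]' },
];
```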

At the same time, reducing the size of your documents reduces the time it takes Meilisearch to return them. It's mostly a network-transfer issue, but it is part of the user experience, so it's essential. We have seen query times go up to a few seconds when documents are 8MiB on average.

We should guide users into splitting documents by paragraph or by a few sentences, identifying each new document with a unique ID (they can auto-generate one using a UUID, for example). Each paragraph document should then be associated with the ID of the original book or document, and that field set as the distinctAttribute in the settings. Now, when users search, they will always see the single best result from every book they indexed.
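A minimal sketch of that flow with the meilisearch JS client might look like the following (the host URL, the books index, the sample data, and the id/bookId/content field names are assumptions for illustration, not a prescribed schema):

```js
import { randomUUID } from 'node:crypto';
import { MeiliSearch } from 'meilisearch';

// Split one book into paragraph-sized documents, each carrying a link
// back to its parent book.
function splitBook(book) {
  return book.content
    .split('\n\n') // naive paragraph boundary; adapt to your source format
    .filter((paragraph) => paragraph.trim().length > 0)
    .map((paragraph) => ({
      id: randomUUID(), // unique ID for each paragraph document
      bookId: book.id,  // parent-book reference used for deduplication
      title: book.title,
      content: paragraph,
    }));
}

// Hypothetical source data.
const books = [
  { id: 'moby-dick', title: 'Moby-Dick', content: 'Call me Ishmael.\n\nSome years ago...' },
];

const client = new MeiliSearch({ host: 'http://localhost:7700' });
const index = client.index('books');

// Deduplicate results on the parent ID: each book then appears at most
// once, represented by its best-matching paragraph.
await index.updateSettings({ distinctAttribute: 'bookId' });
await index.addDocuments(books.flatMap(splitBook));
```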

Kerollmops · Nov 02 '23

I had planned to fix this issue this week with a single paragraph in the storage or indexing page, but after further consideration I think we can only properly explain this with a guide. I'm currently busy with quite a few new articles, so I'll delay it until later in the quarter.

In the meantime, for my future self: show users how to split a book dataset in a JS application.

outline:

  1. intro
  2. requirements
  3. what counts as a large document?
     • around 8MiB
  4. step 1: split documents with large fields per paragraph
     • aim towards a maximum of 200-400 words/paragraph (confirm actual number with SME)
  5. step 2: generate UUIDs (see the sketch after this outline)
     • let uuids = "hello \r\n cruel \r\n world".split(" \r\n ").map((value, key) => key)
  6. step 3: set UUID as distinct attribute
  7. step 4: search
  8. conclusion
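For future reference, a hedged sketch of steps 2-4 (following Kerollmops' comment above, the parent-book field, here a hypothetical bookId, is what goes into distinctAttribute; the host, index name, and sample text are placeholders):

```js
import { randomUUID } from 'node:crypto';
import { MeiliSearch } from 'meilisearch';

const client = new MeiliSearch({ host: 'http://localhost:7700' });
const index = client.index('books');

// Step 2: array indices (as in the one-liner above) are only unique within
// a single book; randomUUID() stays unique across the whole index.
const paragraphs = 'hello \r\n cruel \r\n world'.split(' \r\n ');
const documents = paragraphs.map((content) => ({
  id: randomUUID(),
  bookId: 'some-book', // hypothetical parent-book field
  content,
}));

// Step 3: deduplicate results on the parent-book field.
await index.updateSettings({ distinctAttribute: 'bookId' });
await index.addDocuments(documents);

// Step 4: each book now surfaces at most once, via its best paragraph.
const { hits } = await index.search('cruel');
console.log(hits);
```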

guimachiavelli · Feb 15 '24