
Create a page: Indexing - best practice

Open qdequele opened this issue 4 years ago • 3 comments

Create a page about best practices for indexing long documents, multilingual content, multi-client setups, etc.

Indexing long documents.

Cut documents into smaller pieces.

If you are trying to index https://en.wikipedia.org/wiki/Paris, you can split the page by section title.

[
    {
        "id": "paris_en_1",
        "page": "paris_en",
        "section": "Etymology",
        "permalink": "https://en.wikipedia.org/wiki/Paris#Etymology",
        "content": "The name \"Paris\" is derived from its early inhabitants, the Celtic Parisii tribe.[14] The city's name is not related to the Paris of Greek mythology ..."
    },
    {
        "id": "paris_en_2",
        "page": "paris_en",
        "section": "History / Origins",
        "permalink": "https://en.wikipedia.org/wiki/Paris#Origins",
        "content": "The Parisii, a sub-tribe of the Celtic Senones, inhabited the Paris area from around the middle of the 3rd century BC.[20][21] One of the area's major ..."
    },
    {
        "id": "paris_en_3",
        "page": "paris_en",
        "section": "History / Middle Ages to Louis XIV",
        "permalink": "https://en.wikipedia.org/wiki/Paris#Middle_Ages_to_Louis_XIV",
        "content": "By the end of the 12th century, Paris had become the political, economic, religious, and cultural capital of France.[30] The Palais de la Cité, the royal ..."
    }
]

Thanks to this, you can also jump directly to the relevant part of the article using the permalink. It would also be worthwhile to set a distinct rule on the page field to make sure at most one result per page is returned.
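The splitting step above could be sketched as follows. This is a minimal illustration, not a Meilisearch API: the `split_page` helper and the field layout are hypothetical, matching the document shape shown in the example.

```python
# Minimal sketch: turn one long page into per-section documents.
# split_page and its arguments are illustrative, not a Meilisearch API.

def split_page(page_id, base_url, sections):
    """Build one small document per (section title, content) pair."""
    docs = []
    for i, (title, content) in enumerate(sections, start=1):
        # derive the URL fragment from the last path element of the title
        anchor = title.split(" / ")[-1].replace(" ", "_")
        docs.append({
            "id": f"{page_id}_{i}",
            "page": page_id,
            "section": title,
            "permalink": f"{base_url}#{anchor}",
            "content": content,
        })
    return docs

docs = split_page(
    "paris_en",
    "https://en.wikipedia.org/wiki/Paris",
    [
        ("Etymology", "The name \"Paris\" is derived from its early inhabitants ..."),
        ("History / Origins", "The Parisii, a sub-tribe of the Celtic Senones ..."),
    ],
)
```

Each resulting document keeps the original page id in `page`, so a distinct rule on that attribute can deduplicate results back to one hit per page.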

Multilingual content

Instead of keeping every language as an attribute of the same document, it is often better to create one index per language.

Data:

[
    {
        "skuid": "SG87D56P",
        "brand": "Eastpak",
        "name_en": "Backpack Floid",
        "name_fr": "Sac à dos Floid",
        "name_es": "Mochila Floid",
        "description_en": "The quintessential backpack for the busy: Floid in Black combines style and substance with the ergonomic construction",
        "description_fr": "Le sac à dos parfait pour les journées chargées : le Floid en version Black combine style et fonctionnalité avec sa conception ergonomique",
        "description_es": "Nuestra mochila por excelencia para los más ocupados: la Floid en estilo Black combina estilo y esencia con un diseño ergonómico"
    }
]

Split this data into multiple indexes:

Index: shop_en

[
    {
        "skuid": "SG87D56P",
        "brand": "Eastpak",
        "name": "Backpack Floid",
        "description": "The quintessential backpack for the busy: Floid in Black combines style and substance with the ergonomic construction"
    }
]

Index: shop_fr

[
    {
        "skuid": "SG87D56P",
        "brand": "Eastpak",
        "name": "Sac à dos Floid",
        "description": "Le sac à dos parfait pour les journées chargées : le Floid en version Black combine style et fonctionnalité avec sa conception ergonomique"
    }
]

Index: shop_es

[
    {
        "skuid": "SG87D56P",
        "brand": "Eastpak",
        "name": "Mochila Floid",
        "description": "Nuestra mochila por excelencia para los más ocupados: la Floid en estilo Black combina estilo y esencia con un diseño ergonómico"
    }
]
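The per-language split could be sketched like this. It is an illustrative helper only: the `split_by_language` name, the `shop_` index prefix, and the `_en`/`_fr`/`_es` suffix convention all come from the example above, not from any library.

```python
# Sketch: split one multilingual record into one document per language,
# folding suffixed fields (name_en, ...) into plain ones (name).
# Helper name and conventions are illustrative.

def split_by_language(doc, langs=("en", "fr", "es"),
                      localized=("name", "description")):
    """Return {index_name: [document]} for each language."""
    out = {}
    for lang in langs:
        # keep only the non-localized attributes (skuid, brand, ...)
        d = {k: v for k, v in doc.items()
             if not any(k.startswith(f"{field}_") for field in localized)}
        # fold the suffixed field for this language into a plain field
        for field in localized:
            d[field] = doc[f"{field}_{lang}"]
        out[f"shop_{lang}"] = [d]
    return out

indexes = split_by_language({
    "skuid": "SG87D56P",
    "brand": "Eastpak",
    "name_en": "Backpack Floid",
    "name_fr": "Sac à dos Floid",
    "name_es": "Mochila Floid",
    "description_en": "The quintessential backpack for the busy ...",
    "description_fr": "Le sac à dos parfait pour les journées chargées ...",
    "description_es": "Nuestra mochila por excelencia para los más ocupados ...",
})
```

Each language index then holds documents with the same plain field names, so the same search settings can be reused across indexes.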

Multi-client

What I call multi-client is when the same database holds documents belonging to multiple people, and you don't want one person to be able to see another person's content. In the future, we will have a more advanced authorization layer with a token that automatically applies filters on the data. For now, the only way is to create one index per person.
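The one-index-per-person approach could be sketched as follows. The `index_name_for` and `route_documents` helpers, the `notes_user_` prefix, and the `owner` field are all hypothetical names for illustration.

```python
# Sketch: route each person's documents to a dedicated index so one
# person can never search another person's data. All names illustrative.

def index_name_for(user_id):
    # one index per person, e.g. "notes_user_42"
    return f"notes_user_{user_id}"

def route_documents(docs):
    """Group documents by owner into {index_name: [documents]}."""
    routed = {}
    for doc in docs:
        routed.setdefault(index_name_for(doc["owner"]), []).append(doc)
    return routed

routed = route_documents([
    {"owner": 1, "id": "a"},
    {"owner": 2, "id": "b"},
    {"owner": 1, "id": "c"},
])
```

At query time, the application then searches only the index belonging to the authenticated user.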

Simplify your documents

Remove as many unnecessary fields as possible from your documents. If you don't search on a field, display it in your application, use it for custom ranking, or filter on it, then remove it. The smaller your documents are, the faster your search will be.

qdequele avatar Apr 24 '20 07:04 qdequele

On #432:

It would be great if we provided a guide page explaining that splitting is the best way to deal with big documents. This is not the first time people have tried to index big documents in MeiliSearch and run into issues. Splitting documents by paragraph, or by sub-title/content as in the MeiliSearch documentation, is the best approach, as it allows the engine to answer with a more precise part of the original text. To do so, split the original documents into shorter parts that keep the original document id, and use a distinctRule on that id to make sure only the best-matching part of each original document is shown to the user.

tpayet avatar Jun 29 '20 14:06 tpayet

Concerning big document splitting.

It could be added in two places:

  1. In the FAQ, as it appears to be a common question:
    • Why are my documents not fully indexed?
    • Why is my dataset not fully added?
  2. In the documents guide:
The best way to upload a big dataset is to split it into smaller chunks. This prevents data loss and potential memory crashes.
If your documents contain fields with a lot of data, consider splitting them by paragraph or by sub-title/content. This allows the engine to answer with more precision.
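The chunked-upload advice above could be sketched with a small batching helper; the `batches` name and the batch size of 1000 are illustrative choices, not a documented limit.

```python
# Sketch: send a big dataset in fixed-size batches instead of one giant
# payload. The helper name and batch size are illustrative.

def batches(docs, size=1000):
    """Yield successive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]

# e.g. each batch would be sent as a separate add-documents request
chunks = list(batches(list(range(2500)), size=1000))
```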

bidoubiwa avatar Jun 29 '20 14:06 bidoubiwa

It should probably include the exact limit of characters or words.

Similar to issue #807, I'm also seeing attributes (under both 7000 characters and 1000 words) that aren't entirely indexed. There are cases where splitting by paragraph simply isn't possible, and even when it is, a long paragraph may exceed the limit and not be fully indexed.

A known hard limit would help in terms of designing the best way to add documents with split content.

mikerogerz avatar Jun 29 '20 18:06 mikerogerz