milli icon indicating copy to clipboard operation
milli copied to clipboard

Change update file format from (OBKV + Fields Ids Map) to a grenad of Json values

Open loiclec opened this issue 3 years ago • 1 comments
trafficstars

Pull Request

NOTE: this PR is based on #561

What does this PR do?

It is mostly a refactor with a goal of simplifying the code in milli/src/documents and Transform::read_documents.

Previously, when meilisearch received a document addition task, it would save the new documents to disk as OBKVs where the keys are field ids and the values are the corresponding Json values. Since the field ids in that OBKV were not the same as the field ids in the milli database, it would also save a FieldsIdsMap to associate each field id with its original key name and vice-versa. When milli read that update file, it had to do merge the database's field ids map with the update file's field ids map. It would also need to convert the OBKV of each document back to Json in order to flatten it, and then back to OBKV. It was a bit complicated.

With this PR, the new documents are simply saved in a grenad as Json values, which is conceptually simpler: we don't convert back and forth between OBKV and Json, or deal with different field ids maps. I have also tried to comment the code and rename a few things to make it easier to understand.

flatten_serde_json

I have updated this crate so that it returns a Cow<serde_json::Value> instead of a serde_json::Value. The idea is to return a Cow::Borrowed(original_doument) whenever the object didn't need to be flattened. This can be good for performance.

DocumentVisitor

I saw that @Kerollmops removed the serde_impl.rs file. Unfortunately, I think we need to keep the functionality offered by that file. Without it, in order to process a document addition in the .json format, we need to hold the entire payload (Vec<u8>) in memory, and then deserialise it to a Vec<Object>. At that point, we are using more than 2x the size of the new documents in RAM. We don't have a benchmark that stresses that use case, but it is a problem nonetheless.

So I rewrote the serde_impl.rs file into document_visitor.rs. Since the update file format was simplified, the code is simpler now. I have also tried to comment it well. Basically, the idea is to:

  1. visit the Json array using serde
  2. write each visited document to the update file directly

Guessing the primary key

Previously, if a primary key was not specified by the user, we would look in the FieldsIdsMap to find a field that looks like a primary key. With this PR, since we no longer have access the fields id map, only the fields of the very first document are considered as potential primary keys.

loiclec avatar Jul 05 '22 12:07 loiclec

Hey!

Thank you, that is a great PR 🎉

[About the serde_impl.rs file] Unfortunately, I think we need to keep the functionality offered by that file.

Indeed, I wanted to simplify the system by removing this serde trait implementation. I love the fact that you use the DocumentVisitor to directly write into the DocumentBatchVisitor it is smart!

[About guessing the Primary Key] With this PR, since we no longer have access the fields id map, only the fields of the very first document are considered as potential primary keys.

I hope it will work with the auto-generated documents ids too 😃

Kerollmops avatar Jul 05 '22 12:07 Kerollmops

This didn't turn out to be good in terms of performance, so I'm dropping this PR for now.

loiclec avatar Aug 18 '22 12:08 loiclec