Bootstrapping milli

Open MarinPostma opened this issue 3 years ago • 1 comments

Currently, upgrading from one breaking version of milli to another requires creating a dump and re-importing it. This is not the best user experience, and it would be desirable to improve it.

Bootstrapping milli would mean automatically triggering a reindexing when a breaking version change is detected. On milli's side there are three invariants that need to hold for it to be possible:

  • The document database needs to be stable. This doesn't seem to be an issue, since the database hasn't changed in a while and probably never will.
  • The field id to field name mapping needs to be stable. This one hasn't changed in a while either, so we can probably assume its stability.
  • The settings need to be stable. This one isn't as clear: the settings are stored in a polymorphic database, and its stability is not straightforward.
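To make the "breaking version change is detected" step concrete, here is a minimal sketch of the check that could run when an index is opened. Everything in it is an assumption for illustration: the names `parse_version`, `is_breaking_change`, `OpenAction` and `decide` are hypothetical and not part of milli, and the rule used (a major bump is breaking, and while the major version is 0 a minor bump is breaking too) is the usual semver convention rather than a documented milli policy.

```rust
/// Parse a "major.minor.patch" string; returns None on malformed input.
fn parse_version(s: &str) -> Option<(u32, u32, u32)> {
    let mut parts = s.trim().split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    let patch = parts.next()?.parse().ok()?;
    // Reject trailing components such as "1.2.3.4".
    if parts.next().is_some() {
        return None;
    }
    Some((major, minor, patch))
}

/// Assumed rule (semver convention, not a documented milli policy):
/// a different major version is breaking, and while the major version
/// is 0 a different minor version is breaking as well.
fn is_breaking_change(stored: (u32, u32, u32), current: (u32, u32, u32)) -> bool {
    stored.0 != current.0 || (current.0 == 0 && stored.1 != current.1)
}

/// Hypothetical outcome of opening an index written by another version.
#[derive(Debug, PartialEq)]
enum OpenAction {
    OpenAsIs,
    Reindex,
}

/// Compare the version stored with the index against the running one.
/// Returns None if either version string cannot be parsed.
fn decide(stored: &str, current: &str) -> Option<OpenAction> {
    let stored = parse_version(stored)?;
    let current = parse_version(current)?;
    Some(if is_breaking_change(stored, current) {
        OpenAction::Reindex
    } else {
        OpenAction::OpenAsIs
    })
}
```

With this rule, `decide("0.26.1", "0.27.0")` asks for a reindex while `decide("0.26.0", "0.26.3")` opens the index as is. The reindex itself is only possible if the three invariants above hold, since it has to read the documents, the field mapping and the settings written by the older version.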

A solution would be to store user settings in a serialized structure rather than having each separate field in a database. This struct could embed a version number that would allow us to break in the future, while permitting us to add new settings without breaking anything, replacing missing fields with the default value from one version to the other.
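As a rough sketch of that idea, the snippet below keeps all settings in one struct that carries a version number and falls back to defaults for fields missing from an older payload. The field names (`format_version`, `searchable_fields`, `max_typos`) and the line-based `key=value` encoding are assumptions chosen to keep the example dependency-free; they are not milli's actual settings or storage format. In a real codebase the same behaviour would more likely come from serde with `#[serde(default)]` on the struct.

```rust
use std::collections::HashMap;

/// Hypothetical settings struct; the field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct Settings {
    /// Embedded version, so the format can be broken deliberately later.
    format_version: u32,
    searchable_fields: Vec<String>,
    /// Imagined as a setting added after the first release.
    max_typos: u32,
}

impl Default for Settings {
    fn default() -> Self {
        Settings {
            format_version: 1,
            searchable_fields: vec!["*".to_string()],
            max_typos: 2,
        }
    }
}

impl Settings {
    /// Serialize as one `key=value` pair per line (assumed toy format).
    fn serialize(&self) -> String {
        format!(
            "format_version={}\nsearchable_fields={}\nmax_typos={}\n",
            self.format_version,
            self.searchable_fields.join(","),
            self.max_typos
        )
    }

    /// Deserialize, replacing any missing or unparsable field with its default.
    fn deserialize(raw: &str) -> Settings {
        let fields: HashMap<&str, &str> = raw
            .lines()
            .filter_map(|line| line.split_once('='))
            .collect();
        let defaults = Settings::default();
        Settings {
            format_version: fields
                .get("format_version")
                .and_then(|v| v.parse().ok())
                .unwrap_or(defaults.format_version),
            searchable_fields: fields
                .get("searchable_fields")
                .map(|v| v.split(',').map(str::to_string).collect())
                .unwrap_or(defaults.searchable_fields),
            max_typos: fields
                .get("max_typos")
                .and_then(|v| v.parse().ok())
                .unwrap_or(defaults.max_typos),
        }
    }
}
```

A payload written before `max_typos` existed, such as `"format_version=1\nsearchable_fields=title,artist\n"`, still loads: the missing field simply takes its default value, which is the non-breaking behaviour described above.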

From the looks of it, this doesn't seem like very complicated work, and it would definitely bring value.

MarinPostma, Mar 24 '22 10:03

I agree with your point of view regarding how hard it currently is to upgrade the engine. I will just comment on a few points here:

the document database needs to be stable. This doesn't seem to be an issue, since the database hasn't changed in a while, and probably never will.

The documents database, which is currently a raw LMDB database, is one of the biggest databases that we have right now. If we want to reduce the disk usage of Meilisearch, we will probably change its format by using a grenad segment-oriented data structure. By doing so we will be able to compress the raw user documents without impacting the search time, as a search typically returns 20-40 documents at a time.

For an 80M songs database

As you can see, the documents database takes 11.11 GiB for 80,000,100 documents, which makes it the biggest database here, whereas the song.csv.gz file with 115 million documents is only 2.3 GiB. By doing the sum it seems like it weighs 32.38 GiB.

The main database weigh:
	total key size: 516 B
	total val size: 130.44 MiB
	total size: 130.44 MiB
	number of entries: 24
The word-docids database weigh:
	total key size: 28.45 MiB
	total val size: 1.60 GiB
	total size: 1.63 GiB
	number of entries: 3276632
The word-prefix-docids database weigh:
	total key size: 37.22 KiB
	total val size: 2.86 GiB
	total size: 2.86 GiB
	number of entries: 10896
The docid-word-positions database weigh:
	total key size: 5.61 GiB
	total val size: 2.75 GiB
	total size: 8.36 GiB
	number of entries: 674516404
The word-pair-proximity-docids database weigh:
	total key size: 999.45 MiB
	total val size: 6.73 GiB
	total size: 7.71 GiB
	number of entries: 80185705
The word-prefix-pair-proximity-docids database weigh:
	total key size: 444.52 MiB
	total val size: 9.70 GiB
	total size: 10.14 GiB
	number of entries: 43943290
The word-position-docids database weigh:
	total key size: 100.74 MiB
	total val size: 2.14 GiB
	total size: 2.24 GiB
	number of entries: 8916197
The word-prefix-position-docids database weigh:
	total key size: 6.67 MiB
	total val size: 5.03 GiB
	total size: 5.04 GiB
	number of entries: 950716
The field-id-word-count-docids database weigh:
	total key size: 90 B
	total val size: 178.49 MiB
	total size: 178.49 MiB
	number of entries: 30
The facet-id-f64-docids database weigh:
	total key size: 0 B
	total val size: 0 B
	total size: 0 B
	number of entries: 0
The facet-id-string-docids database weigh:
	total key size: 30.70 MiB
	total val size: 3.26 GiB
	total size: 3.29 GiB
	number of entries: 1851821
The field-id-docid-facet-f64s database weigh:
	total key size: 0 B
	total val size: 0 B
	total size: 0 B
	number of entries: 0
The field-id-docid-facet-strings database weigh:
	total key size: 4.70 GiB
	total val size: 2.73 GiB
	total size: 7.43 GiB
	number of entries: 353564264
The documents database weigh:
	total key size: 305.18 MiB
	total val size: 10.82 GiB
	total size: 11.11 GiB
	number of entries: 80000100

A solution would be to store user settings in a serialized structure rather than having each separate field in a database. This struct could embed a version number that would allow us to break in the future, while permitting us to add new settings without breaking anything, replacing missing fields with the default value from one version to the other.

You don't need to add a version number to the settings: we can use the version of Meilisearch for that, and we already have access to it. However, using JSON instead of split settings could help a lot with backward compatibility.

Kerollmops, Mar 24 '22 11:03

Having milli auto-reindex the database would be extremely helpful when embedding milli, as I currently need to manually write migration code and bundle every version of milli alongside my library. Related issues:

  • https://github.com/GregoryConrad/mimir/issues/9
  • for the settings struct: https://github.com/meilisearch/meilisearch/issues/3366
  • https://github.com/meilisearch/meilisearch/issues/2570

GregoryConrad, Dec 03 '22 20:12

Closed in favor of https://github.com/meilisearch/meilisearch/issues/2570

curquiza, Jan 16 '23 17:01