Pull Request

Related issue

Related to https://github.com/meilisearch/meilisearch/discussions/3280, depends on https://github.com/meilisearch/milli/pull/760

What does this PR do?

User standpoint

Meilisearch no longer fails tasks that encounter the milli::UserError(MaxDatabaseSizeReached) error.
Instead, these tasks are retried after increasing the maximum size allocated to the index where the failure occurred.

Implementation standpoint

Add Batch::index_uid to get the index_uid of a batch of task if there is one
IndexMapper::create_or_open_index now takes an additional size argument that allows to (re)open indexes with a size different from the base IndexScheduler::index_size field
IndexScheduler::tick now returns a Result<TickOutcome> instead of a Result<usize>. This offers more explicit control over what the behavior should be wrt the next tick.
Add IndexStatus::BeingResized that contains a handle that a thread can use to await for the resize operation to complete and the index to be available again.
Add IndexMapper::resize_index to increase the size of an index.
In IndexScheduler::tick, intercept task batches that failed due to MaxDatabaseSizeReached and resize the index that caused the error, then request a new tick that will eventually handle the still enqueued task.

Testing the PR

The following diff can be applied to this branch to make testing the PR easier:

diff --git a/index-scheduler/src/index_mapper.rs b/index-scheduler/src/index_mapper.rs
index 553ab45a..022b2f00 100644
--- a/index-scheduler/src/index_mapper.rs
+++ b/index-scheduler/src/index_mapper.rs
@@ -228,13 +228,15 @@ impl IndexMapper {
 
         drop(lock);
 
+        std::thread::sleep_ms(2000);
+
         let current_size = index.map_size()?;
         let closing_event = index.prepare_for_closing();
-        log::info!("Resizing index {} from {} to {} bytes", name, current_size, current_size * 2);
+        log::error!("Resizing index {} from {} to {} bytes", name, current_size, current_size * 2);
 
         closing_event.wait();
 
-        log::info!("Resized index {} from {} to {} bytes", name, current_size, current_size * 2);
+        log::error!("Resized index {} from {} to {} bytes", name, current_size, current_size * 2);
 
         let index_path = self.base_path.join(uuid.to_string());
         let index = self.create_or_open_index(&index_path, None, 2 * current_size)?;
@@ -268,8 +270,10 @@ impl IndexMapper {
             match index {
                 Some(Available(index)) => break index,
                 Some(BeingResized(ref resize_operation)) => {
+                    log::error!("waiting for resize end");
                     // Deadlock: no lock taken while doing this operation.
                     resize_operation.wait();
+                    log::error!("trying our luck again!");
                     continue;
                 }
                 Some(BeingDeleted) => return Err(Error::IndexNotFound(name.to_string())),
diff --git a/index-scheduler/src/lib.rs b/index-scheduler/src/lib.rs
index 11b17d05..242dc095 100644
--- a/index-scheduler/src/lib.rs
+++ b/index-scheduler/src/lib.rs
@@ -908,6 +908,7 @@ impl IndexScheduler {
     ///
     /// Returns the number of processed tasks.
     fn tick(&self) -> Result<TickOutcome> {
+        log::error!("ticking!");
         #[cfg(test)]
         {
             *self.run_loop_iteration.write().unwrap() += 1;
diff --git a/meilisearch/src/main.rs b/meilisearch/src/main.rs
index 050c825a..63f312f6 100644
--- a/meilisearch/src/main.rs
+++ b/meilisearch/src/main.rs
@@ -25,7 +25,7 @@ fn setup(opt: &Opt) -> anyhow::Result<()> {
 
 #[actix_web::main]
 async fn main() -> anyhow::Result<()> {
-    let (opt, config_read_from) = Opt::try_build()?;
+    let (mut opt, config_read_from) = Opt::try_build()?;
 
     setup(&opt)?;
 
@@ -56,6 +56,8 @@ We generated a secure master key for you (you can safely copy this token):
         _ => (),
     }
 
+    opt.max_index_size = byte_unit::Byte::from_str("1MB").unwrap();
+
     let (index_scheduler, auth_controller) = setup_meilisearch(&opt)?;
 
     #[cfg(all(not(debug_assertions), feature = "analytics"))]

Mainly, these debug changes do the following:

Set the default index size to 1MiB so that index resizes are initially frequent
Turn some logs from info to error so that they can be displayed with --log-level ERROR (hiding the other infos)
Add a long sleep between the beginning and the end of the resize so that we can observe the BeingResized index status (otherwise it would never come up in my tests)

Open questions

Is the growth factor of x2 the correct solution? For a Vec in memory it makes sense, but here we're manipulating quantities that are potentially in the order of 500GiBs. For bigger indexes it may make more sense to add at most e.g. 100GiB on each resize operation, avoiding big steps like 500GiB -> 1TiB.

PR checklist

Please check if your PR fulfills the following requirements:

[ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
[ ] Have you read the contributing guidelines?
[ ] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

Jan 10 '23 10:01 dureuill

Pushed an update to handle the case where an error (eg IoError) would leave the hashmap in an inconsistent state.

Jan 10 '23 16:01 dureuill

I feel that with the tests I did, and the that that milli is now in main, this PR is now ready for review.

Jan 30 '23 09:01 dureuill

Uffizzi Preview deployment-14597 was deleted.

Feb 07 '23 13:02 github-actions[bot]

Do you think there is a way to test this behaviour, though?

It should really be in this PR, but #3331 adds automatic tests of index resizing.

Feb 14 '23 08:02 dureuill

Build succeeded:

Feb 20 '23 15:02 bors[bot]

Transparently resize indexes on MaxDatabaseSizeReached errors

Pull Request

Related issue

What does this PR do?

User standpoint

Implementation standpoint

Testing the PR

Open questions

PR checklist