
khepri: failed migration to Khepri may leave the node in an unusable state

Open mkuratczyk opened this issue 3 years ago • 2 comments

Note: this issue is related to an unreleased feature

If the Mnesia->Khepri migration fails for any reason, upon the next startup the metadata_store tries to load the migrated part of the metadata. I believe that if the feature flag is disabled, we should delete the coordination folder during metadata_store initialization, because it is either empty or it contains leftovers from a failed migration (are there any other situations?).

In my case the problem was that I ran out of memory during the migration. On the next boot, we'd try to load the migrated part of the data into Khepri/memory again and run out of memory again, leading to a crash loop. If the migration failed, we should start fresh, still using Mnesia. The FF was still reported as disabled in my case, so going back to Mnesia should not be a problem.
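The proposed boot-time behaviour can be sketched as follows. This is a hypothetical Python model, not RabbitMQ's actual Erlang implementation; the function name, arguments, and return values are made up for illustration, with only the "coordination" directory name taken from the discussion above:

```python
import shutil
from pathlib import Path

def init_metadata_store(data_dir, khepri_enabled):
    """Illustrative only: pick the metadata store backend on node boot."""
    coord = Path(data_dir) / "coordination"
    if not khepri_enabled and coord.exists():
        # With the feature flag disabled, anything under coordination/ is
        # either empty or a leftover from a failed Mnesia->Khepri migration,
        # so it is safe to discard it before initialising the store.
        shutil.rmtree(coord)
    return "khepri" if khepri_enabled else "mnesia"
```

The key property is that a node whose feature flag is still disabled never attempts to load partially migrated data, so an out-of-memory failure during migration cannot turn into a crash loop on the next boot.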

mkuratczyk avatar Jul 26 '22 18:07 mkuratczyk

I added a change to clear the metadata store on startup if khepri is disabled: https://github.com/rabbitmq/rabbitmq-server/commit/d0c6246d0917d7275985ea605d980b5eea26d5f8

Although I'm not sure it will address the issue of data being loaded in memory, as this is done after starting the ra system. @dumbbell do you see any issue with clearing khepri data there?

dcorbacho avatar Aug 29 '22 11:08 dcorbacho

I killed beam during the migration and RabbitMQ won't start now. Another timeout we need to ignore I guess:

dets: file "/home/mkuratczyk/data/rabbit@kura/mnesia/rabbit@kura/coordination/rabbit@kura/meta.dets" not properly closed, repairing ...

BOOT FAILED
===========
Exception during startup:

error:{badmatch,{error,{timeout,{metadata_store,rabbit@kura}}}}

    rabbit_khepri:setup/1, line 112
    rabbit:run_prelaunch_second_phase/0, line 382
    rabbit:start/2, line 847
    application_master:start_it_old/4, line 293

mkuratczyk avatar Aug 31 '22 07:08 mkuratczyk

Retested a year later: I believe we can close this. As far as I can tell, these days, if khepri_db isn't enabled, then Khepri is not started. Therefore, a failed migration should not trigger a crash loop, since the system should start with Mnesia again and should not load any data into Khepri.

I also tried to migrate a queue that no longer exists:

  1. declare a queue
  2. import a lot of queues
  3. trigger migration and kill it before it finishes
  4. start the node again
  5. delete the queue
  6. enable khepri again

I suspected that perhaps the queue would be there after the migration but no.

mkuratczyk avatar Sep 12 '23 19:09 mkuratczyk

@mkuratczyk: I don't understand the second part of your test when you say "I suspected that perhaps the queue would be there after the migration but no.". Could you please expand on what you expected and what happened?

dumbbell avatar Sep 22 '23 10:09 dumbbell

The actual behaviour is correct (at least I couldn't trigger a problem). The idea for what could go wrong was:

  1. queue foo exists in Mnesia
  2. migration is triggered
  3. migration is killed after foo is inserted into Khepri, but before the migration finishes and deletes all data from Mnesia
  4. queue foo is deleted in Mnesia
  5. migration is started again and completes
  6. is foo in Khepri or not?

It's not, at least in my testing, which is the correct behaviour since it doesn't exist at all during the migration that completed. I was just testing whether leftovers from an unsuccessful migration could lead to a non-existing queue being recreated.

mkuratczyk avatar Sep 22 '23 10:09 mkuratczyk