
khepri: failed migration to Khepri may leave the node in an unusable state

Open mkuratczyk opened this issue 3 years ago • 2 comments

Note: this issue is related to an unreleased feature

If the Mnesia->Khepri migration fails for any reason, upon the next startup the metadata_store tries to load the migrated part of the metadata. I believe that if the feature flag is disabled, we should delete the coordination folder during metadata_store initialization, because it is either empty or it contains leftovers from a failed migration (are there any other situations?).

In my case the problem was that I ran out of memory during the migration. On the next boot, we'd try to load the migrated part of the data into Khepri/memory again and run out of memory again, leading to a crash loop. If the migration failed, we should start fresh, still using Mnesia. The FF was still reported as disabled in my case, so going back to Mnesia should not be a problem.
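The proposed boot-time behaviour can be sketched as follows. This is a hypothetical Python model, not RabbitMQ's actual Erlang implementation; the function name, arguments, and return values are made up for illustration, with only the "coordination" directory name taken from the discussion above:

```python
import shutil
from pathlib import Path

def init_metadata_store(data_dir, khepri_enabled):
    """Illustrative only: pick the metadata store backend on node boot."""
    coord = Path(data_dir) / "coordination"
    if not khepri_enabled and coord.exists():
        # With the feature flag disabled, anything under coordination/ is
        # either empty or a leftover from a failed Mnesia->Khepri migration,
        # so it is safe to discard it before initialising the store.
        shutil.rmtree(coord)
    return "khepri" if khepri_enabled else "mnesia"
```

The key property is that a node whose feature flag is still disabled never attempts to load partially migrated data, so an out-of-memory failure during migration cannot turn into a crash loop on the next boot.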

mkuratczyk avatar Jul 26 '22 18:07 mkuratczyk

I added a change to clear the metadata store on startup if khepri is disabled: https://github.com/rabbitmq/rabbitmq-server/commit/d0c6246d0917d7275985ea605d980b5eea26d5f8

Although I'm not sure it will address the issue of data being loaded in memory, as this is done after starting the ra system. @dumbbell do you see any issue with clearing khepri data there?

dcorbacho avatar Aug 29 '22 11:08 dcorbacho

I killed beam during the migration and RabbitMQ won't start now. Another timeout we need to ignore I guess:

dets: file "/home/mkuratczyk/data/rabbit@kura/mnesia/rabbit@kura/coordination/rabbit@kura/meta.dets" not properly closed, repairing ...

BOOT FAILED
===========
Exception during startup:

error:{badmatch,{error,{timeout,{metadata_store,rabbit@kura}}}}

    rabbit_khepri:setup/1, line 112
    rabbit:run_prelaunch_second_phase/0, line 382
    rabbit:start/2, line 847
    application_master:start_it_old/4, line 293

mkuratczyk avatar Aug 31 '22 07:08 mkuratczyk

Retested a year later: I believe we can close this. As far as I can tell, these days, if khepri_db isn't enabled, then Khepri is not started. Therefore, a failed migration should not trigger a crash loop, since the system should start with Mnesia again and should not load any data into Khepri.

I also tried to migrate a queue that no longer exists:

  1. declare a queue
  2. import a lot of queues
  3. trigger migration and kill it before it finishes
  4. start the node again
  5. delete the queue
  6. enable khepri again

I suspected that perhaps the queue would be there after the migration but no.

mkuratczyk avatar Sep 12 '23 19:09 mkuratczyk

@mkuratczyk: I don't understand the second part of your test when you say "I suspected that perhaps the queue would be there after the migration but no.". Could you please expand on what you expected and what happened?

dumbbell avatar Sep 22 '23 10:09 dumbbell

The actual behaviour is correct (at least I couldn't trigger a problem). The idea for what could go wrong was:

  1. queue foo exists in Mnesia
  2. migration is triggered
  3. migration is killed after foo is inserted into Khepri, but before the migration finishes and deletes all data from Mnesia
  4. queue foo is deleted in Mnesia
  5. migration is started again and completes
  6. is foo in Khepri or not?

It's not, at least in my testing, which is the correct behaviour since it doesn't exist at all during the migration that completed. I was just testing whether leftovers from an unsuccessful migration could lead to a non-existing queue being recreated.

mkuratczyk avatar Sep 22 '23 10:09 mkuratczyk