rabbitmq-server
khepri: failed migration to Khepri may leave the node in an unusable state
Note: this issue is related to an unreleased feature
If the Mnesia->Khepri migration fails for any reason, upon the next startup the metadata_store tries to load the already-migrated part of the metadata. I believe that if the feature flag is disabled, we should delete the coordination folder during metadata_store initialization, because it is either empty or it contains leftovers from a failed migration (are there any other situations?).
In my case the problem was that I ran out of memory during the migration. On the next boot, we'd try to load the migrated part of the data into Khepri/memory again, and run out of memory again, leading to a crash loop. If the migration failed, we should start fresh, still using Mnesia. The feature flag was still reported as disabled in my case, so going back to Mnesia should not be a problem.
I added a change to clear the metadata store on startup if khepri is disabled: https://github.com/rabbitmq/rabbitmq-server/commit/d0c6246d0917d7275985ea605d980b5eea26d5f8
I'm not sure it will address the issue of data being loaded into memory, though, as this happens after the Ra system is started. @dumbbell do you see any issue with clearing the Khepri data there?
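To illustrate the idea (this is a minimal sketch, not the change from the commit above), the cleanup could look roughly like this; the module name and the way the coordination directory path is obtained are assumptions made for this example:

```erlang
%% Minimal sketch only, not the actual commit. Assumption: CoordDir points
%% at the node's "coordination" directory on disk.
-module(khepri_leftover_cleanup).
-export([maybe_clear/1]).

maybe_clear(CoordDir) ->
    case rabbit_feature_flags:is_enabled(khepri_db) of
        true ->
            %% Khepri is the active metadata store; leave its data alone.
            ok;
        false ->
            %% The flag is disabled, so the directory is either empty or
            %% holds leftovers from a failed Mnesia->Khepri migration.
            case filelib:is_dir(CoordDir) of
                true  -> file:del_dir_r(CoordDir);
                false -> ok
            end
    end.
```

The point is only to show the guard on the khepri_db feature flag; the real initialization code is obviously more involved.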
I killed the beam process during the migration and RabbitMQ won't start now. Another timeout we need to ignore, I guess:
dets: file "/home/mkuratczyk/data/rabbit@kura/mnesia/rabbit@kura/coordination/rabbit@kura/meta.dets" not properly closed, repairing ...
BOOT FAILED
===========
Exception during startup:
error:{badmatch,{error,{timeout,{metadata_store,rabbit@kura}}}}
rabbit_khepri:setup/1, line 112
rabbit:run_prelaunch_second_phase/0, line 382
rabbit:start/2, line 847
application_master:start_it_old/4, line 293
Retested a year later: I believe we can close this. As far as I can tell, these days, if khepri_db isn't enabled, Khepri is not started at all. Therefore, a failed migration should not trigger a crash loop, since the system should start with Mnesia again and should not load any data into Khepri.
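For reference, a quick way to confirm that state is to evaluate the feature flag on the node (for example with rabbitmqctl eval or from a remote shell); this is just an illustrative check:

```erlang
%% When this returns false, the node boots with Mnesia and should not
%% attempt to load migrated data into Khepri.
rabbit_feature_flags:is_enabled(khepri_db).
```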
I also tried to migrate a queue that no longer exists:
- declare a queue
- import a lot of queues
- trigger migration and kill it before it finishes
- start the node again
- delete the queue
- enable khepri again
I suspected that perhaps the queue would be there after the migration but no.
@mkuratczyk: I don't understand the second part of your test when you say "I suspected that perhaps the queue would be there after the migration but no.". Could you please expand on what you expected and what happened?
The actual behaviour is correct (at least I couldn't trigger a problem). The idea for what could go wrong was:
- queue foo exists in Mnesia
- migration is triggered
- migration is killed after foo is inserted into Khepri, but before the migration finishes and deletes all data in Mnesia
- queue foo is deleted in Mnesia
- migration is started again and completes
- is foo in Khepri or not?
It's not, at least in my testing, which is the correct behaviour, since the queue doesn't exist at all during the migration that completes. I was just testing whether leftovers from an unsuccessful migration could lead to a non-existent queue being recreated.
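For completeness, here is a sketch of how the presence of such a queue can be checked after the second migration completes; the module name is made up for this example, and foo plus the default vhost come from the scenario above:

```erlang
-module(check_queue).
-export([exists/2]).

%% Look the queue up in whatever metadata store is currently active
%% (Khepri after the migration, Mnesia before it).
exists(VHost, Name) ->
    QName = rabbit_misc:r(VHost, queue, Name),
    case rabbit_amqqueue:lookup(QName) of
        {ok, _Q}           -> true;
        {error, not_found} -> false
    end.

%% Usage (e.g. from a remote shell): check_queue:exists(<<"/">>, <<"foo">>)
%% returns false in the scenario above.
```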