sap-commerce-db-sync
Migration tasks cannot restart when platform nodes crash
We have encountered the following situation when running commerce-db-sync on CCv2:
- a migration task was in progress
- platform node crashed and was restarted by the CCv2 orchestrator
- on platform restart, `CronJobManager#getRunningOrRestartedCronJobsForNode` sets the migration cronjob to ABORTED
- we tried to restart the cronjob from Backoffice (but a trigger ends up in the same situation)
- the cronjob starts but never does anything, just waits forever for a sync task to complete
On further investigation, we found that the incremental job implementation checks whether there is an existing running migration and, if so, simply waits for it to finish. But that migration is not actually running: the thread that was executing it was lost with the node restart.
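To illustrate the hang, here is a minimal, self-contained sketch (class and method names are hypothetical, and an in-memory map stands in for the MIGRATIONTOOLKIT_TABLECOPYSTATUS table): because the crashed node never updates its stale RUNNING row, the poll loop never observes a terminal status.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StaleMigrationWaitDemo {
    // Stand-in for the MIGRATIONTOOLKIT_TABLECOPYSTATUS table.
    static final Map<String, String> statusByMigrationId = new ConcurrentHashMap<>();

    // Simplified wait loop: return only once the migration leaves RUNNING.
    static String waitForMigration(String migrationId, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            String status = statusByMigrationId.get(migrationId);
            if (!"RUNNING".equals(status)) {
                return status; // finished, aborted, or row missing
            }
            // The real job sleeps between polls and never gives up; we cap it here.
        }
        return "STILL_RUNNING";
    }

    public static void main(String[] args) {
        // Stale row left behind by the crashed node: nobody will ever update it.
        statusByMigrationId.put("b3be66a1-e52d-44c0-bc73-f84ed4e5f4e7", "RUNNING");
        System.out.println(waitForMigration("b3be66a1-e52d-44c0-bc73-f84ed4e5f4e7", 1000));
        // prints STILL_RUNNING
    }
}
```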
When we run some DB checks, this is the data we find:
```sql
SELECT * FROM MIGRATIONTOOLKIT_TABLECOPYSTATUS
```

```
migrationId                           startAt                      endAt  lastUpdate               total  completed  failed  status
b3be66a1-e52d-44c0-bc73-f84ed4e5f4e7  2023-11-14 12:00:48.0833333  NULL   2023-11-14 12:00:50.616  1      0          0       RUNNING
```
```sql
SELECT * FROM MIGRATIONTOOLKIT_TABLECOPYTASKS
```

```
targetnodeId migrationId pipelinename sourcetablename targettablename columnmap duration sourcerowcount targetrowcount failure error published truncated lastupdate avgwriterrowthroughput avgreaderrowthroughput copymethod keycolumns durationinseconds
13 b3be66a1-e52d-44c0-bc73-f84ed4e5f4e7 sapbwentry->SAPBWENTRY sapbwentry SAPBWENTRY {} 9183 0 0 0 0 2023-11-14 12:00:50.616 0.00 0.00 0.00
```
We have to go and manually mark the rows in these tables as failed/aborted:

```sql
UPDATE MIGRATIONTOOLKIT_TABLECOPYTASKS SET failure=1 WHERE migrationid='b3be66a1-e52d-44c0-bc73-f84ed4e5f4e7';
UPDATE MIGRATIONTOOLKIT_TABLECOPYSTATUS SET status='ABORTED' WHERE migrationid='b3be66a1-e52d-44c0-bc73-f84ed4e5f4e7';
```
Then when we start the cronjob, it does not find a running task; it creates a new one and works correctly.
We implemented a local fix for this, which we can submit to the repo:
- we listen for the `AfterCronJobCrashAbortEvent` that is sent by the `CronJobManager`
- when we get this event, we look for running tasks and mark them as aborted, as above
Since only a single DB migration can be in progress at a time in our setup, this works for us.
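The fix above can be sketched roughly as follows. `AfterCronJobCrashAbortEvent` is the platform event named in this issue, but the event record and in-memory status map below are stand-ins so the logic is self-contained; in the real listener the same update would be issued as SQL against MIGRATIONTOOLKIT_TABLECOPYSTATUS (and MIGRATIONTOOLKIT_TABLECOPYTASKS), typically from an `AbstractEventListener` bean.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CrashAbortCleanupDemo {

    // Stand-in for the platform event payload.
    record AfterCronJobCrashAbortEvent(String cronJobCode) {}

    // Stand-in for the MIGRATIONTOOLKIT_TABLECOPYSTATUS table.
    static final Map<String, String> copyStatus = new ConcurrentHashMap<>();

    // Since only one migration runs at a time in our setup,
    // we simply abort every migration still marked RUNNING.
    static void onEvent(AfterCronJobCrashAbortEvent event) {
        copyStatus.replaceAll((migrationId, status) ->
                "RUNNING".equals(status) ? "ABORTED" : status);
    }

    public static void main(String[] args) {
        copyStatus.put("m1", "RUNNING"); // stale row from the crashed node
        onEvent(new AfterCronJobCrashAbortEvent("migrationCronJob"));
        System.out.println(copyStatus.get("m1")); // prints ABORTED
    }
}
```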
An even better approach would be to add a new attribute with the migrationId on the MigrationCronJob, so that when we get the event that the cronjob was aborted, we can cancel only the affected migration task.
We've recently fixed some newly discovered issues with resuming failed migrations (#14). The main issue was in the logic that fetches pending failed copy tasks: it filtered by cluster node ID, which produced invalid or even empty results, especially after a node restart (when no fixed cluster IDs are assigned).
If you already have something implemented to handle cronjob restart case, please either create a PR here, or point me to your fork (if possible) where you added such change.
Adding the migration ID to the cronjob model data would obviously be quite useful in multiple cases, starting with proper abort/failure handling. We could also attach the migration report log to the cronjob execution log, or at least reference the report download location from the job, via a Backoffice UI component or something similar.