
Migration tasks cannot restart when platform nodes crash

iccaprar opened this issue 1 year ago · 2 comments

We have encountered the following situation when running commerce-db-sync on CCv2:

  • a migration task was in progress
  • platform node crashed and was restarted by the CCv2 orchestrator
  • on platform restart, CronJobManager#getRunningOrRestartedCronJobsForNode sets the migration cronjob to ABORTED
  • we tried to restart the cronjob from Backoffice (a trigger-based restart ends up in the same situation)
  • the cronjob starts but never does anything, just waits forever for a sync task to complete

On further investigation, we found that the incremental job implementation checks whether a migration is already running and, if so, simply waits for it to finish. But that migration is not actually running: the thread executing it died with the node restart.
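The stale state is visible in the tables alone: a migration marked RUNNING whose lastUpdate has not moved for longer than any plausible heartbeat interval can be assumed dead. A minimal sketch of such a check (Python with an in-memory SQLite stand-in for the real schema; the table and column names come from the dumps below, but the threshold value and the check itself are our assumption, not part of commerce-db-sync):

```python
import sqlite3
from datetime import datetime, timedelta

# In-memory stand-in for the real MIGRATIONTOOLKIT_TABLECOPYSTATUS table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE MIGRATIONTOOLKIT_TABLECOPYSTATUS (
        migrationId TEXT PRIMARY KEY,
        lastUpdate  TEXT,
        status      TEXT
    )
""")
# A migration left behind by a crashed node: still RUNNING, but stale.
conn.execute(
    "INSERT INTO MIGRATIONTOOLKIT_TABLECOPYSTATUS VALUES (?, ?, ?)",
    ("b3be66a1-e52d-44c0-bc73-f84ed4e5f4e7",
     (datetime.now() - timedelta(hours=2)).isoformat(sep=" "),
     "RUNNING"),
)

STALE_AFTER = timedelta(minutes=10)  # assumed heartbeat threshold

def find_stale_migrations(conn):
    """Return ids of RUNNING migrations whose lastUpdate is older than the threshold."""
    cutoff = (datetime.now() - STALE_AFTER).isoformat(sep=" ")
    rows = conn.execute(
        "SELECT migrationId FROM MIGRATIONTOOLKIT_TABLECOPYSTATUS "
        "WHERE status = 'RUNNING' AND lastUpdate < ?", (cutoff,))
    return [r[0] for r in rows]

print(find_stale_migrations(conn))
```

ISO-formatted timestamps compare correctly as strings, so the `lastUpdate < ?` filter needs no date parsing in the sketch.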

When we run some DB checks, this is the data we find:

```sql
SELECT * FROM MIGRATIONTOOLKIT_TABLECOPYSTATUS;
```

| migrationId | startAt | endAt | lastUpdate | total | completed | failed | status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| b3be66a1-e52d-44c0-bc73-f84ed4e5f4e7 | 2023-11-14 12:00:48.0833333 | | 2023-11-14 12:00:50.616 | 1 | 0 | 0 | RUNNING |

```sql
SELECT * FROM MIGRATIONTOOLKIT_TABLECOPYTASKS;
```

| targetnodeId | migrationId | pipelinename | sourcetablename | targettablename | columnmap | duration | sourcerowcount | targetrowcount | failure | error | published | truncated | lastupdate | avgwriterrowthroughput | avgreaderrowthroughput | copymethod | keycolumns | durationinseconds |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 13 | b3be66a1-e52d-44c0-bc73-f84ed4e5f4e7 | sapbwentry->SAPBWENTRY | sapbwentry | SAPBWENTRY | {} | | 9183 | 0 | 0 | | 0 | 0 | 2023-11-14 12:00:50.616 | 0.00 | 0.00 | | | 0.00 |

To recover, we have to manually mark the tasks as failed and the migration as aborted in these tables:

```sql
UPDATE MIGRATIONTOOLKIT_TABLECOPYTASKS SET failure=1 WHERE migrationid='b3be66a1-e52d-44c0-bc73-f84ed4e5f4e7';

UPDATE MIGRATIONTOOLKIT_TABLECOPYSTATUS SET status='ABORTED' WHERE migrationid='b3be66a1-e52d-44c0-bc73-f84ed4e5f4e7';
```

Then, when we start the cronjob, it does not find a running migration; it creates a new one and works correctly.

iccaprar · Jan 29 '24 15:01

We implemented a local fix for this, which we can submit to the repo:

  • we listen for the AfterCronJobCrashAbortEvent sent by the CronJobManager
  • when we receive this event, we look for running copy tasks and mark them as aborted, as above

Since only a single DB migration can run at a time in our setup, this works for us.
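The cleanup the listener performs is essentially the manual fix above, wrapped in one operation. A framework-free sketch of that operation (Python with SQLite standing in for the migration tables; in the real fix this runs inside the hybris event listener, which is not shown here):

```python
import sqlite3

def abort_migration(conn, migration_id):
    """Mark all copy tasks of a migration as failed and the migration itself as
    ABORTED -- the same two UPDATEs we otherwise run by hand."""
    conn.execute(
        "UPDATE MIGRATIONTOOLKIT_TABLECOPYTASKS SET failure = 1 "
        "WHERE migrationId = ?", (migration_id,))
    conn.execute(
        "UPDATE MIGRATIONTOOLKIT_TABLECOPYSTATUS SET status = 'ABORTED' "
        "WHERE migrationId = ?", (migration_id,))
    conn.commit()

# Demo against a minimal stand-in schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MIGRATIONTOOLKIT_TABLECOPYSTATUS (migrationId TEXT, status TEXT)")
conn.execute("CREATE TABLE MIGRATIONTOOLKIT_TABLECOPYTASKS (migrationId TEXT, failure INTEGER)")
mid = "b3be66a1-e52d-44c0-bc73-f84ed4e5f4e7"
conn.execute("INSERT INTO MIGRATIONTOOLKIT_TABLECOPYSTATUS VALUES (?, 'RUNNING')", (mid,))
conn.execute("INSERT INTO MIGRATIONTOOLKIT_TABLECOPYTASKS VALUES (?, 0)", (mid,))

abort_migration(conn, mid)
status = conn.execute("SELECT status FROM MIGRATIONTOOLKIT_TABLECOPYSTATUS").fetchone()[0]
failure = conn.execute("SELECT failure FROM MIGRATIONTOOLKIT_TABLECOPYTASKS").fetchone()[0]
print(status, failure)  # ABORTED 1
```

Parameterizing by `migration_id` is what makes the targeted-abort variant possible once the id is stored on the cronjob model.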

An even better approach would be to add a new migrationId attribute to the MigrationCronJob, so that when we receive the abort event we can cancel only the affected migration task.

iccaprar · Jan 29 '24 15:01

We've recently fixed some newly discovered issues around resuming failed migrations (#14). The main problem was that the logic fetching pending failed copy tasks filtered by cluster node ID, which returned invalid or even empty results, especially after a node restart (when no fixed cluster IDs are assigned).

If you already have something implemented to handle the cronjob restart case, please either create a PR here or point me to your fork (if possible) where you added that change.

Adding the migration ID to the cronjob model data would obviously be quite useful in multiple cases, starting with proper abort/failure handling. We could also attach the migration report log to the cronjob execution log, or at least reference the report download location from the job via a Backoffice UI component or something similar.

lnowakowski · Feb 14 '24 09:02