gpdb icon indicating copy to clipboard operation
gpdb copied to clipboard

Error out if any segment has a running basebackup

Open Annu149 opened this issue 3 years ago • 1 comments

gprecoverseg, gpmovemirrors should show warning and gpexpand should raise an exception if there is already a running instance of pg_basebackup for the segment that needs to be recovered/moved.

How this issue was produced:

  • gprecoverseg -F was called to bring up down segments
  • This gprecoverseg call was interrupted because of a disconnected putty session
  • Because of this, the main gprecoverseg process was killed but the background pg_basebackup process was still running
  • After this, gprecoverseg -F was run multiple times, all of which failed because of multiple parallel basebackup processes running per segment.
  • One of the gprecoverseg ran till completion before running any other gprecoverseg operation. Once this basebackup process was finished, the background code also automatically started the segment
  • Then gprecoverseg -ar was run which caused a PANIC because the multiple basebackups in parallel had already corrupted the data dir

Fix:

Create a list segments_with_running_basebackup of segment dbids for which pg_basebackup is running and while creating the recovery triplets, check if the segment to be recovered/moved/added(expansion segments) is already present in the segments_with_running_basebackup list. If that's the case give warning or raise exception and terminate the utility execution.

Behaviour of gprecoverseg - Warning message is logged containing list of segment contentIds with running basebackup and for the rest of the segments gprecoverseg will proceed with recovery

Behaviour of gpmovemirrors - Warning message is logged containing list of segment contentIds with running basebackup and for the rest of the mirrors gpmovemirrors will proceed with movement

Behaviour of gpexpand - Exception is raised and execution is stopped immediately if segments to be added have running pg_basebackup

Behaviour of gpaddmirrors - As gpaddmirrors does not use force-overwrite flag with pg_basebackup to sync newly added mirrors with primary, the check is not required here. Have added behave test case to ensure it fails when mirror directories are not empty

Limitation of above fix:

To get the list of segments with running pg_basebackup we are querying gp_stat_replication table, At present this table shows only content ID of the source segment for any running pg_basebackup process and no information about the target segment/DataDirectory

Added behave testcase for gprecoverseg, gpmovemirrors and gpaddmirrors

Here are some reminders before you submit the pull request

  • [ ] Add tests for the change
  • [ ] Document changes
  • [ ] Communicate in the mailing list if needed
  • [ ] Pass make installcheck
  • [ ] Review a PR in return to support the community

Annu149 avatar Jun 06 '22 13:06 Annu149

Had sync with @kaknikhil, here is what we discussed:

  • Change the test so that it ignores the segments with running basebackup but brings up others.
  • Change the PR/commit description to explain the jira issue, followed by how the PR fixes it and what are the limitations
  • Document the findings from manual tests (from the story)
  • other PR comments

Annu149 avatar Aug 01 '22 13:08 Annu149