gpdb icon indicating copy to clipboard operation
gpdb copied to clipboard

Improving gprecoverseg performance by parallel execution of checks

Open piyushc01 opened this issue 2 years ago • 11 comments

Previously in gprecoverseg when we try to stop the segments it was taking too long to get and verify the list of postmaster processes on segments as this part was mostly executed in serial fashion. _get_running_postgres_segments which is executing several tasks and fetching information and that part is in serial fashion which needs to be changed. Currently _get_running_postgres_segments performs following tasks (of which many are on remote host):

  1. dereference the sym-link on remote host by getting realpath
  2. From remote hosts, get pid of the process from postmaster file
  3. Check on remote host if the above pid exists by executing kill -0 <pid>
  4. Check if the given PID is of a postmaster process (on remote host)

gprecoverseg does all the above checks on PID before going to stop the segment to make sure that we are stopping the correct segment and the PID mentioned is actually running on the system and the process exists. This is because in case of failures, the process might not exists on the machine. And if we go to stop the instances which are already stopped, pg_ctl will throw the error.

This current PR covers following:

  • After the current changes,dereferencing, getting pid from postmaster file, checking If process exists and making sure that PID is of a postmaster process will be executed in parallel on all the segments.
  • Unit test cases for the newly added functions
  • Added get function to worker pool to return the command-name which helps in co-relating the command with the segment when getting is from worker pool.
  • Also some IDE added visual enhancement to code like adding new line, introducing space etc.

Here are some reminders before you submit the pull request

  • [ ] Add tests for the change
  • [ ] Document changes
  • [ ] Communicate in the mailing list if needed
  • [ ] Pass make installcheck
  • [ ] Review a PR in return to support the community

piyushc01 avatar Jul 18 '22 11:07 piyushc01

  • Always helpful practice to follow to have code indentation fixes to be in separate commit from enhancements/fixes
  • This enhancement of functionality of infra "Added get function to worker pool to return the command-name which helps in co-relating the command with the segment when getting is from worker pool." should also be separate commit (like even if the code using this functionality needs to be reverted for some reason, this can remain in repository). Hence its important to have smaller granular functional commits.

ashwinstar avatar Jul 22 '22 00:07 ashwinstar

Please can you high-level explain why gprecoverseg goes at this length to check if postmaster is running for the given data directory by parsing the PID file and all such (ideally all this PID file format and all should be internal details of the Server code)? Why not gprecoverseg just issue pg_ctl stop irrespective and ignore if failure is due to

ashwin@innovate:~/workspace/gpdb$ pg_ctl -D <data_dir> stop
pg_ctl: PID file "<data_dir>/postmaster.pid" does not exist
Is server running?

ashwinstar avatar Jul 22 '22 00:07 ashwinstar

Please can you high-level explain why gprecoverseg goes at this length to check if postmaster is running for the given data directory by parsing the PID file and all such (ideally all this PID file format and all should be internal details of the Server code)? Why not gprecoverseg just issue pg_ctl stop irrespective and ignore if failure is due to

ashwin@innovate:~/workspace/gpdb$ pg_ctl -D <data_dir> stop
pg_ctl: PID file "<data_dir>/postmaster.pid" does not exist
Is server running?

@ashwinstar I have updated the PR description with more details. That's a good idea to improve performance further. We can think of offloading this check to pg_ctl and can reduce processing in gprecoverseg. We can explore more in that direction.

piyushc01 avatar Jul 26 '22 12:07 piyushc01

  • Always helpful practice to follow to have code indentation fixes to be in separate commit from enhancements/fixes
  • This enhancement of functionality of infra "Added get function to worker pool to return the command-name which helps in co-relating the command with the segment when getting is from worker pool." should also be separate commit (like even if the code using this functionality needs to be reverted for some reason, this can remain in repository). Hence its important to have smaller granular functional commits.

Sure, will try to follow this practice.

piyushc01 avatar Jul 26 '22 12:07 piyushc01

gprecoverseg does all the above checks on PID before going to stop the segment to make sure that we are stopping the correct segment and the PID mentioned is actually running on the system and the process exists. This is because in case of failures, the process might not exists on the machine. And if we go to stop the instances which are already stopped, pg_ctl will throw the error.

I still don't feel the need for this PR. Let's remove the unnecessary functionality instead of optimizing the unnecessary. Did you explore the option to catch someway the server not running error from pg_ctl and take all this logic out?

ashwinstar avatar Aug 02 '22 19:08 ashwinstar

gprecoverseg does all the above checks on PID before going to stop the segment to make sure that we are stopping the correct segment and the PID mentioned is actually running on the system and the process exists. This is because in case of failures, the process might not exists on the machine. And if we go to stop the instances which are already stopped, pg_ctl will throw the error.

I still don't feel the need for this PR. Let's remove the unnecessary functionality instead of optimizing the unnecessary. Did you explore the option to catch someway the server not running error from pg_ctl and take all this logic out?

I tried on my local machine by removing this process check and it is required because same gprecoverseg is used in rebalance operation also. In rebalance we have stopped the segments already and then initiate the gprecoverseg command again to switch the roles. During this step as we have already stopped the segments, again we trying to stop the process but missing postmaster.pid file makes sure we aren to stopping them. As while we are ignoring the errors, but we try to stop the segments again is resulting in few of the behave tests failure where it is expected that we are stopping segments only once. So removing this check of process before stopping is not a good idea at the moment. But in future when we refactor the rebalance recovery operation, we can take care of this.

piyushc01 avatar Sep 02 '22 11:09 piyushc01

gprecoverseg does all the above checks on PID before going to stop the segment to make sure that we are stopping the correct segment and the PID mentioned is actually running on the system and the process exists. This is because in case of failures, the process might not exists on the machine. And if we go to stop the instances which are already stopped, pg_ctl will throw the error.

I still don't feel the need for this PR. Let's remove the unnecessary functionality instead of optimizing the unnecessary. Did you explore the option to catch someway the server not running error from pg_ctl and take all this logic out?

I tried on my local machine by removing this process check and it is required because same gprecoverseg is used in rebalance operation also. In rebalance we have stopped the segments already and then initiate the gprecoverseg command again to switch the roles. During this step as we have already stopped the segments, again we trying to stop the process but missing postmaster.pid file makes sure we aren to stopping them. As while we are ignoring the errors, but we try to stop the segments again is resulting in few of the behave tests failure where it is expected that we are stopping segments only once. So removing this check of process before stopping is not a good idea at the moment. But in future when we refactor the rebalance recovery operation, we can take care of this.

That finding reinforces that current separate logic to check for postmaster.pid file is not required. pg_ctl itself is able to convey that information (via error message) if the file is not present instead of checking for it. So, if error is received checking the error message should inform the failure is due to file not existing, instead of separately checking for it. Hence, that makes it more strong argument to not do separate check for the file.

ashwinstar avatar Sep 02 '22 16:09 ashwinstar

gprecoverseg does all the above checks on PID before going to stop the segment to make sure that we are stopping the correct segment and the PID mentioned is actually running on the system and the process exists. This is because in case of failures, the process might not exists on the machine. And if we go to stop the instances which are already stopped, pg_ctl will throw the error.

I still don't feel the need for this PR. Let's remove the unnecessary functionality instead of optimizing the unnecessary. Did you explore the option to catch someway the server not running error from pg_ctl and take all this logic out?

I tried on my local machine by removing this process check and it is required because same gprecoverseg is used in rebalance operation also. In rebalance we have stopped the segments already and then initiate the gprecoverseg command again to switch the roles. During this step as we have already stopped the segments, again we trying to stop the process but missing postmaster.pid file makes sure we aren to stopping them. As while we are ignoring the errors, but we try to stop the segments again is resulting in few of the behave tests failure where it is expected that we are stopping segments only once. So removing this check of process before stopping is not a good idea at the moment. But in future when we refactor the rebalance recovery operation, we can take care of this.

That finding reinforces that current separate logic to check for postmaster.pid file is not required. pg_ctl itself is able to convey that information (via error message) if the file is not present instead of checking for it. So, if error is received checking the error message should inform the failure is due to file not existing, instead of separately checking for it. Hence, that makes it more strong argument to not do separate check for the file.

Yes, that logic to check f process exists is handled by pg_ctl already. Question here is, if we remove that part, during rebalance operation we try to stop non existent process and print following message:

INFO]:-
      COMMAND RESULTS
      STATUS--DIR:/Users/pchandwadkar/workspace/gpdb/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0--STOPPED:True--REASON:Forceful termination success: rc: 1 stdout:  stderr: pg_ctl: PID file "/Users/pchandwadkar/workspace/gpdb/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0/postmaster.pid" does not exist
      Is server running?

If we are fine with this, update the unit tests accordingly. @kaknikhil : what you suggest in this case?

piyushc01 avatar Sep 05 '22 11:09 piyushc01

Question here is, if we remove that part, during rebalance operation we try to stop non existent process and print following message:

Where is this information printed ? To the console or the log file ? I agree with Ashwin that pg_ctl should be enough by itself. Does the pg_ctl exception tell us the exact reason of the failure ? If yes, then we can make an informed decision based on that exception and print whatever we want to the console/log file. We don't need to print exactly what pg_ctl prints

kaknikhil avatar Sep 06 '22 21:09 kaknikhil

Question here is, if we remove that part, during rebalance operation we try to stop non existent process and print following message:

Where is this information printed ? To the console or the log file ? I agree with Ashwin that pg_ctl should be enough by itself. Does the pg_ctl exception tell us the exact reason of the failure ? If yes, then we can make an informed decision based on that exception and print whatever we want to the console/log file. We don't need to print exactly what pg_ctl prints

Thanks @kaknikhil. Currently we do not print this failed to stop messages to stdout but when in verbose mode gets printed to the console. Otherwise its part of the log. We are using gpsegstop.py script to stop the segments which in turn calls pg_ctl command and error message we get currently is as follows:

[INFO]:-
      COMMAND RESULTS
      STATUS--DIR:/Users/pchandwadkar/workspace/gpdb/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0--STOPPED:True--REASON:Forceful termination success: rc: 1 stdout:  stderr: pg_ctl: PID file "/Users/pchandwadkar/workspace/gpdb/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0/postmaster.pid" does not exist
      Is server running?

piyushc01 avatar Sep 07 '22 11:09 piyushc01

Question here is, if we remove that part, during rebalance operation we try to stop non existent process and print following message:

Where is this information printed ? To the console or the log file ? I agree with Ashwin that pg_ctl should be enough by itself. Does the pg_ctl exception tell us the exact reason of the failure ? If yes, then we can make an informed decision based on that exception and print whatever we want to the console/log file. We don't need to print exactly what pg_ctl prints

Thanks @kaknikhil. Currently we do not print this failed to stop messages to stdout but when in verbose mode gets printed to the console. Otherwise its part of the log. We are using gpsegstop.py script to stop the segments which in turn calls pg_ctl command and error message we get currently is as follows:

[INFO]:-
      COMMAND RESULTS
      STATUS--DIR:/Users/pchandwadkar/workspace/gpdb/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0--STOPPED:True--REASON:Forceful termination success: rc: 1 stdout:  stderr: pg_ctl: PID file "/Users/pchandwadkar/workspace/gpdb/gpAux/gpdemo/datadirs/dbfast_mirror1/demoDataDir0/postmaster.pid" does not exist
      Is server running?

So, apart from success error code, if we get the error message "postmaster.pid does not exist" should be treated as success as well, as gpstop worked since server is not running is the end goal of the command.

ashwinstar avatar Sep 08 '22 16:09 ashwinstar

Closing this PR for now. Will create a new PR with changes to remove the checks for process validation before segment recovery.

piyushc01 avatar Nov 04 '22 10:11 piyushc01