
Infrastructure to handle dynamic services

Open gkokolatos opened this issue 4 years ago • 2 comments

This PR is a result of the discussion found in #539 and should be considered part one in a series of future PRs.

  • What is a dynamic service?

Until now, any failure in a service running under supervision could potentially lead to a full node failure. That behaviour was solely controlled by the restart policy of the service. Additionally, services were statically defined or configured and initialized during startup. Once a service was started, it was not possible to stop it individually.

A dynamic service is one that can be started or stopped at will, at any point during the lifetime of the process group. It is run under supervision like any other service, yet its failures will not affect any other services of the node, regardless of its restart policy.

  • Why is it useful?

It is useful if non-core services are required. Such a service could be a connection pooler. Connection poolers can be added or removed at any node and at any point in the lifetime of the node.

  • Implementation details.

The careful reader will notice that the distinguishing characteristic of a dynamic service does not lie with the service itself but with the supervisor of the node: it is the supervisor's responsibility to start, stop and restart a service. As such, the current implementation has left the service struct intact and has expanded the supervisor accordingly.

The supervisor interface is expanded with three new functions: supervisor_add_dynamic() and supervisor_remove_dynamic(), which add/start and remove/stop a service under supervision respectively, and supervisor_service_exists(), which finds a service running under supervision. All services are uniquely identifiable by name.
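To illustrate the shape of such an interface, here is a minimal, self-contained sketch of a name-keyed registry. It is not the code in this PR: the `RegistrySketch` type, the `registry_sketch_*` function names, their signatures, and the fixed capacity are all assumptions made for illustration, and no process is actually forked.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define REGISTRY_MAX 8          /* hypothetical capacity */
#define REGISTRY_NAME_LEN 64    /* hypothetical name limit */

/* Hypothetical stand-in for the dynamic part of a supervisor. */
typedef struct RegistrySketch
{
	int count;
	char names[REGISTRY_MAX][REGISTRY_NAME_LEN];
} RegistrySketch;

/* Return the index of the named service, or -1 when it is not registered. */
static int
registry_sketch_find(const RegistrySketch *reg, const char *name)
{
	assert(reg != NULL && name != NULL);

	for (int i = 0; i < reg->count; i++)
	{
		if (strcmp(reg->names[i], name) == 0)
		{
			return i;
		}
	}
	return -1;
}

/* Analogue of an "exists" lookup: services are identified by name only. */
bool
registry_sketch_exists(const RegistrySketch *reg, const char *name)
{
	return registry_sketch_find(reg, name) >= 0;
}

/* Analogue of "add": refuse duplicates, overlong names and a full registry. */
bool
registry_sketch_add(RegistrySketch *reg, const char *name)
{
	if (strlen(name) >= REGISTRY_NAME_LEN ||
		reg->count >= REGISTRY_MAX ||
		registry_sketch_find(reg, name) >= 0)
	{
		return false;
	}
	strcpy(reg->names[reg->count++], name);
	return true;
}

/* Analogue of "remove": move the last entry into the freed slot. */
bool
registry_sketch_remove(RegistrySketch *reg, const char *name)
{
	int idx = registry_sketch_find(reg, name);

	if (idx < 0)
	{
		return false;
	}
	if (idx != reg->count - 1)
	{
		strcpy(reg->names[idx], reg->names[reg->count - 1]);
	}
	reg->count--;
	return true;
}
```

Because names are the only key, adding the same name twice fails in this sketch, which is one way to keep the "uniquely identifiable by name" property.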

The caller of the supervisor decides during startup whether it requires the handling of any dynamic services. If it does, it has to provide a dynamicHandler() function during startup. The supervisor calls this function when signaled to do so. To accommodate this, the function supervisor_start() has been expanded accordingly.
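The callback arrangement described above could look roughly like the following sketch. Everything here is an assumption for illustration: the `HandlerSketch` type, the callback signature, the `handler_sketch_*` names, and in particular the choice of SIGUSR1, since the description does not say which signal is used.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct HandlerSketch HandlerSketch;

/* Hypothetical callback type; the real dynamicHandler() may differ. */
typedef void (*DynamicHandlerSketch)(HandlerSketch *sup, void *arg);

struct HandlerSketch
{
	DynamicHandlerSketch dynamicHandler;    /* NULL: no dynamic services */
	void *dynamicHandlerArg;
};

/* Set from the signal handler, consumed by the loop. */
static volatile sig_atomic_t handlerSketchRequested = 0;

static void
handler_sketch_on_signal(int signo)
{
	(void) signo;
	handlerSketchRequested = 1;     /* only async-signal-safe work here */
}

/* Analogue of an expanded start function taking an optional handler. */
bool
handler_sketch_start(HandlerSketch *sup, DynamicHandlerSketch handler,
					 void *arg)
{
	assert(sup != NULL);

	sup->dynamicHandler = handler;
	sup->dynamicHandlerArg = arg;

	/* SIGUSR1 is an arbitrary choice made for this sketch. */
	if (handler != NULL &&
		signal(SIGUSR1, handler_sketch_on_signal) == SIG_ERR)
	{
		return false;
	}
	return true;
}

/* One iteration of the loop: run the handler if a request is pending. */
bool
handler_sketch_loop_once(HandlerSketch *sup)
{
	if (handlerSketchRequested && sup->dynamicHandler != NULL)
	{
		handlerSketchRequested = 0;
		sup->dynamicHandler(sup, sup->dynamicHandlerArg);
		return true;
	}
	return false;
}

/* Example handler: counts invocations through the opaque argument. */
void
handler_sketch_count_calls(HandlerSketch *sup, void *arg)
{
	(void) sup;
	(*(int *) arg)++;
}
```

The signal handler only sets a flag and the callback runs from the loop, so the callback is free to do work that would not be safe inside a signal handler.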

The structure defining the supervisor has been expanded with two anonymous structs, dynamicServices and dynamicRecentlyRemoved. These structs are not meant to be accessed directly. What is a bit more interesting is why two were needed. It is due to the supervisor_loop design: there is a gap between a service stopping and the supervisor acting on it. During that gap, the supervisor has to remember which service was stopped and its previously running pid in order to take the appropriate action. In short, if an exited service is found in the dynamicRecentlyRemoved array, then the supervisor can continue as normal.
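The role of dynamicRecentlyRemoved can be sketched as below. This is a simplified illustration and not the PR's code: the `RemovedSketch` layout, the `removed_sketch_*` functions, the enum values and the fixed capacity are all hypothetical, and the pid is passed in directly instead of being obtained from waitpid().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

#define REMOVED_MAX 8           /* hypothetical capacity */
#define REMOVED_NAME_LEN 64     /* hypothetical name limit */

/* What the loop should do with a child process that has exited. */
typedef enum RemovedExitAction
{
	EXIT_WAS_REQUESTED,         /* removed on purpose: continue as normal */
	EXIT_NEEDS_RESTART_POLICY   /* unexpected: fall back to the usual path */
} RemovedExitAction;

/* Hypothetical layout, loosely mirroring the member name in the text. */
typedef struct RemovedSketch
{
	struct
	{
		int count;
		struct
		{
			char name[REMOVED_NAME_LEN];
			pid_t pid;
		} array[REMOVED_MAX];
	} dynamicRecentlyRemoved;
} RemovedSketch;

/* Called when a service is stopped on request: remember its name and pid. */
bool
removed_sketch_remember(RemovedSketch *sup, const char *name, pid_t pid)
{
	assert(sup != NULL && name != NULL);

	int count = sup->dynamicRecentlyRemoved.count;

	if (count >= REMOVED_MAX || strlen(name) >= REMOVED_NAME_LEN)
	{
		return false;
	}
	strcpy(sup->dynamicRecentlyRemoved.array[count].name, name);
	sup->dynamicRecentlyRemoved.array[count].pid = pid;
	sup->dynamicRecentlyRemoved.count++;
	return true;
}

/* Called later, when the loop learns that the process with this pid exited. */
RemovedExitAction
removed_sketch_on_exit(RemovedSketch *sup, pid_t pid)
{
	int count = sup->dynamicRecentlyRemoved.count;

	for (int i = 0; i < count; i++)
	{
		if (sup->dynamicRecentlyRemoved.array[i].pid == pid)
		{
			/* Forget the entry: overwrite it with the last one. */
			sup->dynamicRecentlyRemoved.array[i] =
				sup->dynamicRecentlyRemoved.array[count - 1];
			sup->dynamicRecentlyRemoved.count--;
			return EXIT_WAS_REQUESTED;
		}
	}
	return EXIT_NEEDS_RESTART_POLICY;
}
```

In this sketch an entry is consumed the first time its pid is seen exiting, so a later, unrelated exit with the same pid would go through the normal path again.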

gkokolatos, Jan 18 '21 15:01

@DimCitus Thank you for your comments and review so far.

As you can see, there have been a few commits that should address some of the minor and easy-to-fix comments. If you do not mind, let us get those straight before moving on. If you do mind, please let me know and I will do a second pass.

gkokolatos, Jan 19 '21 16:01

@DimCitus thank you for your reviewing efforts.

Please find some updates based on comments and off-list discussions. If there are any existing comments that can be resolved, I would much appreciate it if they were marked as such, so that I can focus my efforts on the remaining, or new, issues.

gkokolatos, Jan 28 '21 17:01