
Stroom Proxy - Ability to drain

gcdev373 opened this issue 1 year ago

When shutting down a Stroom proxy instance, it is often necessary to drain its data to downstream Stroom endpoints.

This matters most when the whole system is going down, and it is vital before shutting down a Stroom Proxy instance permanently.

Stroom Proxy should have an option to automatically attempt a drain on shutdown. Ideally it would also be distributed with a separate script that drains the proxy and returns a zero exit code only once the drain is complete, to assist scripting.
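As a sketch of what such a script might look like, assuming a hypothetical drain API: the `/proxyAdmin/drain` and `/proxyAdmin/drainStatus` endpoints and the default port below do not exist today and are purely illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical drain script for stroom-proxy. The /proxyAdmin/drain and
# /proxyAdmin/drainStatus endpoints are assumptions for illustration only.
PROXY_URL="${PROXY_URL:-http://localhost:8090}"
POLL_INTERVAL_SECS="${POLL_INTERVAL_SECS:-5}"
TIMEOUT_SECS="${TIMEOUT_SECS:-600}"

trigger_drain() {
  # Ask the proxy to stop accepting data and forward everything in its repo.
  curl -f -s -X POST "${PROXY_URL}/proxyAdmin/drain" > /dev/null
}

remaining_files() {
  # Assumed to return the number of files still held in the repo.
  curl -f -s "${PROXY_URL}/proxyAdmin/drainStatus"
}

drain_and_wait() {
  trigger_drain || return 1
  elapsed=0
  while [ "$(remaining_files)" -gt 0 ]; do
    if [ "${elapsed}" -ge "${TIMEOUT_SECS}" ]; then
      echo "Timed out waiting for proxy to drain" >&2
      return 1
    fi
    sleep "${POLL_INTERVAL_SECS}"
    elapsed=$((elapsed + POLL_INTERVAL_SECS))
  done
}

# Usage (exits zero only once the repo is empty):
#   drain_and_wait || exit 1
```

The timeout matters for scripting: without it, a stuck forward destination would block the shutdown indefinitely.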

gcdev373 avatar Jun 07 '24 12:06 gcdev373

There might be a bit of a hole when a feed's status changes from RECEIVE to REJECT or DROP. Ideally, any events already received at the proxy could still be forwarded to Stroom. This might be worth considering at the same time as this issue.

gcdev373 avatar Jun 17 '24 09:06 gcdev373

This would integrate nicely with K8s, where pods may be terminated and rescheduled as part of day-to-day operations (such as node maintenance).

The Stroom K8s Operator takes the approach of disabling tasks and waiting for them to complete using the K8s pre-stop hook.

A similar approach could be taken with stroom-proxy, whereby an API endpoint is called, causing the application to forward any streams currently held in its repo to the destination(s), prior to termination.

The sequence of events would be:

  1. Pod receives signal to terminate
  2. Pod enters the Terminating state. At this point, K8s removes it from the Service, effectively preventing it from receiving further traffic.
  3. Pre-stop hook fires, executing a bash script that calls an API and waits for the remaining files in the repo to be forwarded. Once forwarding completes (or the terminationGracePeriodSeconds interval expires), the script exits and the pod is allowed to terminate.
  4. A termination signal is sent to the stroom-proxy process, which causes the pod to be terminated.
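The wiring in steps 3–4 maps onto the standard pod lifecycle fields. A sketch of the relevant pod-spec fragment, where `/stroom-proxy/drain.sh` is a hypothetical script performing the trigger-and-wait described above:

```yaml
spec:
  # Upper bound on how long K8s waits for the pre-stop hook plus graceful
  # shutdown before sending SIGKILL.
  terminationGracePeriodSeconds: 600
  containers:
    - name: stroom-proxy
      lifecycle:
        preStop:
          exec:
            # Hypothetical script that triggers the drain and blocks until
            # the repo is empty.
            command: ["bash", "-c", "/stroom-proxy/drain.sh"]
```

Note that terminationGracePeriodSeconds bounds the pre-stop hook and the subsequent SIGTERM handling combined, so the hook's own timeout should be set comfortably below it.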

p-kimberley avatar Nov 21 '24 09:11 p-kimberley

On proxy this could be done with a Dropwizard Task (https://www.dropwizard.io/en/release-4.0.x/manual/core.html#tasks) on the admin port to trigger the drain-down, possibly with another task to poll the drain status. This would keep the action away from the main API, as the admin interface runs on a different port.
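Dropwizard tasks are invoked with an HTTP POST to `/tasks/<task-name>` on the admin port, so the trigger/poll pair could be driven from a script. A minimal sketch, assuming hypothetical task names `drain` and `drain-status` (neither exists today) and that the status task prints `COMPLETE` once the repo is empty:

```shell
# The admin port shown is an assumption; use whatever the proxy config defines.
ADMIN_URL="${ADMIN_URL:-http://localhost:8091}"

start_drain() {
  # Dropwizard tasks are always invoked via POST, even read-only ones.
  curl -f -s -X POST "${ADMIN_URL}/tasks/drain" > /dev/null
}

wait_for_drain() {
  start_drain || return 1
  # Hypothetical drain-status task printing e.g. DRAINING or COMPLETE.
  while [ "$(curl -f -s -X POST "${ADMIN_URL}/tasks/drain-status")" != "COMPLETE" ]; do
    sleep 5
  done
}
```

Keeping these on the admin port means the drain control surface is never reachable through the normal datafeed interface.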

Ideally, Stroom and the proxy should themselves know what needs to be done for an orderly shutdown, so the admin or K8s operator either fires a single API call to initiate it, or it simply happens as part of the normal shutdown process.

at055612 avatar Nov 21 '24 10:11 at055612

Will also need to consider the retry queues. There is probably no point in waiting for items on the retry queue to send before shutting down, as they are liable to fail again.

In a scale down scenario where a proxy node is being taken out of the cluster, some thought is needed on what to do with anything left in the retry queues or error destinations.

at055612 avatar Mar 11 '25 14:03 at055612