sdk-core

[Feature Request] Investigate ways to determine if worker polling is healthy

Open cretz opened this issue 2 years ago • 0 comments

Is your feature request related to a problem? Please describe.

There is currently no easy way to know if a worker's poll calls are failing. Users want to make a call on the worker to know whether it's healthy or backing off due to server failure.

Describe the solution you'd like

TBD. Options:

  • Create metrics like workflow_task_queue_poll_failed and activity_task_queue_poll_succeed/activity_task_queue_poll_failure and encourage checking those metrics
    • A bit hacky to ask users to do manual subtraction and state management
    • These metrics have value anyway, so we should probably add them. long_request_failure is not very detailed (but technically good enough if we exposed a way to create custom metric labels per client).
  • Populate some kind of internal std::sync::atomic::AtomicBool recording whether the most recent poll call succeeded for a worker (or client) and expose some kind of getter to check it
  • Support for general gRPC interceptors from lang through Rust could help advanced uses like this and others
  • Some other on-poll-failed callback mechanism?
  • Customize retry logic for workers so users can opt in to failing workers more eagerly
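
To make the AtomicBool option above concrete, here is a minimal sketch of what such a flag and getter might look like. Everything in it is an assumption for illustration: `PollHealth`, `record`, and `is_healthy` are hypothetical names, not existing sdk-core APIs, and a real poller would call `record` wherever it handles the result of a poll call.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical health flag shared between a poller and user code.
/// Not part of sdk-core; names are illustrative only.
#[derive(Debug)]
pub struct PollHealth {
    last_poll_ok: AtomicBool,
}

impl PollHealth {
    /// Starts out healthy, on the assumption that no poll has failed yet.
    pub fn new() -> Self {
        Self {
            last_poll_ok: AtomicBool::new(true),
        }
    }

    /// Records the outcome of the most recent poll call.
    pub fn record<T, E>(&self, result: &Result<T, E>) {
        // Relaxed is enough here: the flag is an independent status bit
        // and does not guard any other memory.
        self.last_poll_ok.store(result.is_ok(), Ordering::Relaxed);
    }

    /// Getter that user code could call to check the last poll outcome.
    pub fn is_healthy(&self) -> bool {
        self.last_poll_ok.load(Ordering::Relaxed)
    }
}

fn main() {
    let health = Arc::new(PollHealth::new());
    // The poller would hold its own clone of the Arc.
    let poller_view = Arc::clone(&health);

    // Simulate a failed poll followed by a successful one.
    let failed: Result<(), &str> = Err("server unavailable");
    poller_view.record(&failed);
    println!("healthy after failed poll: {}", health.is_healthy());

    let ok: Result<(), &str> = Ok(());
    poller_view.record(&ok);
    println!("healthy after successful poll: {}", health.is_healthy());
}
```

A single boolean only reflects the latest poll, so one transient failure flips it to unhealthy until the next success. Separate flags per poller (workflow vs. activity), or a consecutive-failure counter, might be more useful in practice.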

cretz • Jan 03 '23 17:01