Query workflow high latency after a long inactive time
There is a design issue in Cadence that can cause high latency for queryWorkflow. If a query is the first action on a workflow after a long period of inactivity, the query request can take more than 5 seconds.
When worker hosts restart, the sticky tasklist may not be reset, and today there is no mechanism for telling the Cadence server to reset it. Later, when a query task is dispatched, the server still sends it to the sticky tasklist first. That attempt eventually times out, after which the server resets the sticky tasklist and resends the task to the normal tasklist. As a result, the latency is much higher than usual.
More than three years ago, as a mitigation, we introduced stickyTTL in https://github.com/uber/cadence/issues/2261, which invalidates the sticky tasklist once the stickyTTL expires. This has proved to mitigate the production issues at Uber. However, because of the potential performance penalty, we didn't change the default value.
Another idea is to implement https://github.com/uber/cadence/issues/2369, but this requires a lot of work, and we never prioritized it.
Another approach is to automatically invalidate the sticky tasklist when processing a query task if there has been no active poller for some time, for example 1 minute. In terms of performance penalty, this is much safer than the stickyTTL approach.
This is fixed in Temporal: https://github.com/temporalio/temporal/issues/2363