[Bug] Sticky execution after Worker shutdown causes "Workflow Task Timed Out"
What are you really trying to do?
Restarting workers, even with a graceful shutdown, causes Workflow Tasks to time out due to sticky execution.
Describe the bug
When a Workflow is executing and its Workflow Worker restarts, the next Workflow Task is still scheduled to that worker because of execution stickiness, which causes a "Workflow Task Timed Out". The expected behavior would be for the Worker to inform the Temporal server of its shutdown and disable stickiness, so that the next Workflow Task is scheduled to the original task queue.
Moreover, the Workflow Worker should inform the server as soon as the shutdown process starts, so that no new Workflow Tasks are assigned to it during its graceful shutdown; the worker is no longer polling at that point, and any such Workflow Tasks would time out.
Below is a minimal example that reproduces this bug. The Workflow sleeps for 10 seconds and then returns. The Workflow Worker starts, runs for 5 seconds, and shuts down. The Workflow Task generated after the sleep should be assigned to the original task queue, since the Workflow Worker is no longer running.
Here you can see a screenshot of the current behavior. There is a "Workflow Task Timed Out" because of sticky execution.
With sticky execution disabled (max_cached_workflows=0), the Workflow Task doesn't time out, as expected.
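For reference, disabling stickiness for the comparison run just means passing max_cached_workflows=0 to the Worker constructor; a minimal sketch of that call, reusing the client and names from the reproduction below:

# Same worker as in the reproduction below, but with the sticky workflow
# cache disabled, so every Workflow Task is scheduled on the normal task queue
worker = Worker(
    client,
    task_queue="my-task-queue",
    workflows=[SimpleWorkflow],
    graceful_shutdown_timeout=timedelta(seconds=3),
    max_cached_workflows=0,
)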
Minimal Reproduction
import asyncio
import multiprocessing
from datetime import timedelta

from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker


@workflow.defn
class SimpleWorkflow:
    @workflow.run
    async def run(self) -> None:
        workflow.logger.info("Running SimpleWorkflow")
        workflow.logger.info("Sleeping for 10 seconds")
        await workflow.sleep(timedelta(seconds=10))
        workflow.logger.info("Done sleeping")
        return


async def worker():
    # Create client connected to server at the given address
    client = await Client.connect("localhost:7233")

    # Run the worker
    async with Worker(
        client,
        task_queue="my-task-queue",
        workflows=[SimpleWorkflow],
        graceful_shutdown_timeout=timedelta(seconds=3),
        # max_cached_workflows=0,
    ) as worker:
        print("Starting worker")
        # Run for 5 seconds
        await asyncio.sleep(5)
        print("Stopping worker")
    print("Stopping Temporal client")


async def client():
    # Create client connected to server at the given address
    client = await Client.connect("localhost:7233")

    # Execute a workflow
    print("Executing SimpleWorkflow")
    await client.execute_workflow(
        SimpleWorkflow.run,
        id="test-id",
        task_queue="my-task-queue",
        run_timeout=timedelta(seconds=25),
    )
    print("Finished SimpleWorkflow")


def start_client():
    asyncio.run(client())


if __name__ == "__main__":
    # Start client in different process
    multiprocessing.Process(target=start_client).start()

    # Start worker
    asyncio.run(worker())
    print("Exited worker")
Environment/Versions
- OS and processor: M2 Mac, but the same error occurs with an AMD-based image
- Temporal Version: CLI 1.3.0 (Server 1.27.1, UI 2.36.0) and sdk-python 1.10.0
- Are you using Docker or Kubernetes or building Temporal from source? Tried with Temporal CLI and also with Docker and Kubernetes.
Additional context
I discussed this issue with @cretz during the Replay conference (thank you!).
Thanks for opening this! We will investigate. This penalty always existed in the past but we have a new-ish ShutdownWorker call and something may be amiss here.
Just a heads up that we are looking into this. One issue we have found is that ShutdownWorker may not be working properly on cloud and is in the process of being fixed. However, this issue seems to state that you can replicate with a local dev server. We have replicated this and believe it may be a server-side bug. We will update when we have more details.
Ok, we have done some investigation on this and it is expected behavior, but the cosmetics of the task timeout could be improved, and we have opened an issue for that (https://github.com/temporalio/temporal/issues/7566).
What's happening in your code snippet is that you are shutting down the worker but not starting a new one. If you had a new worker, you would see no penalty or task timeout. The task shows as scheduled on the sticky queue, but it is not necessarily tied to that queue since its worker has shut down; if a worker were available listening on the normal queue, that is the one that would pick it up. The task timeout only serves to reschedule the task back onto the normal queue explicitly instead of implicitly, and it only happens because there is no worker around.
So to recap: there is no runtime sticky penalty, and a task scheduled on the sticky queue is effectively scheduled on both queues. The only reason you are getting a task timeout is that no worker is running to pick up that on-both-queues task, so it gets rescheduled onto the normal queue. There is no runtime behavior difference here.
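As an illustration, a second long-lived worker polling the normal task queue would be the one to pick the task up; a minimal sketch, reusing SimpleWorkflow, the imports, and the task queue from the reproduction above (the 30-second lifetime is just an assumption to outlast the workflow's timer):

async def second_worker():
    # Separate worker polling the normal (non-sticky) task queue
    client = await Client.connect("localhost:7233")
    async with Worker(
        client,
        task_queue="my-task-queue",
        workflows=[SimpleWorkflow],
    ):
        # Stay up long enough for the 10s timer to fire and the final
        # Workflow Task to be picked up from the normal queue
        await asyncio.sleep(30)

# e.g. run in another process alongside the original script:
# asyncio.run(second_worker())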
Does this make sense? Can you confirm that this issue is only cosmetic, i.e. you see a task timeout in history but there is no effect on runtime behavior? Can you also confirm that if you have another worker running, instead of no worker at all, the task timeout doesn't appear in history either?
Yes, it makes sense and thank you for the explanation! I tested locally by adding a second worker and it worked exactly as expected. Thank you for your time!