
[Bug] Sticky execution after Worker shutdown causes "Workflow Task Timed Out"

Open • gonced8 opened this issue 10 months ago • 2 comments

What are you really trying to do?

Restarting workers, even with graceful shutdown, causes Workflow Tasks to time out due to sticky execution.

Describe the bug

When a Workflow is executing and its Workflow Worker restarts, the next Workflow Task still gets scheduled to that worker because of execution stickiness, causing a "Workflow Task Timed Out". The expected behavior would be for the Worker to inform the Temporal server of its shutdown and disable stickiness, so that the next Workflow Task is scheduled to the original task queue.

Moreover, the Workflow Worker should inform the server as soon as the shutdown process starts, so that no new Workflow Tasks are assigned to it during its graceful shutdown; the worker would no longer be polling, and those Workflow Tasks would time out as a result.

Below is a minimal example that reproduces this bug. The Workflow sleeps for 10 seconds and then returns. The Workflow Worker starts, stays running for 5 seconds, and shuts down. The Workflow Task after the sleep should be assigned to the original task queue, since the Workflow Worker is no longer running.

Here you can see a screenshot of the current behavior: there is a "Workflow Task Timed Out" because of sticky execution. [screenshot]

With sticky execution disabled (max_cached_workflows=0), the Workflow Task does not time out, as expected. [screenshot]
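
For reference, a minimal sketch of this workaround (reusing SimpleWorkflow and the settings from the reproduction below, where the same option appears commented out):

import asyncio
from datetime import timedelta

from temporalio.client import Client
from temporalio.worker import Worker


async def run_worker_without_sticky_cache():
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="my-task-queue",
        workflows=[SimpleWorkflow],  # SimpleWorkflow as defined in the reproduction below
        graceful_shutdown_timeout=timedelta(seconds=3),
        # Setting the workflow cache size to 0 disables sticky execution, so
        # every Workflow Task is scheduled on the normal task queue.
        max_cached_workflows=0,
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(run_worker_without_sticky_cache())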

Minimal Reproduction

import asyncio
import multiprocessing
from datetime import timedelta

from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker


@workflow.defn
class SimpleWorkflow:
    @workflow.run
    async def run(self) -> None:
        workflow.logger.info("Running SimpleWorkflow")
        workflow.logger.info("Sleeping for 10 seconds")
        await workflow.sleep(timedelta(seconds=10))
        workflow.logger.info("Done sleeping")
        return


async def worker():
    # Create client connected to server at the given address
    client = await Client.connect("localhost:7233")

    # Run the worker
    async with Worker(
        client,
        task_queue="my-task-queue",
        workflows=[SimpleWorkflow],
        graceful_shutdown_timeout=timedelta(seconds=3),
        # max_cached_workflows=0,
    ) as worker:
        print("Starting worker")
        # Run for 5 seconds
        await asyncio.sleep(5)
        print("Stopping worker")

    print("Stopping Temporal client")


async def client():
    # Create client connected to server at the given address
    client = await Client.connect("localhost:7233")

    # Execute a workflow
    print("Executing SimpleWorkflow")
    await client.execute_workflow(
        SimpleWorkflow.run,
        id="test-id",
        task_queue="my-task-queue",
        run_timeout=timedelta(seconds=25),
    )
    print(f"Finished SimpleWorkflow")


def start_client():
    asyncio.run(client())


if __name__ == "__main__":
    # Start client in different process
    multiprocessing.Process(target=start_client).start()

    # Start worker
    asyncio.run(worker())
    print("Exited worker")

Environment/Versions

  • OS and processor: M2 Mac, but the same error also occurs with an AMD-based image
  • Temporal Version: CLI 1.3.0 (Server 1.27.1, UI 2.36.0) and sdk-python 1.10.0
  • Are you using Docker or Kubernetes or building Temporal from source? Tried with Temporal CLI and also with Docker and Kubernetes.

Additional context

Discussed this issue with @cretz during Replay conference (thank you!).

gonced8 • Mar 05 '25 15:03

Thanks for opening this! We will investigate. This penalty always existed in the past but we have a new-ish ShutdownWorker call and something may be amiss here.

cretz • Mar 07 '25 11:03

Just a heads up that we are looking into this. One issue we have found is that ShutdownWorker may not be working properly on cloud and is in the process of being fixed. However, this issue seems to state that you can replicate with a local dev server. We have replicated this and believe it may be a server-side bug. We will update when we have more details.

cretz • Mar 26 '25 13:03

Ok, we have investigated this and it is expected behavior, but the cosmetics of the task timeout could be improved, and we have opened an issue for that (https://github.com/temporalio/temporal/issues/7566).

What's happening in your code snippet is that you are shutting down the worker but not starting a new one. If you had a new worker you would see no penalty or task timeout. The task shows as scheduled on the sticky queue, but it's not necessarily on that queue anymore since the worker has been shut down; if a worker were available listening on the normal queue, that would be the one that picked it up. The task timeout only serves to re-schedule the task back on the normal queue explicitly instead of implicitly, and that only happens because there is no worker around.

To recap: there is no runtime sticky penalty, and the task scheduled on the sticky queue is effectively scheduled on both queues. The only reason you are getting a task timeout is that there is no worker running to pick up the task, so it gets rescheduled onto the normal queue explicitly. There is no runtime behavior difference here.
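
As an illustration, here is a minimal sketch of checking the history for that event with the Python SDK, assuming the reproduction's "test-id" workflow ID and the handle's fetch_history() call:

import asyncio

from temporalio.api.enums.v1 import EventType
from temporalio.client import Client


async def dump_history_event_types():
    client = await Client.connect("localhost:7233")
    handle = client.get_workflow_handle("test-id")
    history = await handle.fetch_history()
    for event in history.events:
        # A single WorkflowTaskTimedOut event here is the only visible effect
        # of the sticky-queue timeout; the workflow still completes normally.
        print(EventType.Name(event.event_type))


if __name__ == "__main__":
    asyncio.run(dump_history_event_types())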

Does this make sense? Can you confirm that this issue is only cosmetic, in that you see a task timeout in history but it has no bearing on the runtime behavior? Can you also confirm that if you have another worker running instead of no worker, even the task timeout in history doesn't occur?
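
For illustration, "another worker running" can be as simple as a second process polling the same normal task queue; a minimal sketch reusing SimpleWorkflow and the task queue name from the reproduction above:

import asyncio
from datetime import timedelta

from temporalio.client import Client
from temporalio.worker import Worker


async def run_second_worker():
    client = await Client.connect("localhost:7233")
    # A second worker polling the same (normal) task queue. Because it keeps
    # running, it picks up the Workflow Task after the first worker shuts
    # down, and no task timeout appears in history.
    worker = Worker(
        client,
        task_queue="my-task-queue",
        workflows=[SimpleWorkflow],  # SimpleWorkflow from the reproduction above
        graceful_shutdown_timeout=timedelta(seconds=3),
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(run_second_worker())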

cretz • Apr 03 '25 18:04

Yes, it makes sense and thank you for the explanation! I tested locally by adding a second worker and it worked exactly as expected. Thank you for your time!

gonced8 • Apr 04 '25 14:04