Kubernetes Orchestrator fails to update the run status if the server goes down during execution.
Contact Details [Optional]
No response
System Information
ZenML Kubernetes stack with the stress test pipeline (1000 steps / 100 parallel)
What happened?
I ran the stress test pipeline. Due to the heavy load and the memory limits my server was configured with, the server restarted. During this process, some steps failed to start. At the end, the orchestrator logs were as follows:
The orchestrator endpoint code that logs this looks as follows:
```python
if failed_step_names:
    logger.error(
        "The following steps failed: %s",
        ", ".join(failed_step_names),
    )
if skipped_step_names:
    logger.error(
        "The following steps were skipped because some of their "
        "upstream steps failed: %s",
        ", ".join(skipped_step_names),
    )

step_runs = fetch_step_runs_by_names(
    step_run_names=failed_step_names, pipeline_run=pipeline_run
)

for step_name, node_state in nodes_statuses.items():
    if node_state != NodeStatus.FAILED:
        continue

    pipeline_failed = True

    if step_run := step_runs.get(step_name, None):
        # Try to update the step run status, if it exists and is in
        # a transient state.
        if step_run and step_run.status in {
            ExecutionStatus.INITIALIZING,
            ExecutionStatus.RUNNING,
        }:
            publish_utils.publish_failed_step_run(step_run.id)

# If any steps failed and the pipeline run is still in a transient
# state, we need to mark it as failed.
if pipeline_failed and pipeline_run.status in {
    ExecutionStatus.INITIALIZING,
    ExecutionStatus.RUNNING,
}:
    publish_utils.publish_failed_pipeline_run(pipeline_run.id)
```
This should have published the failed status for the pipeline run, but somehow the pipeline run got stuck in the `RUNNING` state.
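To make the expected behavior concrete, here is a minimal, self-contained simulation of the logic above. The publish calls are stubbed out and the enums are re-declared locally, so this is not the actual ZenML code, just a sketch of the control flow:

```python
from enum import Enum

class NodeStatus(Enum):
    FAILED = "failed"
    COMPLETED = "completed"

class ExecutionStatus(Enum):
    INITIALIZING = "initializing"
    RUNNING = "running"

published = []  # records every publish call the logic would make

def finalize(nodes_statuses, step_runs, pipeline_run_status):
    """Re-implements the status-publishing loop with stubbed publishers."""
    pipeline_failed = False
    for step_name, node_state in nodes_statuses.items():
        if node_state != NodeStatus.FAILED:
            continue
        pipeline_failed = True
        step_run = step_runs.get(step_name)
        # The step-level publish only happens if the step run exists
        # in the DB and is still in a transient state.
        if step_run and step_run["status"] in {
            ExecutionStatus.INITIALIZING,
            ExecutionStatus.RUNNING,
        }:
            published.append(("step", step_name))
    # The run-level publish should fire regardless of whether the
    # failed step runs exist, as long as the run itself is transient.
    if pipeline_failed and pipeline_run_status in {
        ExecutionStatus.INITIALIZING,
        ExecutionStatus.RUNNING,
    }:
        published.append(("run", "failed"))

# The scenario from this issue: the failed steps never made it into
# the DB, so step_runs is empty -- yet the run-level publish should
# still be attempted.
finalize(
    nodes_statuses={"step_1": NodeStatus.FAILED, "step_2": NodeStatus.COMPLETED},
    step_runs={},
    pipeline_run_status=ExecutionStatus.RUNNING,
)
print(published)  # [('run', 'failed')]
```

In this simulation the run-level publish fires even with an empty `step_runs` mapping, which is why the run ending up stuck in `RUNNING` points at something going wrong inside or around `publish_failed_pipeline_run` rather than in the loop itself.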
Reproduction steps
Deploy a workspace on the high-availability tier and run the stress test pipeline with 1000 steps and 100 parallel steps.
Relevant log output
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
Hi @bcdurak, I'd like to work on this issue. From my initial understanding, the solution would involve adding an async function to the current code that health-checks the server to see if it's up and, if not, terminates the execution and updates the run status to failed. Please let me know if that direction aligns with what you have in mind; if so, it would be awesome if you could assign me :)
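For context, something along these lines is what I had in mind. All names here (`server_is_up`, `mark_run_failed`) are hypothetical stand-ins, not ZenML APIs:

```python
import asyncio

async def watchdog(run_id, server_is_up, mark_run_failed,
                   interval=5.0, max_misses=3):
    """Poll the server and fail the run after max_misses consecutive misses."""
    misses = 0
    while True:
        if await server_is_up():
            misses = 0  # server answered, reset the counter
        else:
            misses += 1
            if misses >= max_misses:
                # Server looks gone for good: mark the run as failed.
                await mark_run_failed(run_id)
                return
        await asyncio.sleep(interval)

# Demo: a server that never answers fails the run after two misses.
async def never_up():
    return False

failed_runs = []

async def record_failure(run_id):
    failed_runs.append(run_id)

asyncio.run(watchdog("run-1", never_up, record_failure,
                     interval=0.01, max_misses=2))
print(failed_runs)  # ['run-1']
```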
Hey @adam6878, we are working on adding a health-check/heartbeat to our steps, so that part will soon be covered.
This issue is more about figuring out what kind of bug led to this state. If you take a look at the code block and the logs I've shared above, you can see:
- There are 4 failed steps. These steps do not exist in the DB because the server was down during this process.
- The server was successfully rebooted and the remaining steps were executed successfully.
- However, the status of the run got stuck in a `Running` state even though the run finished with failed steps.
- The `failed` update of the run status through the `publish_failed_pipeline_run` function did not work as intended.
I hope this clears it up a bit. Feel free to take a look :)