Heartbeat does not detect zombie processes when using Local Agent
Description
I have been testing different situations in which a task may fail due to external causes (e.g. I used a kill -9
command to kill the task process). I discovered that when using a Local Agent, the zombie process is never detected, whether I run against Prefect Cloud or locally on my laptop.
However, if I stop the Local Agent and restart it, it detects the zombie process and behaves correctly; when using Prefect Cloud the flow run is even rescheduled, thanks to the Lazarus process.
For additional context: when using the Docker Agent and killing the flow run with docker kill <container_id>,
everything works correctly (after a few minutes the flow is retried) and there is no need to restart the agent.
Expected Behavior
I expect the zombie detection and rescheduling that currently only happen after restarting the Local Agent to work without that restart.
Reproduction
Here is the flow definition I used to test this:
import datetime
import time
import os

import prefect
from prefect import task, Flow


def append_result(result):
    with open("/tmp/file.txt", "a") as f:
        f.write(result)
        f.write("\n")


@task
def delete_file():
    try:
        os.remove("/tmp/file.txt")
    except FileNotFoundError:
        pass


@task(max_retries=5, retry_delay=datetime.timedelta(seconds=2), timeout=60)
def generate_file_simple():
    for i in range(10):
        time.sleep(1)
        append_result(f"{datetime.datetime.now()}: {i}. I am PID: {os.getpid()}")


with Flow("be-killed") as flow:
    t1 = delete_file()
    t2 = generate_file_simple()
    # set dependency
    t2.set_upstream(t1)

# register the flow in Prefect Cloud and kick off a run
with open("../prefect-cloud-user-token") as token_file:
    user_api_token = token_file.read().strip()

client = prefect.Client(api_token=user_api_token)
client.login_to_tenant(tenant_slug="XXXXX")
flow_id = flow.register(project_name="XXXXX")  # project name is a placeholder
flow_run_id = client.create_flow_run(flow_id=flow_id)
Once I see that the agent is running the task and I verify that the file is being written, I kill the task process (its PID is written on each line of the file) by running:
import os
os.kill(XXXX, 9)
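Since each line of /tmp/file.txt ends with "I am PID: <pid>" (see generate_file_simple above), the PID can also be pulled from the file instead of typed in by hand. A small convenience sketch, not part of the original reproduction:

import os
import re
import signal

# Grab the PID from the last line written by generate_file_simple.
with open("/tmp/file.txt") as fh:
    last_line = fh.readlines()[-1]
pid = int(re.search(r"I am PID: (\d+)", last_line).group(1))

# Same effect as the manual os.kill(XXXX, 9) above.
os.kill(pid, signal.SIGKILL)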
Environment
{
"config_overrides": {},
"env_vars": [],
"system_information": {
"platform": "Darwin-19.4.0-x86_64-i386-64bit",
"prefect_version": "0.11.2",
"python_version": "3.7.7"
}
}
Hi @jcozar87 - very nice find; I can explain why this is happening for reference, but I agree this is not desirable for a number of reasons.
Heartbeats are spawned by each task run as a subprocess (here). This subprocess polls the Cloud API as long as it stays alive. It is only cleaned up once the task runner exits (whether successfully or unsuccessfully).
The local agent is unique in that it submits flow runs to run in a subprocess. Whenever you kill the local agent and / or the subprocess running the flow run, the heartbeat process is not correctly cleaned up. (This is in contrast to all other agents that run Flows within docker containers, so killing the docker container kills the corresponding processes).
We should definitely look to fix this so that heartbeats both behave as expected and to ensure Prefect correctly cleans up the resources that it creates.
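To make the failure mode concrete, here is a minimal, standalone sketch (not Prefect's actual heartbeat code; the script and its "heartbeat" argument are invented for illustration) showing why a kill -9 on the process that spawned the heartbeat leaves the heartbeat subprocess alive:

import os
import signal
import subprocess
import sys
import time

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "heartbeat":
        # Stand-in for the heartbeat subprocess: it keeps "polling" for as
        # long as it stays alive, regardless of what happened to its parent.
        while True:
            print(f"heartbeat from PID {os.getpid()}, parent is now {os.getppid()}")
            time.sleep(1)
    else:
        # Stand-in for the task runner: spawn the heartbeat child ...
        subprocess.Popen([sys.executable, __file__, "heartbeat"])
        time.sleep(3)
        # ... then die the same way the reporter killed the task. The child
        # is orphaned (re-parented to PID 1) and keeps printing heartbeats
        # until you kill it by hand.
        os.kill(os.getpid(), signal.SIGKILL)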
Oh I see! Thank you very much! I guess that, since the local agent submits flow runs to run in a subprocess, the heartbeat should check the subprocess status as well.
However, I find Docker (and the Docker Agent) more reliable in production. So, since the issue is fully explained, I will keep using Prefect with confidence :-)
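A hedged sketch of that idea (this is not Prefect's implementation; heartbeat_loop is hypothetical): the heartbeat subprocess could notice that it has been orphaned, since on POSIX an orphaned child is re-parented (typically to PID 1), and exit instead of polling forever.

import os
import time

def heartbeat_loop(poll_interval=5.0):
    # Remember who spawned us (the task-runner process).
    original_parent = os.getppid()
    while True:
        if os.getppid() != original_parent:
            # The task runner is gone: we have been re-parented, so stop
            # heartbeating and let the backend mark the run as failed.
            break
        # A real implementation would call the Cloud API here to report
        # the heartbeat; that call is intentionally omitted.
        time.sleep(poll_interval)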
I ran into this while testing Prefect's versioning feature. In detail: I ran the local agent, deployed and started a flow, sent SIGTERM to the agent mid-flow, deployed and started a new version of the flow, and restarted the agent. To my surprise, the first run just stayed around in a running/pending state.
If I understand the description of the issue correctly, all the agents are vulnerable to this failure mode in the sense that there is a process running on "our" infrastructure which causes flows to silently hang if it is killed along with the agent (even if killed gracefully). Is that correctly understood?
I still see this behavior on Prefect 2.6.4 running open source. We run the agent in Docker, and upon stopping that container:
- Flow runs are stuck in the 'running' state with no more updates (the run has stopped executing, and no future agent can pick it up).
While the agent runs, we notice that completed flow runs result in lots of [python3] <defunct> processes that can be found with ps -ef, which isn't a huge problem (I hope). But it suggests the same behavior @cicdw pointed out above is still around in Prefect 2.
I guess they always point us towards using the other Infra blocks, rather than local agents.
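If you want to confirm the same symptom programmatically rather than grepping ps -ef, a small diagnostic sketch (assuming psutil is installed; this helper is not part of Prefect) that lists the defunct children of a given agent PID:

import psutil

def defunct_children(agent_pid):
    # Return the PIDs of zombie (defunct) children left behind by the agent.
    parent = psutil.Process(agent_pid)
    return [
        child.pid
        for child in parent.children(recursive=True)
        if child.status() == psutil.STATUS_ZOMBIE
    ]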
@kevin868 There is a different issue #7239 for v2. There are no heartbeats in v2 at this time, but we would like to figure out a way to get this working.
I am also having this issue but with DockerAgent: Often, when the docker container that is running the flow dies for whatever reason, the flow status stays "Running" forever. Is this a known issue? (in V1)
I am also facing the issue that a flow runs forever but produces no logs (my Prefect agent version is 0.14.22). My current project runs more than 50 flows each time, with an average of about 5 minutes per flow to complete. But sometimes 1-2 flows keep running forever without logging anything; they just stop at some random step (requesting a URL, or even after all the tasks have finished). I have been researching a fix for several days now, and I have received suggestions that this issue is infrastructure-based, maybe memory or something, which somewhat makes sense. Is there any known issue related to what I am facing? And apart from memory, where else could the issue come from?
- Lately, I applied a timeout, and if the flow ends in a TimedOut state I run it again. But this does not seem sufficient, so I am looking for a different approach. Am I right to assume this is an infrastructure-based problem, specifically a memory problem? If it is memory, how do I properly manage the memory usage of flows? Thank you in advance.
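One way to test the memory hypothesis before changing infrastructure is to log the flow-run process's memory from inside a task. A diagnostic sketch only (assuming psutil is installed; log_memory is not a Prefect API), matching the Prefect 0.x task style used elsewhere in this thread:

import os

import psutil
import prefect
from prefect import task

@task
def log_memory():
    # Log the resident memory of the current flow-run process so hangs
    # can be correlated with memory pressure on the agent machine.
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
    prefect.context.get("logger").info(f"Resident memory: {rss_mb:.1f} MB")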
This issue is stale because it has been open 30 days with no activity. To keep this issue open remove stale label or comment.
This issue was closed because it has been stale for 14 days with no activity. If this issue is important or you have more to add feel free to re-open it.