fix(ci): cleanup zombie daemons on panics

Open lidel opened this issue 4 months ago • 1 comments

[!WARNING] Do not merge. I am using this PR as a sandbox for now, to see what fails on CI but rarely fails locally.

This PR is an exploration of how to make test/cli/harness more robust when spawned processes misbehave, particularly on self-hosted runners.

- Process tracking: a new system tracks all spawned processes and force-kills them when a test ends or panics
- Connection fixes: validate the peers array before accessing it, and connect serially to avoid TLS issues
- Daemon reliability: add retry logic for transient startup failures, and verify daemon readiness
  - :point_right: unsure if this is the right thing to do -- I'm OK with removing this from this PR (the important part is tracking daemons and ensuring we kill them).
- Auto-cleanup: background processes started via the harness are automatically registered for cleanup
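
The process-tracking and auto-cleanup idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `tracker` type and the `spawnSleeper` and `runTest` helpers are hypothetical names, and the example assumes a Unix-like runner where a `sleep` binary is available as a stand-in for a daemon.

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// tracker records every process that was started through it so that all of
// them can be force-killed later. Hypothetical type, not the harness API.
type tracker struct {
	mu    sync.Mutex
	procs []*exec.Cmd
}

// start launches cmd and registers it for cleanup.
func (t *tracker) start(cmd *exec.Cmd) error {
	if err := cmd.Start(); err != nil {
		return err
	}
	t.mu.Lock()
	t.procs = append(t.procs, cmd)
	t.mu.Unlock()
	return nil
}

// killAll force-kills and reaps every tracked process, empties the list,
// and returns how many processes it handled.
func (t *tracker) killAll() int {
	t.mu.Lock()
	procs := t.procs
	t.procs = nil
	t.mu.Unlock()
	for _, cmd := range procs {
		_ = cmd.Process.Kill() // error ignored: the process may already be gone
		_ = cmd.Wait()         // reap the child so it does not linger as a zombie
	}
	return len(procs)
}

// spawnSleeper starts a long-running placeholder process ("sleep 60")
// through the tracker. Assumes a Unix-like system.
func spawnSleeper(t *tracker) (*exec.Cmd, error) {
	cmd := exec.Command("sleep", "60")
	return cmd, t.start(cmd)
}

// runTest runs body and always calls killAll afterwards, even if body
// panics. It reports whether a panic was recovered. Only panics on the
// calling goroutine are caught by the deferred function.
func runTest(t *tracker, body func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
		t.killAll()
	}()
	body()
	return false
}

func main() {
	tr := &tracker{}
	cmd, err := spawnSleeper(tr)
	if err != nil {
		fmt.Println("could not start placeholder process:", err)
		return
	}
	panicked := runTest(tr, func() { panic("simulated test panic") })
	fmt.Println("panicked:", panicked, "reaped:", cmd.ProcessState != nil)
}
```

The sketch swallows the panic so that it can report it as a boolean; a real harness would more likely re-panic or fail the test after cleanup so the original failure stays visible.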

Impact:

- No more zombie ipfs daemon processes left after test panics
- More reliable test execution in CI
- Cleaner test harness code with better error handling

Aug 23 '25 23:08 lidel

> add retry logic for transient startup failures

I want to be more careful about not masking any startup failures, even transient ones. When we identify a specific failure condition, such as a TCP port that is still in use but will be free by the next retry, then we can retry on detecting that specific case. We should not retry on unknown failures.
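
One way to express "retry only on a known condition" is an allowlist check in front of the retry loop. The following is a sketch under stated assumptions, not the harness implementation: `isRetryable`, `startWithRetry`, and the two sample errors are hypothetical, and "address already in use" (`syscall.EADDRINUSE`) is used as the single example of a recognized transient failure.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
	"time"
)

// Sample errors for illustration only (hypothetical values).
var (
	errPortBusy  = fmt.Errorf("listen tcp 127.0.0.1:5001: %w", syscall.EADDRINUSE)
	errBadConfig = errors.New("invalid config")
)

// isRetryable reports whether err is a failure we have explicitly identified
// as transient. Anything not listed here is treated as a real failure.
func isRetryable(err error) bool {
	return errors.Is(err, syscall.EADDRINUSE)
}

// startWithRetry calls start up to attempts times. It retries only when
// isRetryable accepts the error; any other error is returned immediately so
// that it is not masked. It returns the number of calls made and the last error.
func startWithRetry(start func() error, attempts int, delay time.Duration) (int, error) {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 1; i <= attempts; i++ {
		err = start()
		if err == nil || !isRetryable(err) {
			return i, err
		}
		if i < attempts {
			time.Sleep(delay)
		}
	}
	return attempts, err
}

func main() {
	calls := 0
	// Simulated daemon start: the port is busy twice, then startup succeeds.
	n, err := startWithRetry(func() error {
		calls++
		if calls < 3 {
			return errPortBusy
		}
		return nil
	}, 5, 10*time.Millisecond)
	fmt.Println("attempts:", n, "err:", err)

	// An unrecognized failure is surfaced on the first attempt.
	n, err = startWithRetry(func() error { return errBadConfig }, 5, 10*time.Millisecond)
	fmt.Println("attempts:", n, "err:", err)
}
```

Keeping the allowlist in one small function means each newly identified transient condition is added deliberately, and everything else still fails fast.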

Aug 25 '25 17:08 gammazero