fix(ci): cleanup zombie daemons on panics

Open lidel opened this issue 4 months ago • 1 comments

[!WARNING] Do not merge. I am using this PR as a sandbox for now, to see what fails on CI but rarely fails locally.

This PR is an exploration into making `test/cli/harness` more robust when things go wrong with spawned processes and we run on self-hosted runners.

  • Process tracking: new system tracks all spawned processes and force-kills them on test end/panic
  • Connection fixes: validate peers array before access, use serial connections to avoid TLS issues
  • Daemon reliability: add retry logic for transient startup failures, verify daemon readiness
    • :point_right: unsure if this is the right thing to do -- I'm ok with removing this from this PR (the important part is tracking daemons and ensuring we kill them).
  • Auto-cleanup: background processes started via harness are automatically registered for cleanup

Impact:

  • No more zombie ipfs daemon processes left after test panics
  • More reliable test execution in CI
  • Cleaner test harness code with better error handling

lidel, Aug 23 '25 23:08

> add retry logic for transient startup failures

I want to be more careful about not masking any startup failures, even transient ones. When we identify a specific failure condition, such as a TCP port that is still in use but will be free by the next retry, then we can retry on detecting that specific case. We should not retry on any unknown failure.

gammazero, Aug 25 '25 17:08