Losing network for a while can end up with the runner running forever (on GH at least)
Even though we have added the job check on idleTimeout, the runner can get stuck with the job running and the runner status busy until the job timeout at least, or forever in the worst-case scenario (confirming...)
The job was meant to be `sleep 120`
info: Launching github runner
info: runner status {"date":"2022-05-24T09:25:02.621Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
info: runner status √ Connected to GitHub {"date":"2022-05-24T09:25:02.623Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
warn: SpotNotifier can not be started.
info: runner status Current runner version: '2.292.0' {"date":"2022-05-24T09:25:03.503Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
info: runner status Listening for Jobs {"date":"2022-05-24T09:25:03.504Z","repo":"https://github.com/DavidGOrtega/fashion_mnist","status":"ready"}
info: runner status Running job: train {"date":"2022-05-24T09:25:33.331Z","job":"gh","repo":"https://github.com/DavidGOrtega/fashion_mnist","status":"job_started"}
info: runner status Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected. {"date":"2022-05-24T09:29:05.389Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
info: runner status Runner reconnected. {"date":"2022-05-24T09:30:22.225Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
From other experiences with self-hosted runners, this is not really a cml runner issue.
The only change I would recommend is to start an idle-timeout check after detecting the "reconnected" event (see the sketch below).
The crux is that we can't really trust the GH API's job status after something like this.
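A minimal sketch of that idea, assuming we can hook into the runner's event/log stream; the event names, the timeout value and the `shutdown()` helper are illustrative assumptions, not actual cml internals:

```typescript
import { EventEmitter } from 'events';

// Stand-in for the runner's log/event stream (assumption, not the cml API).
const runnerEvents = new EventEmitter();

const IDLE_TIMEOUT_MS = 5 * 60 * 1000; // assumed idle window after a reconnect
let idleTimer: NodeJS.Timeout | undefined;

runnerEvents.on('reconnected', () => {
  // Don't trust the GH API job status here; just watch for local job activity.
  idleTimer = setTimeout(() => {
    console.warn('No job activity since reconnect; shutting the runner down');
    shutdown();
  }, IDLE_TIMEOUT_MS);
});

runnerEvents.on('job_started', () => {
  // Real activity resumed, so cancel the watchdog.
  if (idleTimer) clearTimeout(idleTimer);
});

// Hypothetical helper: real code would also unregister the runner and
// terminate the cloud instance.
function shutdown(): void {
  process.exit(1);
}
```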
Indeed, right now the pipeline failed, but GH says that the job is still running (I also checked via the API).
So our solution to make the runner more stable is not worthwhile, or at least very unreliable, since we depend on their job status.
In this particular case the runner is idle, so checking for idleness would be more effective than looking at the job, which stays running.
IDK, this is not an easy case to really handle in the context of cml/what we have right now. If there are network problems then we will likely have problems making the API calls to delete the instance unless the connectivity issue is directly with GitHub, in which case I think the instance should just terminate since it can't get jobs or finish any that it has (in a meaningful way)...
think the instance should just terminate since it can't get jobs or finish any that it has
I agree. If the connection is lost, the runner must shut down to avoid having a cloud runner running forever. We could have a flag to force it to continue.
Instance termination on error seems sensible (to avoid spiralling costs).
Not sure about debugging though (some users may prefer not to have runners shut down? Maybe add a flag to prevent auto-termination, as sketched below?)
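A rough sketch of that opt-out, assuming a hypothetical `--no-auto-terminate` flag (not an existing cml option): terminate on connection loss by default, but let users keep the instance alive for debugging.

```typescript
interface RunnerOpts {
  noAutoTerminate?: boolean; // hypothetical flag, e.g. --no-auto-terminate
}

function onConnectionLost(opts: RunnerOpts): void {
  if (opts.noAutoTerminate) {
    console.warn('Connection lost; auto-termination disabled, leaving the runner up for debugging');
    return;
  }
  console.error('Connection lost; terminating to avoid an orphaned cloud instance');
  // Placeholder: real code would deregister the runner and destroy the instance.
  process.exit(1);
}
```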
Just out of curiosity: my pipeline still displays the job as running
Just out of curiosity: my pipeline still displays the job as running
I think it will for a while 🙈 😆
While reviewing the 72h -> 35d GHA self-hosted runner timeout change (#1064), I stumbled on:
A self-hosted runner is automatically removed from GitHub if it has not connected to GitHub Actions for more than 30 days.^1
So even GH eventually shuns unreachable runners. We should certainly shut them down.
I don't think we have anything actionable here? When the connection is lost and the agent can't resume the connection, the process should exit and the runner should terminate. If the network is completely disconnected, I think we can safely say that is beyond the scope of what can be handled.
safely say that is beyond the scope
Disagree; network disconnects are clearly sometimes possible, and when (not if) they occur, they result in orphaned, forever-running cloud instances. That's bad. The cloud instances should be able to self-terminate (sketched below). Related: https://github.com/iterative/terraform-provider-iterative/issues/289.
But I have a feeling I haven't understood correctly.
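One way the instance could self-terminate even when the provider API is unreachable is to power off locally and rely on the cloud's shutdown behavior (e.g. AWS "instance-initiated shutdown behavior" set to terminate). A sketch under those assumptions; the threshold and wiring are illustrative, not current cml behavior:

```typescript
import { execFile } from 'child_process';

const MAX_DISCONNECTED_MS = 30 * 60 * 1000; // assumed: give the network 30 min to recover
let disconnectedSince: number | undefined;

function onDisconnected(): void {
  disconnectedSince = disconnectedSince ?? Date.now();
}

function onReconnected(): void {
  disconnectedSince = undefined;
}

// Called periodically by some watchdog loop (not shown).
function checkSelfTerminate(): void {
  if (disconnectedSince && Date.now() - disconnectedSince > MAX_DISCONNECTED_MS) {
    // No network, so provider API calls won't work; power off from inside instead.
    execFile('sudo', ['shutdown', '-h', 'now'], err => {
      if (err) console.error('Local shutdown failed:', err);
    });
  }
}
```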
network disconnects are clearly sometimes possible
Correct, and that shouldn't break it. I have observed GitHub Actions handle minor network interruptions just fine.
and when (not if) they occur it results in orphaned forever-running cloud instances? That's bad. The cloud instances should be able to self-terminate.
Hiccups are not the issue; the only time I've seen the described behavior is when something went very wrong with the instance. So my point is that segfaulting, OOM (to a lesser extent), and pulling the network plug are not things we can recover from. If no network connection exists, we can't make the API calls to delete the machine.
I suspect we're having multiple different conversations here xD
I think from our internal discussion we can close this as not planned. If someone disagrees, go ahead and re-open (preferably with a clear case or something reproducible :wink:)