
Losing network for a while can end up with the runner running forever (GH at least)

Open · DavidGOrtega opened this issue 3 years ago • 12 comments

Despite the job check we added on idleTimeout, the repo can get stuck with the job running and the runner status busy, until the job timeout at least, or forever in the worst-case scenario (confirming...)

DavidGOrtega · May 24 '22 09:05

The job was meant to be `sleep 120`

[screenshots]

DavidGOrtega · May 24 '22 09:05

```
info: Launching github runner
info: runner status {"date":"2022-05-24T09:25:02.621Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
info: runner status √ Connected to GitHub {"date":"2022-05-24T09:25:02.623Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
warn: SpotNotifier can not be started.
info: runner status Current runner version: '2.292.0' {"date":"2022-05-24T09:25:03.503Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
info: runner status Listening for Jobs {"date":"2022-05-24T09:25:03.504Z","repo":"https://github.com/DavidGOrtega/fashion_mnist","status":"ready"}
info: runner status Running job: train {"date":"2022-05-24T09:25:33.331Z","job":"gh","repo":"https://github.com/DavidGOrtega/fashion_mnist","status":"job_started"}
info: runner status Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected. {"date":"2022-05-24T09:29:05.389Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
info: runner status Runner reconnected. {"date":"2022-05-24T09:30:22.225Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
```

DavidGOrtega · May 24 '22 09:05

From other experiences with self-hosted runners, this is not really a cml runner issue.

The only change I would recommend is starting an idle-timeout check after detecting the "reconnected" event.

dacbd · May 24 '22 14:05
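
A minimal sketch of that idea, assuming the wrapper process watches the runner's log output for the reconnect message seen above; the timeout value and function names here are illustrative, not cml's actual implementation:

```typescript
import { ChildProcess } from 'child_process';

// Hypothetical idle window; cml's existing idleTimeout handling differs.
const IDLE_TIMEOUT_MS = 5 * 60 * 1000;
let idleTimer: ReturnType<typeof setTimeout> | undefined;

function armIdleTimeout(onIdle: () => void) {
  // (Re)start the countdown; if nothing clears it, treat the runner as idle.
  if (idleTimer) clearTimeout(idleTimer);
  idleTimer = setTimeout(onIdle, IDLE_TIMEOUT_MS);
}

function watchRunnerOutput(runner: ChildProcess, onIdle: () => void) {
  runner.stdout?.on('data', (chunk: Buffer) => {
    const line = chunk.toString();
    // After a reconnect the GH-reported job status can't be trusted,
    // so start the idle countdown.
    if (line.includes('Runner reconnected')) armIdleTimeout(onIdle);
    // Any sign of real job activity cancels the countdown again.
    if (line.includes('Running job:') && idleTimer) clearTimeout(idleTimer);
  });
}
```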

The crux is that we can't really trust the GH API's job status after something like this.

dacbd · May 24 '22 14:05

Indeed, right now the pipeline has failed but GH says that the job is still going on (I also checked via the API).

[screenshot]

So our solution to make the runner more stable is not worthwhile, or at least very unreliable, since we depend on those job statuses.

DavidGOrtega · May 24 '22 15:05
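
For reference, a check like the one mentioned above can be done with @octokit/rest (a sketch; owner, repo and run id are placeholders): it lists the jobs of a workflow run and returns their GH-reported statuses, which after an incident like this may still say "in_progress" even though the pipeline has effectively failed.

```typescript
import { Octokit } from '@octokit/rest';

// Return the name and GH-reported status of each job in a workflow run.
async function jobStatuses(token: string, owner: string, repo: string, runId: number) {
  const octokit = new Octokit({ auth: token });
  const { data } = await octokit.rest.actions.listJobsForWorkflowRun({
    owner,
    repo,
    run_id: runId,
  });
  return data.jobs.map((job) => ({ name: job.name, status: job.status }));
}
```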

[screenshot]

In this particular case the runner is idle, so checking the runner status would be more effective than looking at the job, which stays running.

[screenshot]

DavidGOrtega · May 24 '22 15:05

IDK, this is not an easy case to really handle in the context of cml / what we have right now. If there are network problems, then we will likely have problems making the API calls to delete the instance, unless the connectivity issue is directly with GitHub, in which case I think the instance should just terminate since it can't get jobs or finish any that it has (in a meaningful way)...

dacbd · May 24 '22 15:05

> think the instance should just terminate since it can't get jobs or finish any that it has

I agree. If the connection is lost, the runner must shut down to avoid having a cloud runner running forever. We could have a flag to force it to continue.

DavidGOrtega · May 24 '22 16:05
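
A hedged sketch of that behavior; nothing here exists in cml today, and the grace period, the forceContinue parameter and the shutdown command are assumptions: once the runner has been disconnected from GitHub for longer than a grace period, power the machine off locally unless the user opted to keep it running.

```typescript
import { execSync } from 'child_process';

const GRACE_PERIOD_MS = 10 * 60 * 1000; // assumed grace period
let shutdownTimer: ReturnType<typeof setTimeout> | undefined;

// Call with connected=false when a "connect error" is logged and
// connected=true once the runner reports it has reconnected.
function onConnectionEvent(connected: boolean, forceContinue: boolean) {
  if (connected) {
    // Connectivity restored: cancel any pending shutdown.
    if (shutdownTimer) clearTimeout(shutdownTimer);
    shutdownTimer = undefined;
    return;
  }
  if (forceContinue || shutdownTimer) return;
  shutdownTimer = setTimeout(() => {
    // A local shutdown works even when the cloud API is unreachable;
    // the instance must be configured to terminate (not just stop) on shutdown.
    execSync('sudo shutdown -h now');
  }, GRACE_PERIOD_MS);
}
```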

Instance termination on error seems sensible (to avoid spiralling costs).

Not sure about debugging though (some users may prefer not to have runners shut down? Maybe add a flag to prevent auto-termination?)

casperdcl · May 24 '22 17:05

Just out of curiosity: my pipeline still displays the job as running.

DavidGOrtega · May 25 '22 15:05

> Just out of curiosity: my pipeline still displays the job as running.

I think it will for a while 🙈 😆

dacbd · Jun 02 '22 00:06

While reviewing the 72h -> 35d GHA self-hosted runner timeout change (#1064), I stumbled on:

> A self-hosted runner is automatically removed from GitHub if it has not connected to GitHub Actions for more than 30 days.^1

So even GH eventually shuns unreachable runners. We should certainly shut them down.

casperdcl · Jun 20 '22 08:06

I don't think we have anything actionable here? When the connection is lost and the agent can't resume it, the process should exit and the runner terminate. If the network is completely disconnected, I think we can safely say that is beyond the scope of what can be handled.

dacbd · Oct 24 '22 17:10

> safely say that is beyond the scope

Disagree; network disconnects are clearly sometimes possible, and when (not if) they occur it results in orphaned forever-running cloud instances? That's bad. The cloud instances should be able to self-terminate. Related: https://github.com/iterative/terraform-provider-iterative/issues/289.

But I have a feeling I haven't understood correctly.

casperdcl · Oct 28 '22 22:10

> network disconnects are clearly sometimes possible

Correct, and that shouldn't break it. I have observed GitHub Actions handle minor network interruptions just fine.

> and when (not if) they occur it results in orphaned forever-running cloud instances? That's bad. The cloud instances should be able to self-terminate.

Hiccups are not the issue; the only time I've seen the described behavior is when something went very wrong with the instance. So my point is that seg-faulting, OOM (to a lesser extent), and pulling the network plug are not things we can recover from. If no network connection exists, we can't make the API calls to delete the machine.

dacbd · Oct 31 '22 14:10

I suspect we're having multiple different conversations here xD

casperdcl · Nov 04 '22 22:11

I think from our internal discussion we can close this as not planned. If someone disagrees, go ahead and re-open (preferably with a clear case or something reproducible :wink:).

dacbd · Nov 09 '22 16:11