staging.docs.ci.ocaml.org is unreachable

Open mtelvers opened this issue 2 years ago • 10 comments

staging.docs.ci.ocaml.org is unreachable over SSH and HTTPS. Please can it be rebooted?

mtelvers avatar Aug 24 '23 08:08 mtelvers

Rebooted; nothing on the console, but I suspect the OOM killer. We do need to save a non-SSH-key login for these machines so we can access dmesg (or have a log helper that shuttles the logs out regularly, though that doesn't help debug OOM-killer-related failures).
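
For whoever next gets a shell on this machine after a reboot, here is a minimal sketch for checking the OOM-killer suspicion. It assumes the host runs systemd with a persistent journal, so that kernel messages from the previous boot survive the reboot; nothing in this thread confirms that. The helper names `find_oom_kills` and `read_previous_boot_log` are hypothetical.

```python
import re
import subprocess

# Matches the kernel's kill record, for example:
# "Out of memory: Killed process 1234 (name) total-vm:..."
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log: str) -> list[tuple[int, str]]:
    """Return (pid, process name) for each OOM kill record in the log text."""
    return [(int(m.group(1)), m.group(2)) for m in OOM_KILL.finditer(kernel_log)]


def read_previous_boot_log() -> str:
    """Kernel messages from the previous boot.

    Assumption: systemd with a persistent journal; otherwise this
    returns little or nothing.
    """
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    for pid, name in find_oom_kills(read_previous_boot_log()):
        print(f"OOM killer terminated pid {pid} ({name})")
```

An empty result does not rule out memory exhaustion: if the machine hung before the journal was flushed to disk, the kill record may never have been written.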

avsm avatar Aug 29 '23 10:08 avsm

Thanks @avsm, I am working on restoring staging.docs.ci.ocaml.org.

tmcgilchrist avatar Aug 30 '23 22:08 tmcgilchrist

@avsm Can you check on staging.docs.ci.ocaml.org again? I set it up to run ocaml-docs-ci from a clean slate, but it is now unreachable over SSH and HTTPS again.

tmcgilchrist avatar Aug 31 '23 05:08 tmcgilchrist

> Rebooted; nothing on the console, but I suspect OOM killer.

Last I saw on my console, ocaml-docs-ci was using approximately 10 GB of RAM and was stable at that amount. Nothing else was using large amounts of RAM, and only the local solver instances were using significant CPU while resolving.

tmcgilchrist avatar Aug 31 '23 05:08 tmcgilchrist

Seems to be back this morning @avsm

tmcgilchrist avatar Aug 31 '23 21:08 tmcgilchrist

And now it's unreachable again, @avsm. I'm curious to see whether there are network errors or unexpected shutdowns on that machine.

tmcgilchrist avatar Sep 01 '23 06:09 tmcgilchrist

Rebooted. No indications of anything untoward on the console...

avsm avatar Sep 07 '23 18:09 avsm

The missing dashboards on Grafana have been fixed, but for some reason the only data we have for staging is a blip in July. This needs further investigation but, unsurprisingly, the server is once more unreachable. While waiting for this staging server to be restarted, we'll see if the issue can be reproduced on a VM.

rikusilvola avatar Sep 15 '23 11:09 rikusilvola

@avsm This machine is unavailable again; I'm really confused about what is happening with it.

@mtelvers has set up an alternative instance at https://staging.docs.ci.ocamllabs.io that we have switched over to using. Additionally, we have a working docker-compose setup for the entire ocaml-docs-ci pipeline. So I think we can remove this machine for now.

tmcgilchrist avatar Sep 21 '23 02:09 tmcgilchrist

Ping @avsm

tmcgilchrist avatar Nov 06 '23 06:11 tmcgilchrist