infrastructure
staging.docs.ci.ocaml.org is unreachable
staging.docs.ci.ocaml.org is unreachable over SSH and HTTPS. Please can it be rebooted?
Rebooted; nothing on the console, but I suspect OOM killer. We do actually need to save a non-SSH-key login to these machines to access the dmesg output (or have a log helper that shuttles the logs out regularly, but this doesn't help debug OOM-killer-related failures).
Thanks @avsm, I am working on restoring staging.docs.ci.ocaml.org.
@avsm Can you check on staging.docs.ci.ocaml.org again? I set it up running ocaml-docs-ci from a clean slate, but now it's unreachable over SSH and HTTPS again.
Rebooted; nothing on the console, but I suspect OOM killer.
Last I saw on my console, ocaml-docs-ci was using approx. 10 GB of RAM and was stable at that amount. Nothing else was using large amounts of RAM, and only the local solver instances would be using significant CPU while it was resolving.
Seems to be back this morning @avsm
And now it's unreachable again, @avsm. I'm curious to see whether there are network errors or unexpected shutdowns on that machine.
Rebooted. No indications of anything untoward on the console...
The missing dashboards on Grafana have been fixed, but for some reason the only data we have for staging is a blip in July. This needs further investigation but, unsurprisingly, the server is once more unreachable. While waiting for the restart of this staging server, we'll see if the issue can be reproduced on a VM.
@avsm This machine is unavailable again. I'm really confused about what is happening with it.
@mtelvers has set up an alternative instance at https://staging.docs.ci.ocamllabs.io, which we have switched over to using. Additionally, we have a working docker-compose setup for the entire ocaml-docs-ci pipeline. So I think we can remove this machine for now.
Ping @avsm