Add status.ocaml.org for monitoring

Open tmcgilchrist opened this issue 2 years ago • 3 comments

Migrating issue from the wiki to allow discussion.

What should be on a status.ocaml.org page?

At a minimum we should have operational status of:

  • OCaml website @ ocaml.org
  • watch.ocaml.org
  • Documentation CI pipeline https://docs.ci.ocaml.org
  • Deployer https://deploy.ci.ocaml.org
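To make "operational status" a bit more concrete, here is a rough Python sketch of what a per-service probe and a roll-up into one overall state might look like. Everything in it is an assumption for illustration, not an agreed design: the `Status` levels, the 2 second slowness threshold, the function names, and the use of `https://` for hosts the list gives without a scheme.

```python
import time
import urllib.error
import urllib.request
from enum import IntEnum


class Status(IntEnum):
    # Hypothetical levels; a higher value means a worse state.
    OPERATIONAL = 0
    DEGRADED = 1
    DOWN = 2


# Services from the list above. The https:// scheme is assumed
# where the list only gives a hostname.
SERVICES = {
    "OCaml website": "https://ocaml.org",
    "watch.ocaml.org": "https://watch.ocaml.org",
    "Documentation CI pipeline": "https://docs.ci.ocaml.org",
    "Deployer": "https://deploy.ci.ocaml.org",
}


def probe(url, timeout_s=10.0):
    """Fetch url and return (http_code or None if unreachable, latency in seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            code = resp.status
    except urllib.error.HTTPError as err:
        code = err.code  # the server answered, but with a 4xx/5xx code
    except (urllib.error.URLError, OSError):
        code = None  # DNS failure, refused connection, timeout, ...
    return code, time.monotonic() - start


def classify(http_code, latency_s, slow_after_s=2.0):
    """Map one probe result to a Status (the threshold is an arbitrary assumption)."""
    if http_code is None or not 200 <= http_code < 400:
        return Status.DOWN
    if latency_s > slow_after_s:
        return Status.DEGRADED
    return Status.OPERATIONAL


def overall(statuses):
    """The page-level status is the worst status of any component."""
    return max(statuses.values(), default=Status.OPERATIONAL)


def check_all(services=SERVICES):
    """Probe every service once and return {name: Status}."""
    return {name: classify(*probe(url)) for name, url in services.items()}
```

Calling `overall(check_all())` would give a single headline state, while the per-service dictionary feeds the individual rows on the page.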

What are the options for hosting it independently of the current infrastructure?

tmcgilchrist, Mar 09 '23 23:03

This is a good list to trawl through: https://github.com/ivbeg/awesome-status-pages. If we don't use one of the hosted options, we could host it on Mythic Beasts, separately from the Scaleway and Cambridge Computer Lab infrastructure.

avsm, Mar 11 '23 16:03

I'm keen on the style of something like https://status.gitlab.com, which has space for the various public-facing pieces plus the sub-systems that make everything work.

We are starting with a bottom-up approach, building monitoring pages for each of:

  • Documentation CI pipeline https://docs.ci.ocaml.org/
  • Deployer https://deploy.ci.ocaml.org/
  • Network status, such as reachability and SSL/TLS certs, on http://observer.ocamllabs.io
  • OCluster instance health

Then we can choose something independently hosted to feed those checks into. This is just an update to say we are working towards this, with work still to do. :-)
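For the SSL/TLS cert part of the network checks, a minimal Python sketch of an expiry check could look like the following. It is only an illustration: it is not how observer.ocamllabs.io works, and the function names and the 14 day warning window are assumptions.

```python
import socket
import ssl
import time


def fetch_not_after(host, port=443, timeout_s=10.0):
    """Connect to host over TLS and return the certificate's notAfter string."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Days left before a cert expires; negative once it has expired.

    not_after uses the format returned by getpeercert(),
    e.g. "Jan 15 00:00:00 2024 GMT".
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def needs_renewal(not_after, now=None, warn_days=14):
    """True when the cert is inside the (assumed) 14 day warning window."""
    return days_until_expiry(not_after, now) < warn_days
```

A check along the lines of `needs_renewal(fetch_not_after("ocaml.org"))` could then run on a schedule. Note that an expired certificate already fails the handshake in `fetch_not_after`, so that exception would need to be reported as a failure too.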

tmcgilchrist, May 29 '23 06:05

This all sounds good. Might you please coordinate with @mtelvers on his observer.ocamllabs.io prototype mentioned in https://github.com/ocaml/infrastructure/issues/42#issuecomment-1554623791? That looks like a good start, but I suspect its database will grow quite quickly as it's storing the results of ping rebuilds in each ocurrent node.

Also, as @hannesm mentions in #48, we need a check for the freshness of opam.ocaml.org. I suspect that would be better done as an email/Matrix message from a build failure in the deployer pipeline rather than a healthcheck though, since otherwise it'll be difficult to distinguish between "no pushes to opam-repo recently" and "not a fresh archive on opam.ocaml.org".
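To illustrate the ambiguity: the age of the archive alone says nothing, because an old archive is still fresh if nothing has been pushed since it was built. A healthcheck would need both timestamps, roughly as in this Python sketch. The function name, the three states, and the one hour grace period are all hypothetical, and the sketch says nothing about how either timestamp would actually be obtained.

```python
from datetime import datetime, timedelta, timezone


def archive_state(last_push, archive_built, now, grace=timedelta(hours=1)):
    """Classify the archive given the last opam-repo push and the archive build time.

    "fresh"   : the archive was built at or after the latest push, however old it is
    "pending" : a newer push exists, but the deployer is still within the grace period
    "stale"   : a newer push exists and the grace period has run out
    """
    if archive_built >= last_push:
        return "fresh"
    if now - last_push <= grace:
        return "pending"
    return "stale"


# Example: a week-old archive with no pushes since it was built is still "fresh".
now = datetime(2023, 6, 6, 17, 0, tzinfo=timezone.utc)
week_ago = now - timedelta(days=7)
quiet_repo = archive_state(last_push=week_ago, archive_built=week_ago, now=now)
```

A notification from a failed deployer build avoids this comparison entirely, since the pipeline only runs when there is something new to deploy.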

avsm, Jun 06 '23 17:06