
Monitoring deployments

Open jbednar opened this issue 3 years ago • 0 comments

Deployments from this repository are running on an AE5 instance using Kubernetes and are monitored by a variety of means:

  1. Kubernetes monitors each deployment automatically, restarting it if it encounters an error such as running out of memory
  2. We have an AWS Lambda running hourly that, in case of failure, generates a "Deployment uptime warning" email message listing the deployments ("endpoints") that have been started, along with which of them have failed in some way (e.g. Failed https://attractors.pyviz.demo.anaconda.com/ Unexpected error code for header (405 expected): 502)
  3. We have another AWS Lambda running weekly that generates similar messages plus "No failure accessing ..." for each deployment that is working, so that we know the systems are still running
  4. We have a separate Lumen-based monitor tool we can visit that gives an overview of restarts/cpu/memory/etc. per deployment, plus session info if available (number of visitors, session duration, etc.)
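
For reference, the kind of check described in items 2 and 3 can be sketched roughly as follows. This is not the actual Lambda code. It assumes that "header" in the example message means an HTTP HEAD request and that a healthy deployment answers it with 405; `head_status` and `check_endpoints` are hypothetical names.

```python
import urllib.error
import urllib.request

# Assumption: a healthy endpoint answers a HEAD request with 405,
# as the "(405 expected)" in the example email suggests.
EXPECTED_STATUS = 405


def head_status(url, timeout=10):
    """Return the HTTP status code of a HEAD request to `url`."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still what we want.
        return err.code


def check_endpoints(urls, fetch_status=head_status, expected=EXPECTED_STATUS):
    """Return one report line per endpoint, mimicking the email wording."""
    lines = []
    for url in urls:
        try:
            status = fetch_status(url)
        except OSError as err:  # covers URLError, timeouts, DNS failures
            lines.append(f"Failed {url} {err}")
            continue
        if status == expected:
            lines.append(f"No failure accessing {url}")
        else:
            lines.append(
                f"Failed {url} Unexpected error code for header "
                f"({expected} expected): {status}"
            )
    return lines
```

An hourly job would then send only the lines starting with "Failed", while a weekly job would send every line. `fetch_status` is injectable so the logic can be exercised without touching the network.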

There's a lot that needs to be done, either to improve the monitoring or to deal with issues the monitoring has already found:

  • [ ] Jean-Luc: Most deployments are missing session info and related stats; need to edit the .yml to pass --rest-session-info --session-history -1 so that we can monitor them.
  • [ ] Philipp: The kubectl queries appear to be unreliable, leading to speedometers showing "-" and related issues; need to add some retries.
  • [ ] Philipp: The Details page doesn't seem to have a "name" sort-by option; presumably it should.
  • [ ] Philipp: Sometimes clicking a Tabulator column header to sort doesn't change the order of any of the rows, even after clicking it several times and seeing the arrow flip up and down. Perhaps it's for all-NaN columns? Even in that case one would want the order, whatever it is, to reverse as the arrow goes up or down, or else it seems broken.
  • [ ] Maxime?: Many of the examples appear to be leaking memory, which probably needs to be investigated one by one. Categorization, in order of how alarming they look:
    • Clear leak (memory usage going up a lot daily): iex_trading, glaciers, voila_gpx_viewer, clifford, penguin_crossfilter, Panel-Gallery, attractors-1 (probably attractors-panel?), Portfolio Optimizer, euler, gapminders
    • Possible leak (memory usage going up, but only slowly and under light usage): particle-swarms
    • No leak visible (though perhaps simply low usage): opensky, particle-swarms, gull_tracking, uk_researchers, palmer_penguins, ml_annotators, gerrymandering, hipster_dynamics, square_limit, iex_trading, boids, sri_model
    • Definitely no leak (despite heavy usage): nyc_taxi, nyc_buildings, landsat, ship_traffic, attractors (probably attractors-notebook?), census
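
For the retry item above, a minimal sketch of what "add some retries" could look like, assuming the monitor shells out to `kubectl` through `subprocess`. The monitor's actual query code is not shown in this issue, so `run_with_retries` and its parameters are hypothetical.

```python
import subprocess
import time


def run_with_retries(command, attempts=3, delay=1.0, backoff=2.0,
                     runner=subprocess.run, sleep=time.sleep):
    """Run `command`, retrying on failure; return stdout, or None if every attempt fails."""
    for attempt in range(attempts):
        try:
            result = runner(command, capture_output=True, text=True, timeout=30)
            if result.returncode == 0:
                return result.stdout
        except (OSError, subprocess.TimeoutExpired):
            pass  # treat a missing binary or a hung call like any other failure
        if attempt < attempts - 1:
            # Wait longer after each failure: delay, delay*backoff, ...
            sleep(delay * backoff ** attempt)
    return None
```

A call such as `run_with_retries(["kubectl", "top", "pods"])` is only an example of usage; the point is that the speedometers would show "-" only after several consecutive failures rather than after one.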

Once we deal with the current issues it might be good to set up additional automated monitoring to look for memory leaks, or else we could just sort the monitor by number of restarts and eventually catch those anyway. In the meantime, lots of work to do!
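
As a starting point for that automated leak check, here is a rough sketch that fits a straight line to each deployment's memory samples and flags the ones growing faster than a threshold. The data layout (a dict mapping deployment name to `(time, memory)` pairs), the function names, and the threshold are all assumptions; nothing here reflects how the Lumen monitor stores its data.

```python
def memory_slope(samples):
    """Least-squares slope of memory over time for (time, memory) samples."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return 0.0  # all samples share one timestamp; no trend to fit
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def flag_leaks(history, threshold):
    """Return deployment names whose memory grows faster than `threshold`, steepest first."""
    slopes = {name: memory_slope(samples) for name, samples in history.items()}
    flagged = [name for name, slope in slopes.items() if slope > threshold]
    return sorted(flagged, key=lambda name: slopes[name], reverse=True)
```

With time in days and memory in MB, `threshold` would be the MB/day growth considered alarming. Since a restart resets memory usage, the samples passed in should probably cover only the period since the last restart.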

jbednar avatar Aug 27 '21 23:08 jbednar