ondemand Switch to using Prometheus for Active Jobs graphs at OSC using chartkick

The job details link can still link to Grafana, but we can use the datasource proxy directly to Prometheus.

The iframe approach is slow, especially for a job that has many nodes. At home with a spotty internet connection I waited several minutes for all the graphs of a job to load.

If you have a 10-node job and 20 iframes (one for each graph), each iframe generates multiple requests to Prometheus via the datasource proxy. Each CPU graph iframe requests cpu_user, cpu_system, and cpu_total, so for 10 nodes that's 30 requests, each of which takes 1-2 seconds.

If we query Prometheus directly and handle the graph generation ourselves, we can combine requests by specifying multiple hosts in a single request, e.g. cgroup:cpu_user_seconds:irate5m{cluster="owens",host=~"o0559|o0581",jobid="11286307"}. For a job with 10 nodes, this would require 3 requests instead of 30. Furthermore, given that summary graphs we generate ourselves will lose the fidelity and interactivity of the Grafana graphs, we could make these graphs smaller and display more on the screen at a time to help with quick comparison.
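A minimal sketch of the request-combining idea, assuming host names contain only letters, digits, and hyphens (so they are safe to join into a regex alternation). The helper names `combined_selector` and `request_count` are hypothetical and only illustrate the arithmetic above.

```python
import re

# Assumption: host names are plain identifiers such as "o0559", so they
# need no regex escaping inside the host=~"..." matcher.
HOST_PATTERN = re.compile(r"^[A-Za-z0-9-]+$")


def combined_selector(metric, cluster, hosts, jobid):
    """Build one PromQL selector that matches every host of a job."""
    for host in hosts:
        if not HOST_PATTERN.match(host):
            raise ValueError(f"unexpected host name: {host!r}")
    host_regex = "|".join(hosts)
    return (
        f'{metric}{{cluster="{cluster}",'
        f'host=~"{host_regex}",jobid="{jobid}"}}'
    )


def request_count(num_nodes, num_metrics, combine_hosts):
    """Requests needed for one job: per node and metric, or per metric only."""
    return num_metrics if combine_hosts else num_nodes * num_metrics


if __name__ == "__main__":
    print(combined_selector(
        "cgroup:cpu_user_seconds:irate5m", "owens", ["o0559", "o0581"], "11286307"
    ))
    print(request_count(10, 3, combine_hosts=False))
    print(request_count(10, 3, combine_hosts=True))
```

With three CPU metrics, the per-iframe approach scales with the node count while the combined selector stays at one request per metric.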

We would retain the details link to the Grafana dashboard for each node. Summary graphs can be generated using the https://chartkick.com/ interface, which can render graphs with the Chart.js and Highcharts backends. Chartkick is a Ruby gem, but there are also implementations in other languages (JS, Python, etc.). It does use Canvas, so we would need to provide tabular data as well for accessibility purposes.
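As a rough illustration of feeding both a chart and an accessible table from the same data, the sketch below converts a Prometheus range-query response (result type "matrix") into the list-of-named-series shape that Chartkick accepts, plus flat rows for a table. The function names and the choice of the `host` label as the series name are assumptions made for this example.

```python
from datetime import datetime, timezone


def to_chartkick_series(response, label="host"):
    """Turn a Prometheus matrix result into [{"name": ..., "data": [[x, y], ...]}]."""
    series = []
    for item in response["data"]["result"]:
        name = item["metric"].get(label, "unknown")
        data = [
            # Prometheus returns [unix_timestamp, "value-as-string"] pairs.
            [datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat(), float(value)]
            for ts, value in item["values"]
        ]
        series.append({"name": name, "data": data})
    return series


def to_table_rows(series):
    """Flatten the chart series into (name, timestamp, value) rows for a table."""
    return [(s["name"], ts, value) for s in series for ts, value in s["data"]]


if __name__ == "__main__":
    sample = {
        "status": "success",
        "data": {
            "resultType": "matrix",
            "result": [
                {"metric": {"host": "o0559"}, "values": [[0, "0.5"], [60, "0.75"]]},
                {"metric": {"host": "o0581"}, "values": [[0, "0.25"]]},
            ],
        },
    }
    for row in to_table_rows(to_chartkick_series(sample)):
        print(row)
```

Rendering the rows as an HTML table next to each chart would give keyboard and screen-reader users the same numbers the Canvas graph shows.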


Sep 29 '20 15:09 ericfranz

Just be warned I'm not liking the fact anonymous queries work with the data proxy URLs so at some point in the future I may turn off that feature if I can figure out how. There's nothing stopping someone from dumping all the data inside Prometheus which is not something we want everyone to be able to do. Whatever solution you come up with might break in the future, so just be warned.

Sep 29 '20 15:09 treydock

You can further condense down executions with a query like this:

{__name__=~"node_disk_(reads|writes)_completed_total",host="p0016"}

That is harder to read and is a bit of black magic, but it allows a single API call to return multiple metrics. The only time this may get you into trouble is if the query returns too many data points. We enforce limits on query size: at most 2 minutes per individual query or 50 million samples (these are the defaults).
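A small sketch of how the multi-metric selector might be sent and unpacked, assuming the standard Prometheus HTTP API endpoint `/api/v1/query_range` with its `query`, `start`, `end`, and `step` parameters. The base URL and helper names are hypothetical. Because a bare selector keeps the `__name__` label on each returned series, the result can be split back into one group per metric.

```python
from urllib.parse import urlencode


def name_regex_selector(metric_regex, **labels):
    """Build a selector that matches several metric names with one regex."""
    matchers = [f'__name__=~"{metric_regex}"']
    matchers += [f'{key}="{value}"' for key, value in labels.items()]
    return "{" + ",".join(matchers) + "}"


def query_range_url(base_url, query, start, end, step):
    """Compose a range-query URL; base_url is a placeholder for the real server."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


def group_by_metric_name(result):
    """Split the combined result list into {metric_name: [series, ...]}."""
    grouped = {}
    for item in result:
        grouped.setdefault(item["metric"]["__name__"], []).append(item)
    return grouped


if __name__ == "__main__":
    selector = name_regex_selector(
        "node_disk_(reads|writes)_completed_total", host="p0016"
    )
    print(selector)
    # "http://prometheus.example" is a made-up host for illustration only.
    print(query_range_url("http://prometheus.example", selector, 0, 3600, 60))
```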

Sep 29 '20 19:09 treydock

Talked with @treydock offline. A server-side query is more likely to keep working after @treydock fixes the issue with clients being able to query the datasource directly, though there is no timeline on that right now. Of course, a direct query means less load on the OnDemand host since the client hits Grafana/Prometheus directly.

A server-side query could also make use of ERB rendering. It might also be more portable to other sites, if that is of interest. We would really need to understand, though, how many data points are generated in a given timespan so that we can specify the correct step and avoid returning too many points. For a summary graph, at a certain point you don't need every detail - that's what Grafana is for. My understanding is that the graphs will display from the start of the job to the end of the job, so this could be 1 hour, 20 hours, etc.
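One way to pick the step could look like the sketch below: derive it from the job's duration so that each series never exceeds a fixed point budget. The budget of 200 points and the 15-second floor are assumptions chosen for illustration, not values from this discussion.

```python
import math


def choose_step(start, end, max_points=200, min_step=15):
    """Return a step (seconds) so one series has at most max_points samples.

    max_points and min_step are illustrative defaults, not site settings.
    """
    if max_points < 2:
        raise ValueError("max_points must be at least 2")
    duration = max(end - start, 0)
    # A range query returns floor(duration / step) + 1 points per series,
    # so divide by (max_points - 1) to stay within the budget.
    return max(min_step, math.ceil(duration / (max_points - 1)))


def point_count(start, end, step):
    """Number of samples per series for a given range and step."""
    return (end - start) // step + 1


if __name__ == "__main__":
    for hours in (1, 20):
        end = hours * 3600
        step = choose_step(0, end)
        print(hours, "hour job:", "step", step, "points", point_count(0, end, step))
```

Multiplying the per-series point count by the number of series in the query (nodes times metrics) gives a rough check against the sample limits mentioned earlier in the thread.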

Sep 29 '20 20:09 ericfranz

Also after doing the work on this, we would have 3 examples - Ganglia, Grafana, and Prometheus graphs and perhaps could come up with a more generic solution for configuring and customizing the per-node views of jobs.

Sep 29 '20 20:09 ericfranz

Another option is instead of replacing the per node iframe charts, we keep them and add two extra panes:

to show all of the CPU graphs from all of the nodes side by side in a grid for a quick view
to show all of the memory graphs from all of the nodes side by side in a grid for a quick view

Sep 30 '20 17:09 ericfranz

We could easily do that with Prometheus queries and Grafana. It just requires a single query that includes all hosts, since each host will produce a separate time series even when coming from the same query.

Sep 30 '20 18:09 treydock

@treydock in response to:

We could easily do that with Prometheus queries and Grafana. It just requires a single query that includes all hosts, since each host will produce a separate time series even when coming from the same query.

Would the Grafana solution put all of the time series on a single graph or put a bunch of graphs side-by-side?

Oct 12 '20 14:10 ericfranz

Note on this issue - talked with @treydock on Slack and he said he can configure Prometheus to accept unauthenticated requests from the OnDemand host so we can do server side Prometheus searches directly.

Oct 12 '20 14:10 ericfranz

Would the Grafana solution put all of the time series on a single graph or put a bunch of graphs side-by-side?

If we query Prometheus using PromQL queries through Prometheus or Grafana, the result is the same since Grafana is just a proxy to Prometheus. If we display graphs from Grafana where Grafana is responsible for rendering the graphs then we can do whatever we want with the data just like if we take the raw timeseries and graph it ourselves.

Oct 12 '20 14:10 treydock

This change would actually improve accessibility, because the iframes currently have controls that add many tab stops, which complicates navigating the details pane with the keyboard.

Nov 10 '20 16:11 ericfranz
