
[WIP] Collect CPU usage and memory consumption periodically over the life of a job

Open natefoo opened this issue 6 years ago • 7 comments

In order to support this, I've added a new metric type, the "file" metric. The reasoning is that this is probably going to be too much data to put into the DB.

The data is stored in a file like so:

#time cpuacct.usage memory.memsw.usage_in_bytes
1524774786 24065921 3096576
1524774791 27750993 3133440
1524774796 33209022 3080192
1524774801 37397392 2990080
1524774806 43464228 2957312
1524774811 51181396 2977792
1524774816 56100369 2949120

Only the time column (UNIX epoch time in UTC) is fixed; the others can appear in a different order or not at all (since cgroups will not always provide them), which is why there's a header.

cpuacct.usage is cumulative CPU time in nanoseconds; in a visualization this might be better viewed as its derivative. memory.memsw.usage_in_bytes will increase and decrease as memory is consumed and freed.
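
A minimal parsing sketch in Python, assuming the whitespace-delimited format above (the function names and file handling are illustrative, not existing Galaxy code):

def parse_metrics(path):
    """Parse a metrics file into {column_name: [values]}, driven by the header."""
    with open(path) as fh:
        header = fh.readline().lstrip("#").split()
        series = {name: [] for name in header}
        for line in fh:
            for name, value in zip(header, line.split()):
                series[name].append(int(value))
    return series

def cpu_rate(series):
    """Convert cumulative cpuacct.usage (ns) into per-interval rates (ns of CPU per wall-clock second)."""
    t, cpu = series["time"], series["cpuacct.usage"]
    return [(cpu[i] - cpu[i - 1]) / (t[i] - t[i - 1]) for i in range(1, len(cpu))]

On the sample above, the first interval works out to (27750993 - 24065921) / 5 ≈ 737,000 ns of CPU per second of wall clock, i.e. well under 1% of one core.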

Questions for people who are good at this kind of stuff:

  1. Is this a good format for the data, with visualization in mind?
  2. Should we create the visualization on the server or the client?
  3. Alternatively, should we just push this data into InfluxDB and allow users to visualize these in Grafana somehow? Sounds like a nightmare to try to map them and ensure privacy, but my knowledge here is limited. (@erasche?)
  4. Will Pulsar stage the script over? I suspect the answer is no? Can I just register it? (@jmchilton?)

If anyone is willing to devote some time for the visualization piece, I'd really appreciate it. I think this could be a killer feature.

TODO

  • [ ] Store the metrics data files compressed (configurable, for people with transparent filesystem compression; default "on"); a transparent-read sketch follows this list
  • [ ] API access to the data
  • [ ] Create a "mark" of some kind between the tool and set_metadata
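
For the compression item above, a minimal sketch of a transparent reader (gzip detection by magic bytes; the helper name is illustrative):

import gzip

def open_metrics(path):
    # gzip streams start with the magic bytes 0x1f 0x8b; fall back to plain text otherwise
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path)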

xref: #5988 #4862

natefoo avatar Apr 26 '18 20:04 natefoo

I think this could be a killer feature.

👍 I second that!

scholtalbers avatar Apr 27 '18 07:04 scholtalbers

Is this a good format for the data, with visualization in mind?

time series stored as csv/tsv on disk is probably the easiest to work with. there are other formats but anything else is going to require more processing/libraries/etc.

Should we create the visualization on the server or the client?

Client would be good. Especially if we're exposing an API. I mean we could write an SVG endpoint but... yeah. javascript all the things I guess.

Alternatively, should we just push this data in to InfluxDB and allow users to visualize these in Grafana somehow? Sounds like a nightmare to try to map them and ensure privacy, but my knowledge here is limited. (@erasche?)

Some unsorted thoughts:

  • grafana is great at realtime data; it's less great for "show me this historical data" (IMO).
    • This is fixed by ensuring that when we construct the graph, we're sure to include the correct time range of the tool run to display. It's kinda funky but it'd work.
    • If we did timestamps from 0 and just set grafana to always show data from 1970, then they'd all overlap nicely and you could see the 'average profile' of the tool, but that's also weird and loses the feature of "show me the composition of the CPU usage at time X".
  • Grafana supports JSON data sources which is nice. We could leverage this feature here:
    • have galaxy expose an API compatible with Grafana's queries
    • embed the graph with 'job id' as a configurable parameter; that parameter gets put into grafana's graph and thus into the requests to galaxy (a sketch of the expected response shape follows this list)
  • But that isn't a general solution. That's great for one galaxy because grafana has to be configured for that. Also that dashboard has to be publicly viewable (but it has no data without us passing in the job id). (You could do fancy things in the data source that would make this more general, but then it'd only apply to publicly available galaxies and that isn't optimal.)
  • I don't think grafana is the right tool for the job for graphing in this case. If admins want to send data there after the fact, absolutely they'll be able to do that from the data files on disk.
  • If we're just loading a graph with a job id in the URL, then there are no privacy implications, since the user would have to know that job id ahead of time to construct the URL. Not like we'd be saying "here are all of the jobs listed by user".
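
For the JSON data source idea above, this is roughly the /query response shape Grafana's SimpleJSON datasource plugin expects, as I understand that plugin (value first, then epoch milliseconds); none of this exists in Galaxy today, and the values are taken from the sample file:

[
  {
    "target": "cpuacct.usage",
    "datapoints": [
      [24065921, 1524774786000],
      [27750993, 1524774791000]
    ]
  }
]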

It's nice for the end users to see these graphs because users love shiny things. But is that the primary purpose of this? My gut feeling says this is important for scheduling concerns, but none of us are to the point yet where we can predict runtime + then say "oh, we have 50 X jobs which will only use 1GB of ram until minute 30 and then use 100GB so we can schedule high-ram things on that node in the meantime." None of us are there.

So I guess that leaves the only practical, short-term application of this as shiny graphs for users, and for admins to maybe have aggregate visualisations of the "average" cpu/mem behaviour of tool X. I'm not sure it'd be actionable data, but it's a nice feature for sure. Something interested users will enjoy.

hexylena avatar Apr 27 '18 07:04 hexylena

@natefoo I basically agree with everything @erasche said about it not really being actionable for now, and that researchers cannot really do much with the data except enjoy that it is pretty. For that reason I'd much rather see you spending your time working on great integrations with some external admin-y product than building charts for researchers in Galaxy. But follow your passion, of course. I will throw this out there: if you made it work with the charts plugin, you'd be doing these two cool things:

  • Making sure the whole charts component is more amenable to non-dataset applications - there may be more of those in Galaxy that could leverage charts (e.g. reports).
  • Building a good API and putting the view and rendering stuff on the frontend - these seem like directions we are trying to take Galaxy anyway.

jmchilton avatar Apr 27 '18 10:04 jmchilton

Thanks for putting such thought and time into your reply @erasche, I agree with all of this as well.

I will add, there's one other potential benefit for admins down the line: some tools have different stages, and these stages consume different amounts of memory/CPU. Some of those tools could potentially be broken up into separate multiply-schedulable steps. Galaxy doesn't support this currently (it has tasks, but no way for tools to interface with them directly). Once the functionality exists, we could see some real benefits in throughput by breaking such tools up. This data will tell us which tools are the best candidates.

Still not actionable now, of course.

@jmchilton, you're right that I don't really want to spend time building charts in Galaxy. But my hope is that if I build some core pieces, it'll allow someone more adept at viz stuff to do that part of it pretty quickly/easily.

What would a good API for serving this data look like? I'm guessing it'd need addressable access to the data? By time or offset? I wonder if, instead of a special new model object, we should store this as a metadata file and create a new datatype for it, so we can reuse all the existing dataset/data provider code that does this stuff?

natefoo avatar Apr 27 '18 14:04 natefoo

Some of those tools could potentially be broken up in to separate multiply-schedulable steps.

oh, definitely. that always seemed like a good idea. e.g. separate out samtools conversion step from other tools. yeah, would be nice to have profiles for the stages.

What would a good API for serving this data look like? I'm guessing it'd need addressable access to the data? By time or offset?

offset would probably be nicer than time for graphing, I would imagine. I guess something like this, but that's pretty basic and maybe not what people writing JS would want? Who knows. Probably unqualified to be replying to that subpoint.

$ curl https://.../api/jobs/deadbeefcafe/metrics
{
  "cpu": [
    [0, 24065921, 3096576],
    [5, 12831082, 1231231]
  ],
  "..."
}
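
A sketch of what a handler behind such an endpoint might do, reusing parse_metrics from the earlier sketch and rebasing timestamps to offsets (this emits one [offset, value] series per metric, a slight variation on the shape above; nothing here is an existing Galaxy API):

import json

def metrics_response(series):
    """Rebase timestamps to seconds-from-start; one [offset, value] series per metric."""
    t0 = series["time"][0]
    offsets = [t - t0 for t in series["time"]]
    return json.dumps({
        name: [[o, v] for o, v in zip(offsets, values)]
        for name, values in series.items()
        if name != "time"
    })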

hexylena avatar Apr 27 '18 15:04 hexylena

Some of those tools could potentially be broken up in to separate multiply-schedulable steps.

oh, definitely. that always seemed like a good idea. e.g. separate out samtools conversion step from other tools.

This could be implemented at the tool level by introducing a <commands> element that could contain multiple <command> elements.
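
Purely as a sketch of that proposal (this syntax is hypothetical; no such schema exists in Galaxy's tool XML today):

<tool id="example_mapper" name="Example mapper">
  <commands>
    <!-- each command could become a separately schedulable step -->
    <command>bwa mem ... > unsorted.sam</command>
    <command>samtools sort ... unsorted.sam</command>
  </commands>
</tool>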

nsoranzo avatar May 01 '18 14:05 nsoranzo

@nsoranzo: @jxtx proposed this ~10 years ago. We'll get to it very soon I'm sure. ;)

natefoo avatar May 03 '18 14:05 natefoo