docs Add Open OnDemand exporter

Adds an Open OnDemand (http://openondemand.org/) exporter that collects application-specific metrics. This is an NSF-funded application typically deployed by HPC centers.

Nov 06 '20 19:11 treydock

Thanks for your PR. This exporter seems like an uberexporter, which is not the recommended pattern for exporters, as it includes an Apache exporter and we already list one of those. An exporter should target just one application, and avoid hardcoded business logic (e.g. https://github.com/OSC/ondemand_exporter/blob/master/collectors/apache.go#L173) and assumptions about the environment (e.g.). I'm seeing similar for nginx and passenger.

Can you explain the architecture of what this is meant to be monitoring? I'm not seeing an "OnDemand" process which this is monitoring, this all seems very deployment specific and better handled by more targeted exporters for the various component systems.

Nov 06 '20 21:11 brian-brazil

Calling this a sort of uberexporter is correct but it was done out of necessity.

So the Apache exporter doesn't scrape actual connection details in a way that distinguishes different types of connections. In OnDemand some Apache connections are sockets and some are not, and this exporter makes that distinction. There are counts of client connections that match specific patterns for OnDemand instances, and counts of OnDemand websocket connections. There are specific base URIs OnDemand uses, and one thing I had wanted to do was split those out into individual metrics, so match /rnode/ and make it a metric, etc. I can't imagine how to do that with the generic Apache exporter.
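As a rough illustration of the "split base URIs out into individual metrics" idea, here is a minimal Python sketch. It is not how the exporter is implemented (the exporter is written in Go). It assumes the request URIs have already been obtained from somewhere such as Apache's status output. Only /rnode/ comes from this thread; the other prefixes and the function name `count_by_base_uri` are hypothetical placeholders.

```python
# Hypothetical base URIs. Only /rnode/ is named in this thread; the other
# two are placeholders for illustration.
BASE_URIS = ("/rnode/", "/node/", "/pun/")


def count_by_base_uri(request_uris, base_uris=BASE_URIS):
    """Count requests per base URI prefix; unmatched requests go to 'other'."""
    counts = {prefix: 0 for prefix in base_uris}
    counts["other"] = 0
    for uri in request_uris:
        for prefix in base_uris:
            if uri.startswith(prefix):
                counts[prefix] += 1
                break
        else:
            # No prefix matched this request.
            counts["other"] += 1
    return counts


sample = [
    "/rnode/host1/8080/",
    "/pun/sys/dashboard",
    "/pun/sys/files",
    "/server-status",
]
print(count_by_base_uri(sample))
```

Each key in the returned dictionary could then be exposed as a label value on a single connection-count metric, which is the kind of application-specific breakdown a generic Apache exporter has no reason to know about.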

The NGINX and Passenger exporters won't work for this because of the OnDemand architecture. OnDemand runs the web service as individual users (Per-User-NGINX or PUN). Each PUN belongs to the UID of the person who logged in, so if I log into OnDemand through Apache and my username is treydock, the owner of the entire web process stack (nginx, passenger, etc.) for my session will be my treydock user. Apache is mostly just a proxy for authentication and for proxying web traffic from sockets. There is no central HTTP endpoint we can scrape, because what a URL like /metrics resolves to depends entirely on who is logged in, as identified by Apache REMOTE_USER. So /metrics for one user would be different from /metrics for another user, and each user's session has no way to know about another user's session. The passenger exporter also won't work because there is no central passenger process: the passenger instances are per-user, part of the PUN, and only exist until the user has been inactive for 2 hours, after which they are killed.

I had thought about expanding the Apache exporter to give more insight into connections, but that seemed too specific to the OnDemand application. Making the Passenger exporter work for this architecture would also be a breaking change to that exporter, to make it do something very non-standard. We've had interest in the OnDemand community in providing something like /metrics endpoints to scrape, but again, because of the PUN architecture that's just not possible. I think the only exporter that could possibly replace some of this functionality is the Process exporter, but only if a site deploys a very specific config. Even then it wouldn't fully work, because this exporter checks the UIDs of all running PUNs and then collects process metrics for all processes belonging to those UIDs, which is not possible with the Process exporter. The best we can do with the Process exporter is group PUN processes by matching process patterns, which I do with this:

---
process_names:
- name: ood-pun
  comm:
  - nginx
  - Passenger
  - Passenger NodeA
  - PassengerAgent
  - ruby

But this would not catch any user-developed apps, which could introduce a new process pattern, for example if a user writes a Python application that is then deployed on OnDemand. Any processes spawned by apps that don't match the predefined patterns would also be missed, whereas they would be caught if you match by the UID of whoever is running the PUN.
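To make the difference between the two approaches concrete, here is a small Python sketch over an in-memory process list. It is only an illustration and not the exporter's actual code: the `Proc` record, the helper names, and the sample PIDs and UIDs are all hypothetical, and a real collector would read this data from the operating system instead.

```python
from collections import namedtuple

# Hypothetical process record: process ID, owner UID, and command name.
Proc = namedtuple("Proc", "pid uid comm")

# Command names taken from the process-exporter config above.
PUN_COMMS = {"nginx", "Passenger", "Passenger NodeA", "PassengerAgent", "ruby"}


def select_by_comm(procs, comms=PUN_COMMS):
    """Pattern approach: keep processes whose command name is predefined."""
    return [p for p in procs if p.comm in comms]


def select_by_uid(procs, pun_uids):
    """UID approach: keep every process owned by a user with a running PUN."""
    return [p for p in procs if p.uid in pun_uids]


procs = [
    Proc(100, 1001, "nginx"),
    Proc(101, 1001, "PassengerAgent"),
    Proc(102, 1001, "python3"),  # user-developed app, not in PUN_COMMS
    Proc(200, 0, "sshd"),        # unrelated system process
]
pun_uids = {1001}  # assumed to come from the list of running PUNs

print([p.pid for p in select_by_comm(procs)])
print([p.pid for p in select_by_uid(procs, pun_uids)])
```

In this sample the name-based selection misses the `python3` process, while the UID-based selection picks it up because it belongs to a user with a running PUN.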

Sorry for such a verbose explanation. OnDemand is a unique beast and it takes a bit of explaining to make it clear why this exporter exists.

Nov 06 '20 22:11 treydock

That's quite a weird one as this is more a set of applications configured in a particular way, rather than a single component. You're also the vendor here.

Is there some form of central control server (some equivalent of the Kubernetes APIs), or is this relying entirely on standard posix processes/users? How would you gather this sort of information today, what sort of tooling is there?

How many deployments of this exist? Part of the idea of this list is that you're running some 3rd party system, and then want to deploy an exporter for it because it has no provision for Prometheus metrics out of the box. I get the sense that usage of this is very tight knit, such that mentioning it in your own docs may be sufficient.

Nov 06 '20 22:11 brian-brazil

That's quite a weird one as this is more a set of applications configured in a particular way, rather than a single component. You're also the vendor here.

Yes, OnDemand is pretty much a series of individual applications working together. We tie in Apache, NGINX, Passenger and a series of Ruby or NodeJS applications.

Is there some form of central control server (some equivalent of the Kubernetes APIs), or is this relying entirely on standard posix processes/users? How would you gather this sort of information today, what sort of tooling is there?

The PUN architecture, where everything is a standard posix process running as the user who is logged in, makes central applications rather difficult. There are tools for collecting a list of running PUNs, which this exporter uses to know which processes to look at, but other than that there really aren't central tools to manage the install, aside from a tool to set up Apache and a tool to manage PUNs. In theory some of what this exporter does could be moved into the PUN management code but that would only be able to replace possibly the process information and maybe passenger metrics, and would essentially be embedding this exporter into that application.

How many deployments of this exist? Part of the idea of this list is that you're running some 3rd party system, and then want to deploy an exporter for it because it has no provision for Prometheus metrics out of the box. I get the sense that usage of this is very tight knit, such that mentioning it in your own docs may be sufficient.

I would say several hundred deployments, mostly at academic centers or places running High Performance Computing. We are geared towards serving HPC, but I think NVIDIA and a few other major companies are working to offer OnDemand as a sort of web portal into products they offer. If this is too niche or not broad enough a product, then I think it's fine to close this out and I can just add this to our product documentation.

Nov 06 '20 23:11 treydock

In theory some of what this exporter does could be moved into the PUN management code but that would only be able to replace possibly the process information and maybe passenger metrics, and would essentially be embedding this exporter into that application.

Would that make sense architecturally, or would it more be wedging it in for the sake of it? I'd kind of expect this sort of thing to be part of management code.

If this is too niche or not broad enough a product then I think fine to close this out and I can just add this to our product documentation.

Nicheness isn't really part of the bar (and a few hundred organisations using this class of application would be well above such a bar anyway), it appearing to be an uberexporter is more the issue as at the surface level it looks like three independent exporters for standard applications that happen to live inside one binary. I'll need to think on this for a bit I think.

Nov 06 '20 23:11 brian-brazil

One thought: Why is this a separate thing, rather than being included in the rpm with everything else OnDemand installs and sets up? If that were the case users would have metrics out of the box, and this could go under the software exposing metrics list.

Nov 07 '20 10:11 brian-brazil

That's an idea I'll have to bring up with our team, to see if we want to bundle this exporter with the RPMs we provide.

Nov 07 '20 18:11 treydock

So far I've gotten positive feedback from folks on the OnDemand project about packaging the exporter as an RPM we provide to people deploying OnDemand. I'll have to build the necessary RPMs and get those pushed to our repos and our documentation updated to reflect the new capabilities. I'll update this pull request with that link once the documentation is uploaded.

Nov 10 '20 14:11 treydock

Sounds good.

Nov 10 '20 15:11 brian-brazil

I've made the Prometheus exporter that interfaces with Open OnDemand into an RPM and have some initial documentation to illustrate how this works here: https://github.com/OSC/ood-documentation/pull/438/files. Hopefully this looks sufficient to be considered software exposing metrics. We typically don't put things into the main ondemand RPM unless they are part of the core OnDemand code, so things like the default authentication we ship are separate RPMs but all from the same repo.

Feb 20 '21 16:02 treydock