prometheus-slurm-exporter
Is this still maintained?
I see that the last commit to main was in March of 2022. I also see a lot of outstanding PRs. Does this mean the repo is not maintained anymore? Is there a dependable fork to rely on?
We definitely have less time than in the past to contribute and to integrate the latest submitted PRs.
Since it may be useful as a reference for further requests, I will highlight the major issues here:
- When we initially developed this exporter, we had only a single production cluster to work with, and it ran an extremely old version of Slurm (14.x). Things changed roughly three years ago when we switched to a more recent version (18.x), although by now that is also quite dated, especially in certain respects (e.g. GPU support).
- Aside from (sometimes) adjusting the code in the PRs to better fit the main code base, the main problem is that we lack a consistent test environment. It is relatively easy to set up a virtual Slurm cluster, but that is usually not sufficient to test every PR consistently. Short of testing in production, we have to (as we have done so far) implicitly trust the user's PR and assume that it will work without breaking things for other users. (Of course, checking that the code compiles into a binary and passes the tests is trivial.)
- Which brings us to the current status: there are now two major branches. The main branch is guaranteed to work with Slurm versions up to 18.x, and the development branch is recommended for users of newer Slurm versions (between 19.x and 21.x).
- My personal opinion is that this exporter is becoming more and more difficult to develop: everything is based on correctly parsing the output of multiple command-line utilities (e.g. squeue, sinfo, etc.). Even with regular expressions, we have been forced more than once to amend the code and introduce corrections to cope with that output. As more people ask for more features, I am not sure this approach will hold up.
- Further development of Slurm monitoring will most likely be based on the REST API provided by SchedMD in the latest version of Slurm.
I am planning to do a last round of PR integration in the coming weeks, but I expect that sooner or later some of the forks will be further ahead of us.
I see. That explains a lot. We recently had to modify the exporter for our purposes and found it more cumbersome than we would've liked, especially since there is both a C and a REST API. I have a couple of thoughts on how the code could be restructured to make adding and removing a subset of features far more modular. We came to the same conclusion about tests as well.
Thank you for your contributions. Maintaining an open-source project is never easy and I appreciate all the hard work. I completely understand that no one can maintain a library forever and that priorities change.
In the meantime, if anyone has a fork that has already incorporated the changes above, I'd love to take a look. If we end up maintaining our own version we will pin the fork here.
I hope to take on the challenge of converting this exporter to use the REST API early/mid next year. If someone else gets to it, I'm happy to use their implementation; otherwise I'll follow up here when it's in a workable state. I don't plan to add backwards compatibility, as I will be writing it against the newest Slurm version.
Any updates on this? We have just added Slurm to our arsenal and would love the amazing overview that the dashboards would allow us to build in Grafana.
I implemented an exporter that implements most of the features of this exporter. No GPU or scheduler stats as our company has no use for them, but we implement pretty much everything else. We plan on open-sourcing it in the next week.
sounds interesting!
Hi guys, we are actively maintaining a JSON-based and, hopefully, more maintainable/tested/testable fork here: rivosinc/prometheus-slurm-exporter. It's a complete rewrite, forked only to show history. Feel free to contribute. Our next steps are adding JSON-based licensing support and implementing some interfaces for slurmrestd support, since the same OpenAPI plugin is used for both the CLI and restd. We will publish a Grafana template soon.
It comes with some extra goodies like client-side throttling, job tracing, and more, but it doesn't yet implement things like GPU support, fairshare, or daemon stats.
The repository is empty ATM.
Yeah, sorry about that. We have to go through an OSS review process, so I had to briefly take it down. It should be up again momentarily. I apologize.
Howdy guys, the exporter was cleared and is back up. I will release a default template dashboard soon as well. It's the first Go project that I have contributed to from scratch. Feel free to open an issue if you think things can be written better, nitpicks included. I would love any feedback.