prometheus-slurm-exporter icon indicating copy to clipboard operation
prometheus-slurm-exporter copied to clipboard

Added logging of Slurm subprocess failures and testing for all parsing functions

Open Rovanion opened this issue 4 years ago • 4 comments

Hi, Over the last two weeks I've been working on improving the error messages and testing of the Prometheus Slurm Exporter. I found that if any of the subprocesses failed the main go process terminated, only printing exit 1 before doing so; implying that the go process itself exited with status 1 when it was in fact the subprocess doing so.

I think that I have gone over all parts of the project that execute a subprocess and parses the text outputted by that subprocess. I have refactored these to use a common function/procedure for starting these subprocesses and reporting errors should they fail. I have also refactored them to have a pure function for parsing the output of the subprocess and have added unittests for all of these pure functions. I have also added system tests for the impure functions that execute sinfo, squeue and so on.

Over all the result is that there are now unit tests for all parser functions and system tests for all the subprocess launching functions. The subprocess launching functions now report the stderr of the subprocess should they fail.

This PR also encompasses the two previous PR's for fixing the formatting and splitting the old tests into system and unit tests. It was too much work rebasing it on master and going through every single commit to undo the autoformatting.

Why did I do all this? Because it was impossible to figure out why the system tests failed without this work and I needed to know this in order to write system tests for Nix. It was also the case that the unit tests would never fail since they never asserted anything.

Rovanion avatar Mar 02 '21 10:03 Rovanion

Is there anything that I can do to help this PR get merged @vpenso?

Rovanion avatar Mar 30 '21 07:03 Rovanion

This branch is starting to bitrot. As the author I obviously think it brings with it significant improvements and would suggest I rebase it on master and you merge it in before it rots even more. The value of code deduplication and having actual tests for the codebase should outweigh the risk of introduced bugs.

Rovanion avatar Jul 21 '21 07:07 Rovanion

@Rovanion : yes, if you want to rebase your PR on the current master branch then now is probably a good time since in August it is unlikely that we will add/modify the code base. Note that the next overhaul may be in the near future when we will integrate our new GPU nodes.

One advice: if possible, try to remove entirely from your PR any kind of formatting adjustments of the source code, since usually it tends to make the review of external contributions quite 'painful' to disentangle from the actual enhancement.

The value of code deduplication and having actual tests for the codebase should outweigh the risk of introduced bugs.

Well, having more tests will help but keep in mind that this exporter is targeting HPC clusters of different size and complexity, and usually sysadmins in charge take a very conservative approaches in terms of updating something that it works. And nobody like to have a piece of their monitoring falling on itself or starting to report wrong stats.

mtds avatar Jul 30 '21 07:07 mtds

I, erm, did not end up rebasing but merging in the changes from vpenso/master and manually reverted almost all style changes I could find; some remaining as I couldn't find the cause of the diff. I hope it will be a lighter read now.

We're hoping to put this into production on our HPC cluster this fall after upgrading to Slurm 20.11. I think we both want the same thing here, reliable monitoring of our clusters. I just value testing and code reuse more than absence of code changes.

Is there anything more that I can help with to get these changes merged?

Rovanion avatar Aug 04 '21 20:08 Rovanion