
Slurm exporter crashes on Slurm 20.11.8

Open DImuthuUpe opened this issue 3 years ago • 15 comments

When I try to run curl http://localhost:8080/metrics on the latest build of the exporter, I see the following error message. Is there a fix for this?

panic: runtime error: index out of range [4] with length 4

goroutine 12 [running]:
main.ParseNodeMetrics(0xc0003c6000, 0x1f9, 0x600, 0x0)
        /opt/prometheus-slurm-exporter/node.go:56 +0x6d6
main.NodeGetMetrics(0x0)
        /opt/prometheus-slurm-exporter/node.go:40 +0x2a
main.(*NodeCollector).Collect(0xc0000ab710, 0xc0001a2660)
        /opt/prometheus-slurm-exporter/node.go:128 +0x37
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
        /root/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:443 +0x12b
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
        /root/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:454 +0x5ce
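The panic indicates that ParseNodeMetrics indexes a fifth field on a line that only contains four. A minimal sketch of the failure mode and a bounds check that would avoid it (illustrative only, assuming whitespace-split parsing; not the exporter's actual node.go):

// Illustrative only: indexing fields[4] panics with
// "index out of range [4] with length 4" when a sinfo line
// has fewer columns than expected.
package main

import (
	"fmt"
	"strings"
)

func main() {
	line := "node01 0 756878 0/48/0/48" // state column missing: only 4 fields
	fields := strings.Fields(line)
	if len(fields) < 5 {
		// Guarding the access avoids the panic from the stack trace above.
		fmt.Printf("skipping line with %d fields: %q\n", len(fields), line)
		return
	}
	fmt.Println("state:", fields[4])
}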

DImuthuUpe avatar Jan 23 '22 01:01 DImuthuUpe

Are you using -gpus-acct? I found that gpus.go needs to be updated.

- --format=Allocgres
+ --format=AllocTRES

Here's the error if you run the sacct command as is:

> sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2
sacct: fatal: AllocGRES is deprecated, please use AllocTRES
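For reference, a hedged sketch of what the corrected invocation could look like in gpus.go (the exporter's real argument handling may differ):

// Hypothetical sketch of the corrected sacct call; gpus.go may
// structure this differently.
package main

import (
	"fmt"
	"log"
	"os/exec"
)

func main() {
	cmd := exec.Command("sacct", "-a", "-X",
		"--format=AllocTRES", // was --format=Allocgres; AllocGRES is deprecated
		"--state=RUNNING", "--noheader", "--parsable2")
	out, err := cmd.Output()
	if err != nil {
		log.Fatalf("sacct failed: %v", err)
	}
	fmt.Print(string(out))
}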

wmoore28 avatar Jan 31 '22 20:01 wmoore28

If you are using --gpus-acct try #73.

itzsimpl avatar Mar 04 '22 16:03 itzsimpl

Hi all,

I am having the same issue. I tried both solutions (@wmoore28's and @itzsimpl's). My Slurm version is 21.08.5.

Does anyone have a workaround for the out-of-range panic? The process/service is killed by it.

P.S.: go test *.go runs fine for all tests.

Thanks

JaderGiacon avatar Mar 22 '22 17:03 JaderGiacon

Reviewing the error listed in the first post, this has nothing to do with gpus.go, but with node.go. Try running the command sinfo -h -N -O NodeList,AllocMem,Memory,CPUsState,StateLong on your system and examine the output. This is what node.go executes and then parses. Compare your output with what is in sinfo_mem.txt.
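If it helps, here is a hedged diagnostic sketch (not the exporter's code) that runs the same sinfo command and flags any line whose field count would trip a five-column parser:

// Diagnostic sketch, assuming node.go expects five whitespace-separated
// fields per sinfo line; prints any line that deviates.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
)

func main() {
	out, err := exec.Command("sinfo", "-h", "-N", "-O",
		"NodeList,AllocMem,Memory,CPUsState,StateLong").Output()
	if err != nil {
		log.Fatalf("sinfo failed: %v", err)
	}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		if n := len(strings.Fields(line)); n != 5 {
			fmt.Printf("unexpected field count %d: %q\n", n, line)
		}
	}
}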

itzsimpl avatar Mar 22 '22 19:03 itzsimpl

This fixed it in my environment. I can send a PR if it fixes the issue globally: https://github.com/DImuthuUpe/prometheus-slurm-exporter/commit/3adffd8d475b52081d7e8172fd8611c8769c0744

DImuthuUpe avatar Mar 23 '22 00:03 DImuthuUpe

Hi All,

Thanks for the quick reply. Here is the output of sinfo -h -N -O NodeList,AllocMem,Memory,CPUsState,StateLong (just some entries, because we also have hundreds of instances in the cloud). Please note that the fields are separated by a tab, rendered here as a space:

cloudsmp-r548-40 0 756878 0/48/0/48 idle~
cloudsmp-r548-41 0 756878 0/48/0/48 idle~
cloudsmp-r548-42 0 756878 0/48/0/48 idle~

node02 386016 386385 48/0/0/48 allocated
node03 386000 386385 48/0/0/48 allocated
node03 386000 386385 48/0/0/48 allocated
node03 386000 386385 48/0/0/48 allocated

Comparing with sinfo_mem.txt, the output seems to match (5 columns).

The second test was to run @DImuthuUpe's command: sinfo -h -N -O "NodeList: ,AllocMem: ,Memory: ,CPUsState: ,StateLong:"

The only difference from the previous command is that fields are separated by a single space instead of a tab (see the note after the output below):

cloudsmp-r548-37 0 756878 0/48/0/48 idle~
cloudsmp-r548-38 0 756878 0/48/0/48 idle~
cloudsmp-r548-39 0 756878 0/48/0/48 idle~
cloudsmp-r548-40 0 756878 0/48/0/48 idle~

node01 135744 386385 37/11/0/48 mixed
node01 135744 386385 37/11/0/48 mixed
node01 135744 386385 37/11/0/48 mixed
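The separator difference only matters if the parser splits on a fixed delimiter rather than arbitrary whitespace; a minimal illustration (an assumption about the parsing approach, not the exporter's actual code):

// Why tab- vs space-separated sinfo output can change the parsed field
// count under a fixed-delimiter split; strings.Fields is immune because
// it collapses any run of whitespace.
package main

import (
	"fmt"
	"strings"
)

func main() {
	tabLine := "node01\t135744\t386385\t37/11/0/48\tmixed"
	fmt.Println(len(strings.Split(tabLine, " "))) // 1: no single spaces to split on
	fmt.Println(len(strings.Fields(tabLine)))     // 5: splits on any whitespace
}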

I have updated node.go as suggested by @DImuthuUpe, compiled, and ran it in the foreground:

bin/prometheus-slurm-exporter -gpus-acct
INFO[0000] Starting Server: :8080    source="main.go:59"
INFO[0000] GPUs Accounting: true     source="main.go:60"
FATA[0026] exit status 1             source="gpus.go:101"

Then I amended the sacct command as suggested by @wmoore28 and it worked fine.

In summary, there are two changes:

1. node.go, as suggested by @DImuthuUpe
2. gpus.go, for those who use the -gpus-acct parameter, as suggested by @wmoore28

A good improvement would be an automated test of the sacct command (in this case for GPUs), catching errors like this one (see the sketch below):

sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2
sacct: fatal: AllocGRES is deprecated, please use AllocTRES
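A hedged sketch of such a test (the flags mirror this thread; the repo's real test layout may differ). It would live in a _test.go file:

// Hypothetical regression test: fail fast if sacct rejects the format flag.
package main

import (
	"os/exec"
	"strings"
	"testing"
)

func TestSacctFormatFlag(t *testing.T) {
	cmd := exec.Command("sacct", "-a", "-X", "--format=AllocTRES",
		"--state=RUNNING", "--noheader", "--parsable2")
	out, err := cmd.CombinedOutput()
	if err != nil && strings.Contains(string(out), "deprecated") {
		t.Fatalf("sacct rejected the format flag: %s", out)
	}
}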

Thanks everyone for the quick help. I will keep an eye out for new versions here with the fixes applied. I am pretty sure @vpenso will find a way to keep backward compatibility across Slurm versions.

Thanks again!

JaderGiacon avatar Mar 23 '22 11:03 JaderGiacon

@JaderGiacon I have integrated the fix from @DImuthuUpe, and also patched gpus.go, as there was an issue when gres does not use a gpuType. Could you perhaps test whether it works for you?
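For context, AllocTRES entries can carry a GPU count with or without a type, e.g. gres/gpu=4 versus gres/gpu:a100=4; a hedged sketch of handling both forms (an illustration of the issue, not the actual patch):

// Illustrative parse of a gres/gpu entry from AllocTRES, with and
// without a gpuType; the real gpus.go patch may differ.
package main

import (
	"fmt"
	"strings"
)

func gpuCount(tres string) string {
	for _, part := range strings.Split(tres, ",") {
		// Matches both "gres/gpu=N" and "gres/gpu:type=N".
		if strings.HasPrefix(part, "gres/gpu") {
			if kv := strings.SplitN(part, "=", 2); len(kv) == 2 {
				return kv[1]
			}
		}
	}
	return "0"
}

func main() {
	fmt.Println(gpuCount("cpu=48,mem=386385M,gres/gpu=4"))      // 4
	fmt.Println(gpuCount("cpu=48,mem=386385M,gres/gpu:a100=4")) // 4
}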

itzsimpl avatar Mar 24 '22 10:03 itzsimpl

Thanks, @itzsimpl for integrating the fix

DImuthuUpe avatar Mar 24 '22 14:03 DImuthuUpe

Hi @itzsimpl,

I have compiled the version from your GitHub repo (https://github.com/itzsimpl/prometheus-slurm-exporter) and it works fine. The process is no longer killed and the GPU information comes through correctly. The go test also passes.

Thank you so much for it!

JaderGiacon avatar Mar 25 '22 14:03 JaderGiacon

@all: I have merged the updated PR #73 into the development branch.

The master branch will be kept backward compatible with Slurm versions up to 18.x.

mtds avatar Mar 29 '22 13:03 mtds

Hello! Is there any news or a plan for when this fix from the development branch will be released on the master/main branch?

Sebbo94BY avatar Jun 22 '22 14:06 Sebbo94BY

@Sebi94nbg -- @vpenso and @mtds seem to have gone offline. Another bug fix, for 20+ character node names, whose PR was offered in May, has also not been merged into main. 🤷

gbeyer3 avatar Jul 12 '22 19:07 gbeyer3

> Hello! Is there any news or a plan for when this fix from the development branch will be released on the master/main branch?

Possibly, choosing the name development for the other branch was not the right call. PR #73 was not merged into master because we would like to keep master backward compatible with the Slurm version we are still using (18.x), and this PR is not meant for old versions.

Additional PRs that require new and/or updated Slurm functionality will be merged only into the development branch. If you are using a newer version of Slurm (20.x onwards), our suggestion is to skip the master branch and use only development.

mtds avatar Jul 14 '22 12:07 mtds

Thanks for your feedback and information.

Unfortunately, a proper branch name alone is not enough here.

We need a tagged release so that we can pin a specific version of this project and ensure the exporter is always the same version. Simply using the branch could lead to different installations and behaviours whenever someone pushes or merges changes into it.

That's why, for example, we currently use this branch but pin a specific commit to avoid deploying different versions.

Currently, you're simply incrementing your release version. What about these release tags?

  • Releases 0.x => Slurm 18.x
  • Releases 1.x => Slurm 20.x

Then you would keep backward compatibility, and users could decide which version/release they need or want.

In addition, you may also want to rename the development branch to something like slurm-20.x.

Sebbo94BY avatar Jul 17 '22 18:07 Sebbo94BY

I still have the same problem with Slurm 22.05.5.1. The exporter crashes after a few moments.

imoes avatar May 10 '23 11:05 imoes