
gpu version

Open titansmc opened this issue 3 years ago • 25 comments

Hi, I am testing the latest version and the GPU info does not seem to be accurate. How can I start debugging?

# TYPE slurm_gpus_alloc gauge
slurm_gpus_alloc 21
# HELP slurm_gpus_idle Idle GPUs
# TYPE slurm_gpus_idle gauge
slurm_gpus_idle -21
# HELP slurm_gpus_total Total GPUs
# TYPE slurm_gpus_total gauge
slurm_gpus_total 0
# HELP slurm_gpus_utilization Total GPU utilization
# TYPE slurm_gpus_utilization gauge
slurm_gpus_utilization +Inf

Cheers.

titansmc avatar Feb 08 '21 11:02 titansmc

The GPU module is basically a wrapper which parses the output of the following commands:

Allocated GPUs

sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2

Total available GPUs

sinfo -h -o "%n %G"

What do these commands report to you? Since the total number of GPUs reported in your case is 0, it is not surprising that the GPU utilization metric (which the Go module calculates as 'allocated' divided by 'total' GPUs) goes to infinity. This also explains why 'Idle GPUs' is negative: it is evaluated as the total minus the allocated GPUs.

If the Slurm commands report the same results that the exporter is showing, then there is something in your configuration that has to be verified (possibly in the commands we are using as well). If that's not the case, then there is something wrong in the logic of this module and we have to look deeper into it.
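
For reference, a minimal Go sketch of the arithmetic described above (not the exporter's actual code) reproduces exactly the values from the report:

package main

import "fmt"

// Sketch of the derivation: idle = total - alloc, utilization = alloc / total.
// With total == 0, the float division yields +Inf and idle goes negative,
// which matches the metrics pasted above.
func main() {
	alloc, total := 21.0, 0.0
	idle := total - alloc
	utilization := alloc / total
	fmt.Println(alloc, idle, total, utilization) // 21 -21 0 +Inf
}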

mtds avatar Feb 08 '21 12:02 mtds

The first command just outputs a bunch of blank lines, with a few entries like these in between:




gpu:2




gpu:1





About the second one:

[root@~]# sinfo -h -o "%n %G"
sb01-13 tmp:844G
sb01-01 tmp:844G
sb01-02 tmp:844G
sb01-03 tmp:844G
sb01-04 tmp:844G
sb01-05 tmp:844G
sb01-06 tmp:844G
sb01-07 tmp:844G
sb01-08 tmp:844G
sb01-09 tmp:844G
sb01-10 tmp:844G
sb01-11 tmp:844G
sb01-12 tmp:844G
sb01-14 tmp:844G
sb01-15 tmp:844G
sb01-16 tmp:844G
sb01-17 tmp:844G
sb01-18 tmp:844G
sb01-19 tmp:844G
sb01-20 tmp:844G
sb02-01 tmp:844G
sb02-02 tmp:844G
sb02-03 tmp:844G
sb02-04 tmp:844G
sb02-05 tmp:844G
sb02-06 tmp:844G
sb02-07 tmp:844G
sb02-08 tmp:844G
sb02-09 tmp:844G
sb02-10 tmp:844G
sb02-11 tmp:844G
sb02-12 tmp:844G
sb02-13 tmp:844G
sb02-14 tmp:844G
sb02-15 tmp:844G
sb02-16 tmp:844G
sb02-17 tmp:844G
sb02-18 tmp:844G
sb02-19 tmp:844G
sb02-20 tmp:844G
sb03-02 tmp:467G
sb03-03 tmp:467G
sb03-04 tmp:467G
sb04-02 tmp:467G
sb04-03 tmp:467G
sb04-04 tmp:467G
sb04-05 tmp:467G
sb04-06 tmp:467G
sb04-07 tmp:467G
sb04-08 tmp:467G
sb04-09 tmp:467G
sb04-10 tmp:467G
sb04-11 tmp:467G
sb04-12 tmp:467G
sb04-13 tmp:467G
sb04-14 tmp:467G
sb04-15 tmp:467G
sb04-16 tmp:467G
sb04-17 tmp:467G
sb04-18 tmp:467G
sb04-19 tmp:467G
sb04-20 tmp:467G
sb05-02 tmp:467G
sb05-03 tmp:467G
sb05-04 tmp:467G
sb05-05 tmp:467G
sb05-06 tmp:467G
sb05-07 tmp:467G
sb05-08 tmp:467G
sb05-09 tmp:467G
sb05-10 tmp:467G
sb05-11 tmp:467G
sb05-12 tmp:467G
sb05-13 tmp:467G
sb05-14 tmp:467G
sb05-15 tmp:467G
sb05-16 tmp:467G
sb05-17 tmp:467G
sb05-18 tmp:467G
sb05-19 tmp:467G
sb05-20 tmp:467G
sm-epyc-01 tmp:7571G
sm-epyc-02 tmp:9400282M
sm-epyc-03 tmp:9400282M
sm-epyc-04 tmp:9400282M
sm-epyc-05 tmp:9400282M
smer01-1 tmp:203G
smer01-2 tmp:203G
smer01-3 tmp:203G
smer01-4 tmp:203G
smer02-1 tmp:203G
smer02-2 tmp:203G
smer02-3 tmp:203G
smer02-4 tmp:203G
smer03-1 tmp:203G
smer03-2 tmp:203G
smer03-3 tmp:203G
smer03-4 tmp:203G
smer04-1 tmp:203G
smer04-2 tmp:203G
smer04-3 tmp:203G
smer04-4 tmp:203G
smer05-1 tmp:203G
smer05-2 tmp:203G
smer05-3 tmp:203G
smer05-4 tmp:203G
smer06-1 tmp:203G
smer06-2 tmp:203G
smer06-3 tmp:203G
smer06-4 tmp:203G
smer07-1 tmp:203G
smer07-2 tmp:203G
smer07-3 tmp:203G
smer07-4 tmp:203G
smer08-1 tmp:203G
smer08-2 tmp:203G
smer08-3 tmp:203G
smer08-4 tmp:203G
smer09-1 tmp:203G
smer09-2 tmp:203G
smer09-3 tmp:203G
smer09-4 tmp:203G
smer10-1 tmp:203G
smer10-2 tmp:203G
smer10-3 tmp:203G
smer10-4 tmp:203G
smer11-1 tmp:203G
smer11-2 tmp:203G
smer11-3 tmp:203G
smer11-4 tmp:203G
smer12-1 tmp:203G
smer12-2 tmp:203G
smer12-3 tmp:203G
smer12-4 tmp:203G
smer13-1 tmp:203G
smer13-2 tmp:203G
smer13-3 tmp:203G
smer13-4 tmp:203G
smer14-1 tmp:203G
smer14-2 tmp:203G
smer14-3 tmp:203G
smer14-4 tmp:203G
smer15-1 tmp:203G
smer15-2 tmp:203G
smer15-3 tmp:203G
smer15-4 tmp:203G
smer16-1 tmp:203G
smer16-2 tmp:203G
smer16-3 tmp:203G
smer16-4 tmp:203G
smer17-1 tmp:203G
smer17-2 tmp:203G
smer17-3 tmp:203G
smer17-4 tmp:203G
smer18-1 tmp:203G
smer18-2 tmp:203G
smer18-3 tmp:203G
smer18-4 tmp:203G
smer19-1 tmp:203G
smer19-2 tmp:203G
smer19-3 tmp:203G
smer19-4 tmp:203G
smer20-1 tmp:203G
smer20-2 tmp:203G
smer20-3 tmp:203G
smer20-4 tmp:203G
smer21-1 tmp:203G
smer21-2 tmp:203G
smer21-3 tmp:203G
smer21-4 tmp:203G
smer22-1 tmp:203G
smer22-2 tmp:203G
smer22-3 tmp:203G
smer22-4 tmp:203G
smer23-1 tmp:203G
smer23-2 tmp:203G
smer23-3 tmp:203G
smer23-4 tmp:203G
smer24-1 tmp:203G
smer24-2 tmp:203G
smer24-3 tmp:203G
smer24-4 tmp:203G
smer25-1 tmp:203G
smer25-2 tmp:203G
smer25-3 tmp:203G
smer25-4 tmp:203G
smer26-1 tmp:203G
smer26-2 tmp:203G
smer26-3 tmp:203G
smer26-4 tmp:203G
smer27-1 tmp:203G
smer27-2 tmp:203G
smer27-3 tmp:203G
smer27-4 tmp:203G
smer28-1 tmp:203G
smer28-2 tmp:203G
smer28-3 tmp:203G
smer28-4 tmp:203G
smer29-1 tmp:203G
smer29-2 tmp:203G
smer29-3 tmp:203G
smer29-4 tmp:203G
smer30-1 tmp:203G
smer30-2 tmp:203G
smer30-3 tmp:203G
smer30-4 tmp:203G
gpu4 gpu:1080Ti:8(S:0-1),tmp:1127G
gpu5 gpu:1080Ti:8(S:0-1),tmp:1127G
gpu8 gpu:2080Ti:8(S:0),tmp:3100G
gpu10 gpu:V100:4,tmp:456G
gpu11 gpu:2080Ti:4,tmp:467G
gpu12 gpu:2080Ti:4,tmp:467G
gpu13 gpu:2080Ti:4,tmp:467G
gpu14 gpu:2080Ti:4,tmp:467G
gpu15 gpu:2080Ti:4,tmp:467G
sb03-05 gpu:A100:1,tmp:467G
gpu9 gpu:2080Ti:8(S:0),tmp:3100G
gpu16 gpu:2080Ti:4,tmp:467G
gpu17 gpu:2080Ti:4,tmp:467G
gpu18 gpu:2080Ti:4,tmp:467G
gpu19 gpu:2080Ti:4,tmp:467G
gpu20 gpu:2080Ti:4,tmp:467G
sb03-06 gpu:A100:1,tmp:467G
sb03-07 gpu:A100:1,tmp:467G
sb03-08 gpu:A100:1,tmp:467G
sb03-09 gpu:A100:1,tmp:467G
sb03-10 gpu:A100:1,tmp:467G
sb03-11 gpu:A100:1,tmp:467G
sb03-12 gpu:A100:1,tmp:467G
sb03-13 gpu:A100:1,tmp:467G
sb03-14 gpu:A100:1,tmp:467G
sb03-15 gpu:A100:1,tmp:467G
sb03-16 gpu:A100:1,tmp:467G
sb03-17 gpu:A100:1,tmp:467G
sb03-18 gpu:A100:1,tmp:467G
sb03-19 gpu:A100:1,tmp:467G
sb03-20 gpu:A100:1,tmp:467G
bn01 (null)
bn02 (null)
bn03 (null)
bn04 (null)
sb04-01 (null)

titansmc avatar Feb 08 '21 13:02 titansmc

The GPU module is basically a wrapper which parses the output of the following commands:

Allocated GPUs

sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2

Depending on which version you are running: on Slurm 20.04, Allocgres is replaced with AllocTRES. The Grafana JSON rev3, however, doesn't have anything for GPUs yet. @mtds

biocyberman avatar Feb 17 '21 08:02 biocyberman

I am running 20.02

titansmc avatar Feb 17 '21 10:02 titansmc

  • @titansmc : take a look at issue #40. There is a fix there that may be helpful in your case.
  • @biocyberman : no, unfortunately the Grafana interface is not yet able to show information about GPUs. We are still waiting for the proper HW on our cluster, so I do not have the chance (at the moment) to do any tests in this regard.

mtds avatar Mar 04 '21 18:03 mtds

@mtds It doesn't work; it produces output like:

[root@lrms1 ~]# sacct -a -X --format=AllocTRES --state=RUNNING --noheader --parsable2
billing=6,cpu=6,mem=35G,node=1
billing=1,cpu=1,mem=2G,node=1
billing=1,cpu=1,mem=2G,node=1
billing=1,cpu=1,mem=2G,node=1

while with Gres:

[root@lrms1 ~]# sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2


gpu:2







gpu:2



gpu:2

What should I expect from that command?

titansmc avatar Mar 05 '21 09:03 titansmc

What should I expect from that command?

I don't have an answer for that, since it is highly dependent on your Slurm configuration related to the GPUs and there is also issue #40 in the middle.

@JoeriHermans : could you possibly offer some insight about the output format? We do not have enough GPUs right now to change our configuration accordingly and run a test. Could you also provide some test data for the output, like the other *_test.go files?

mtds avatar Mar 10 '21 10:03 mtds

This patch seems to work for me:

diff --git a/gpus.go b/gpus.go
index ca3bcaf..9e90421 100644
--- a/gpus.go
+++ b/gpus.go
@@ -38,15 +38,19 @@ func GPUsGetMetrics() *GPUsMetrics {
 func ParseAllocatedGPUs() float64 {
 	var num_gpus = 0.0
 
-	args := []string{"-a", "-X", "--format=Allocgres", "--state=RUNNING", "--noheader", "--parsable2"}
+	args := []string{"-a", "-X", "--format=AllocTRES", "--state=RUNNING", "--noheader", "--parsable2"}
 	output := string(Execute("sacct", args))
 	if len(output) > 0 {
 		for _, line := range strings.Split(output, "\n") {
 			if len(line) > 0 {
 				line = strings.Trim(line, "\"")
-				descriptor := strings.TrimPrefix(line, "gpu:")
-				job_gpus, _ := strconv.ParseFloat(descriptor, 64)
-				num_gpus += job_gpus
+				for _, resource := range strings.Split(line, ",") {
+					if strings.HasPrefix(resource, "gres/gpu=") {
+						descriptor := strings.TrimPrefix(resource, "gres/gpu=")
+						job_gpus, _ := strconv.ParseFloat(descriptor, 64)
+						num_gpus += job_gpus
+					}
+				}
 			}
 		}
 	}
@@ -63,11 +67,17 @@ func ParseTotalGPUs() float64 {
 		for _, line := range strings.Split(output, "\n") {
 			if len(line) > 0 {
 				line = strings.Trim(line, "\"")
-				descriptor := strings.Fields(line)[1]
-				descriptor = strings.TrimPrefix(descriptor, "gpu:")
-				descriptor = strings.Split(descriptor, "(")[0]
-				node_gpus, _ :=  strconv.ParseFloat(descriptor, 64)
-				num_gpus += node_gpus
+				gres := strings.Fields(line)[1]
+				// gres column format: comma-delimited list of resources
+				for _, resource := range strings.Split(gres, ",") {
+					if strings.HasPrefix(resource, "gpu:") {
+						// format: gpu:<type>:N(S:<something>), e.g. gpu:RTX2070:2(S:0)
+						descriptor := strings.Split(resource, ":")[2]
+						descriptor = strings.Split(descriptor, "(")[0]
+						node_gpus, _ :=  strconv.ParseFloat(descriptor, 64)
+						num_gpus += node_gpus
+					}
+				}
 			}
 		}
 	}
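
For illustration, here is a self-contained sketch of the per-line logic of this patch, fed with a hypothetical AllocTRES line (the gres/gpu entry and its value are made up; jobs without GPUs have no such entry and simply contribute nothing):

package main

import (
	"fmt"
	"strconv"
	"strings"
)

func main() {
	// Hypothetical AllocTRES line for a job that allocated 2 GPUs.
	line := "billing=6,cpu=6,gres/gpu=2,mem=35G,node=1"
	var numGPUs float64
	for _, resource := range strings.Split(line, ",") {
		if strings.HasPrefix(resource, "gres/gpu=") {
			n, _ := strconv.ParseFloat(strings.TrimPrefix(resource, "gres/gpu="), 64)
			numGPUs += n
		}
	}
	fmt.Println(numGPUs) // 2
}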

lahwaacz avatar Mar 10 '21 20:03 lahwaacz

Thanks for the patch but as I wrote above, I do not have the chance to run a test on a cluster with GPUs at the moment.

I assume that this patch will work (e.g. no obvious syntax errors) but I am wary of integrating it right now. I would need at least two other people who can test it on their configurations.

mtds avatar Mar 10 '21 20:03 mtds

I can confirm the same problem with Slurm version 20.11.2. So far I have only changed the single line to use the "AllocTRES" argument. @lahwaacz @mtds which Grafana dashboard works with the patch?

crinavar avatar Mar 15 '21 22:03 crinavar

Just updated to this on our system.

The "sinfo -h -o %n %G" command takes the scrape from 1/2 seconds to 2.5 minutes on our system, and it has no GPUs.

'%n' means you have to examine every node in the system to see if it has a GPU, and our Cray has over 10k nodes with no GPUs.

Is there a way to disable this? I changed it from %n to %N and everything is much faster now.

ThomasADavis avatar Mar 18 '21 06:03 ThomasADavis

Is there a way to disable this? I changed it from %n to %N and everything is much faster now.

At the moment there is no way to turn it off, but given the mixed results of this patch, I believe I will add a command line switch to explicitly turn it on; otherwise it will be disabled by default.

The fact that you changed the options and the sinfo command got so much faster makes me wonder.

According to the man page of sinfo, those options are doing the following:

%n
    List of node hostnames. 
%N
    List of node names. 

So, with %n you'll get a complete list:

host001
host002
[...]

while with %N a compressed list of the hosts is printed, like the following:

host0[01-10]

No wonder it's faster in the second case; it may depend on the length of the output, but I am not sure how sinfo goes through the list internally (naively, I would think it should check the configuration files, since it is not mentioned anywhere that this command issues RPC calls to slurmctld).

How many nodes (approximately) do you have on your cluster? We have never tested this exporter with more than 800 nodes, so I cannot say how performant those sinfo commands are on very big installations.

mtds avatar Mar 18 '21 07:03 mtds

[...] which grafana dashboard works with the patch?

@crinavar : you can try the one I made, available here: https://grafana.com/grafana/dashboards/4323

Note: there are no graph panels (yet) for GPUs, since we do not have much of that HW in our current installation, so I have not had the chance to create an additional dashboard so far.

mtds avatar Mar 18 '21 09:03 mtds

while with %N a compressed list of the hosts is printed, like the following:

host0[01-10]

This also means that if there were some GPUs on these nodes, they would not be counted correctly. The exporter expects one node per line and does not know that e.g. host0[01-10] is 10 nodes...
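
One possible workaround (just a sketch, not something the exporter currently does) would be to expand the compressed hostlist with scontrol before counting, since scontrol show hostnames prints one node per line:

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// Expand a compressed Slurm hostlist such as "host0[01-10]" into individual
// node names by shelling out to scontrol.
func expandHostlist(list string) ([]string, error) {
	out, err := exec.Command("scontrol", "show", "hostnames", list).Output()
	if err != nil {
		return nil, err
	}
	return strings.Fields(string(out)), nil
}

func main() {
	hosts, err := expandHostlist("host0[01-10]")
	if err != nil {
		fmt.Println("scontrol failed:", err)
		return
	}
	fmt.Println(len(hosts), "nodes:", hosts) // 10 nodes
}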

lahwaacz avatar Mar 18 '21 11:03 lahwaacz

Cori (cori.nersc.gov) currently has 2,388 Haswell nodes and 9,688 KNL nodes. No GPUs.

For Perlmutter I am not at liberty to disclose details at this time, but it has GPUs and it will have more nodes than Cori.

ThomasADavis avatar Mar 18 '21 16:03 ThomasADavis

@ThomasADavis : I see. That's quite a difference in terms of installation size.

Take a look at the gpus_acct branch.

There is only 1 commit of difference from the master branch:

  • By default, the GPUs collector is now disabled.
  • A command line option -gpus-acct must be set to true in order to enable it.

mtds avatar Mar 18 '21 16:03 mtds

I'll just add.. "We break things."

I will look at it. I thought we were still under blackout, but they did post that there will be 6000+ GPUs in the Perlmutter phase 1 system.

ThomasADavis avatar Mar 18 '21 16:03 ThomasADavis

I'll just add.. "We break things."

For us it's interesting to know that there are such big installations using this exporter! And bug reports are always welcome :-)

[...] but they did post that there will be 6000+ GPUs in the Perlmutter phase 1 system.

Those are definitely more GPUs than we are expecting to receive and install in the next months...and next years as well, I guess.

I cannot say now whether sacct will perform correctly: with utilities that interact directly with the Slurm DB backend, there is always the possibility that 'horrible' SQL queries behind the scenes will result in timed-out answers. This is the reason why (whenever possible) we have used only sinfo, squeue and sdiag.

mtds avatar Mar 18 '21 17:03 mtds

We have a contract with the slurm people to deal with some of those issues.

ThomasADavis avatar Mar 18 '21 17:03 ThomasADavis

[...] which grafana dashboard works with the patch?

@crinavar : you can try the one I made, available here: https://grafana.com/grafana/dashboards/4323

Note: there are no graph panels (yet) for GPUs, since we do not have much of that HW in our current installation, so I have not had the chance to create an additional dashboard so far.

Many thanks. I am now testing the patch presented by @lahwaacz, but it was giving this error:

slurm-exporter_1 | panic: runtime error: index out of range

EDIT: solved, the patch now works and the exporter is working properly. For anyone having the same problem, the error was caused because I had the gres.conf file like this:

# GRES configuration for native GPUS
# DGX A100: 8x Nvidia A100

# Autodetect not working
#AutoDetect=nvml
Name=gpu File=/dev/nvidia[0-7]

Not having the "Type" field made index "2" in the gpus.go patch produce an out-of-bounds error. Adding Type solved the problem:

# GRES configuration for native GPUS
# DGX A100: 8x Nvidia A100

# Autodetect not working
#AutoDetect=nvml
Name=gpu Type=A100 File=/dev/nvidia[0-7]

The patch actually has a very important comment I didn't pay attention to: // format: gpu:<type>:N(S:<something>), e.g. gpu:RTX2070:2(S:0)
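
A more defensive variant of that parsing step (a sketch, not the code that ended up in the exporter) could take the last colon-separated field instead of a fixed index, so a GRES entry without a Type would not cause a panic:

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Count GPUs from a GRES descriptor whether or not a Type is present,
// e.g. "gpu:8", "gpu:A100:1" or "gpu:2080Ti:8(S:0)".
func countGPUs(resource string) float64 {
	fields := strings.Split(resource, ":")
	descriptor := fields[len(fields)-1]            // the count is always the last field
	descriptor = strings.Split(descriptor, "(")[0] // drop "(S:...)" / "(IDX:...)" suffixes
	n, _ := strconv.ParseFloat(descriptor, 64)
	return n
}

func main() {
	for _, r := range []string{"gpu:8", "gpu:A100:1", "gpu:2080Ti:8(S:0)"} {
		fmt.Println(r, "->", countGPUs(r))
	}
}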

best

crinavar avatar Apr 10 '21 16:04 crinavar

Version 0.19 introduces a breaking change: by default, GPU accounting will not be enabled, but a command line option can be used to explicitly activate it. Note that the exporter will also log the status of this function, whether it's enabled or not.

Considering the ongoing discussion here and what was also reported in issue #40, we have decided to change the default behaviour of the exporter and play it safe by keeping this functionality off by default.

Until we have a chance to test this feature on our cluster (we are going through the process of acquiring new servers equipped with GPUs), we will leave these issues open.

It would be useful if other users could report how the GPUs accounting functionality is working in their infrastructure. In particular:

  • version of Slurm;
  • details about the GPU configuration in Slurm.

mtds avatar Apr 16 '21 18:04 mtds

Ever since v19.05.0rc1, Slurm provides another way to check for available and active GRES, i.e. via: sinfo -a -h --Format=Nodes,Gres,GresUsed. I have refactored gpus.go to be based on this call and to also consider cases where gres_type is defined. See PR #73. I have tested it on Slurm 21.08.5; note that at the moment, for Slurm versions below 19.05.0rc1, it is better to stay on the old implementation.
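
As a rough illustration of how that output can be aggregated (a sketch based on the sample output further down in this thread, not the implementation from PR #73), each line carries a node count and a per-node GRES descriptor, so the totals are the product of the two:

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Aggregate lines of the form "<node count> gpu:<type>:<gpus per node>"
// as produced by sinfo -a -h --Format=Nodes,Gres; "(null)" entries are skipped.
func totalGPUs(lines []string) float64 {
	var total float64
	for _, line := range lines {
		fields := strings.Fields(line)
		if len(fields) < 2 || !strings.HasPrefix(fields[1], "gpu:") {
			continue
		}
		nodes, _ := strconv.ParseFloat(fields[0], 64)
		parts := strings.Split(fields[1], ":")
		perNode, _ := strconv.ParseFloat(strings.Split(parts[len(parts)-1], "(")[0], 64)
		total += nodes * perNode
	}
	return total
}

func main() {
	fmt.Println(totalGPUs([]string{"4 (null)", "8 gpu:tesla:4"})) // 32
}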

itzsimpl avatar Mar 04 '22 15:03 itzsimpl

@itzsimpl I have merged your updated PR into the development branch (among other contributions). Please take a look and let me know if it works.

We are currently not able to test this exporter, for the GPU part, on newer versions of Slurm, so I am trusting the feedback from other users about stability and functionality.

Last but not least: thanks!!

mtds avatar Mar 29 '22 13:03 mtds

Hey, some feedback for Slurm 20.11.9 (CentOS 7, 20.11.9-1.el7.x86_64). As far as I can tell, the GPU export looks alright using the development branch.

Raw Output:

sinfo -a -h --Format="Nodes: ,GresUsed:" --state=allocated
4 gpu:tesla:4(IDX:0-3)
1 gpu:tesla:3(IDX:0,2-3)
1 gpu:tesla:1(IDX:1)
1 gpu:tesla:1(IDX:3)
1 gpu:tesla:2(IDX:1-2)

sinfo -a -h --Format="Nodes: ,Gres: ,GresUsed:" --state=idle,allocated
4 (null) gpu:0
4 gpu:tesla:4 gpu:tesla:4(IDX:0-3)
1 gpu:tesla:4 gpu:tesla:3(IDX:0,2-3)
1 gpu:tesla:4 gpu:tesla:1(IDX:1)
1 gpu:tesla:4 gpu:tesla:1(IDX:3)
1 gpu:tesla:4 gpu:tesla:2(IDX:1-2)

sinfo -a -h --Format="Nodes: ,Gres:"
4 (null)
8 gpu:tesla:4

Parsed:

# HELP slurm_gpus_alloc Allocated GPUs
# TYPE slurm_gpus_alloc gauge
slurm_gpus_alloc 23
# HELP slurm_gpus_idle Idle GPUs
# TYPE slurm_gpus_idle gauge
slurm_gpus_idle 9
# HELP slurm_gpus_other Other GPUs
# TYPE slurm_gpus_other gauge
slurm_gpus_other 0
# HELP slurm_gpus_total Total GPUs
# TYPE slurm_gpus_total gauge
slurm_gpus_total 32
# HELP slurm_gpus_utilization Total GPU utilization
# TYPE slurm_gpus_utilization gauge
slurm_gpus_utilization 0.71875
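
As a sanity check, those numbers are internally consistent with the raw output above: 8 nodes × 4 GPUs = 32 total; allocated = 4×4 + 3 + 1 + 1 + 2 = 23; idle = 32 - 23 = 9; utilization = 23 / 32 = 0.71875.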

Got a GPU cluster if you need further testing before a new release. Anything I can help with to get the development branch ready to merge?

martialblog avatar Jun 14 '22 09:06 martialblog

@martialblog thanks for testing; I did not have the chance to test the development branch, but PR #73 has been up and running on SLURM 21.08.5 for a couple of months now.

itzsimpl avatar Jun 14 '22 13:06 itzsimpl