treydock
We actually don't have a dashboard using the metrics from this exporter, at least not yet. We do have numerous Prometheus recording rules, which I've uploaded here: https://github.com/treydock/infiniband_exporter/blob/main/examples/infiniband.rules...
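For a rough idea of the shape of those rules, one entry looks something like this (a minimal sketch: the group, rule, and metric names here are illustrative, not the actual ones from the linked file):

```
groups:
  - name: infiniband
    rules:
      # Illustrative: precompute a 5m byte rate per switch port
      - record: infiniband:switch_port_receive_bytes:rate5m
        expr: rate(infiniband_switch_port_receive_data_bytes_total[5m])
```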
Looks like some of my colleagues did put some dashboards into our Grafana instance, so I've dumped those into the examples folder at https://github.com/treydock/infiniband_exporter/tree/main/examples. Most of the PromQL relies on the record...
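As a sketch of the pattern the dashboard panels follow, a query against a recorded series would look something like this (using the illustrative rule name from the sketch above, not necessarily what the real rules define):

```
topk(10, infiniband:switch_port_receive_bytes:rate5m)
```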
@vsoch You're trying to SSH from where to where? Like from Docker container to Docker container, or from a Docker container to a host? I'm guessing the place you're trying to...
What is the full command used to launch the exporter? If you have HCA or switch collection enabled, try passing `--no-collector.hca` or `--no-collector.switch` and provide what metric is returned for...
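For example, a run disabling one collector at a time would look something like this (the binary name and path are placeholders for however you deploy it; the flags are the ones mentioned above):

```
./infiniband_exporter --no-collector.switch
# or
./infiniband_exporter --no-collector.hca
```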
What metrics are present related to `infiniband_switch_collect_timeout`? You should be able to do something like this when launched with `--no-collector.switch`:

```
curl http://localhost:9315/metrics | grep infiniband_switch_collect_timeout
```

I'm curious why...
I found the problem and am testing fixes in #32.
This is part of https://github.com/treydock/infiniband_exporter/releases/tag/v0.10.0-rc.1. The new Docker image is also pushed with the v0.10.0-rc.1 tag. I need to do more testing and also merge another PR that needs some extra...
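If you want to try the tagged image, pulling it would look something like this (assuming the image is published as treydock/infiniband_exporter; adjust the registry/path to wherever the image actually lives):

```
docker pull treydock/infiniband_exporter:v0.10.0-rc.1
```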
I will note that the ability to tie a given GPU to a job is something very specific to OSC and does not come from exporters like the one NVIDIA...
Ah, a custom exporter. We just use the one from NVIDIA: https://github.com/NVIDIA/dcgm-exporter. Our prolog script:

```
if [ "x${CUDA_VISIBLE_DEVICES}" != "x" ]; then
  GPU_INFO_PROM=${METRICS_DIR}/slurm_job_gpu_info-${SLURM_JOB_ID}.prom
  cat > $GPU_INFO_PROM.$$
  ...
  $GPU_INFO_PROM.$$
done
IFS=$OIFS
/bin/mv ...
```
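The net effect is a textfile-collector style .prom file per job. Roughly, its contents would look something like this (a sketch only: the jobid/gpu values are made up, the label names are inferred from the PromQL below, and the host label may come from relabeling rather than the file itself):

```
# Hypothetical contents of ${METRICS_DIR}/slurm_job_gpu_info-<jobid>.prom
slurm_job_gpu_info{jobid="123456",gpu="0"} 1
slurm_job_gpu_info{jobid="123456",gpu="1"} 1
```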
PromQL from our dashboards that ties a job to a given GPU:

```
DCGM_FI_DEV_GPU_UTIL{cluster="$cluster",host=~"$host"} * ON(host,gpu) slurm_job_gpu_info{jobid="$jobid"}
DCGM_FI_DEV_MEM_COPY_UTIL{cluster="$cluster",host=~"$host"} * ON(host,gpu) slurm_job_gpu_info{jobid="$jobid"}
DCGM_FI_DEV_FB_USED{cluster="$cluster",host=~"$host"} * ON(host,gpu) slurm_job_gpu_info{jobid="$jobid"}
max((DCGM_FI_DEV_FB_FREE{cluster="$cluster",host=~"$host"} + DCGM_FI_DEV_FB_USED{cluster="$cluster",host=~"$host"}) * ON(host,gpu)...
```