treydock
We actually don't have a dashboard using the metrics from this exporter, at least not yet. We do have numerous Prometheus recording rules, which I've uploaded here: https://github.com/treydock/infiniband_exporter/blob/main/examples/infiniband.rules...
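For a rough idea of the shape of those rules, one entry looks something like this (a minimal sketch: the group, rule, and metric names here are illustrative, not the actual ones from the linked file):

```
groups:
  - name: infiniband
    rules:
      # Illustrative: precompute a 5m byte rate per switch port
      - record: infiniband:switch_port_receive_bytes:rate5m
        expr: rate(infiniband_switch_port_receive_data_bytes_total[5m])
```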
Looks like some of my colleagues did put some dashboards into our Grafana instance, so I've dumped those into the examples folder at https://github.com/treydock/infiniband_exporter/tree/main/examples. Most of the PromQL relies on the record...
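As a sketch of the pattern the dashboard panels follow, a query against a recorded series would look something like this (using the illustrative rule name from the sketch above, not necessarily what the real rules define):

```
topk(10, infiniband:switch_port_receive_bytes:rate5m)
```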
@vsoch You're trying to SSH from where to where? Like from Docker container to Docker container, or from a Docker container to a host? I'm guessing the place you're trying to...
What is the full command used to launch the exporter? If you have HCA or switch collection enabled, try passing `--no-collector.hca` or `--no-collector.switch` and provide what metric is returned for...
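For example, a run disabling one collector at a time would look something like this (the binary name and path are placeholders for however you deploy it; the flags are the ones mentioned above):

```
./infiniband_exporter --no-collector.switch
# or
./infiniband_exporter --no-collector.hca
```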
What metrics are present related to `infiniband_switch_collect_timeout`? You should be able to do something like this when launched with `--no-collector.switch`:

```
curl http://localhost:9315/metrics | grep infiniband_switch_collect_timeout
```

I'm curious why...
I found the problem and am testing fixes in #32.
This is part of https://github.com/treydock/infiniband_exporter/releases/tag/v0.10.0-rc.1. The new Docker image is also pushed with the v0.10.0-rc.1 tag. I need to do more testing and also merge another PR that needs some extra...
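If you want to try the tagged image, pulling it would look something like this (assuming the image is published as treydock/infiniband_exporter; adjust the registry/path to wherever the image actually lives):

```
docker pull treydock/infiniband_exporter:v0.10.0-rc.1
```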
I will note that the ability to tie a given GPU to a job is something very specific to OSC and does not come from exporters like the one NVIDIA...
Ah, a custom exporter. We just use the one from NVIDIA: https://github.com/NVIDIA/dcgm-exporter. Our prolog script:

```
if [ "x${CUDA_VISIBLE_DEVICES}" != "x" ]; then
  GPU_INFO_PROM=${METRICS_DIR}/slurm_job_gpu_info-${SLURM_JOB_ID}.prom
  cat > $GPU_INFO_PROM.$$
  ...
  $GPU_INFO_PROM.$$
done
IFS=$OIFS
/bin/mv ...
```
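The net effect is a textfile-collector style .prom file per job. Roughly, its contents would look something like this (a sketch only: the jobid/gpu values are made up, the label names are inferred from the PromQL below, and the host label may come from relabeling rather than the file itself):

```
# Hypothetical contents of ${METRICS_DIR}/slurm_job_gpu_info-<jobid>.prom
slurm_job_gpu_info{jobid="123456",gpu="0"} 1
slurm_job_gpu_info{jobid="123456",gpu="1"} 1
```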
PromQL from our dashboards that ties a job to a given GPU:

```
DCGM_FI_DEV_GPU_UTIL{cluster="$cluster",host=~"$host"} * ON(host,gpu) slurm_job_gpu_info{jobid="$jobid"}
DCGM_FI_DEV_MEM_COPY_UTIL{cluster="$cluster",host=~"$host"} * ON(host,gpu) slurm_job_gpu_info{jobid="$jobid"}
DCGM_FI_DEV_FB_USED{cluster="$cluster",host=~"$host"} * ON(host,gpu) slurm_job_gpu_info{jobid="$jobid"}
max((DCGM_FI_DEV_FB_FREE{cluster="$cluster",host=~"$host"} + DCGM_FI_DEV_FB_USED{cluster="$cluster",host=~"$host"}) * ON(host,gpu)...
```