HAMi

Fix Grafana dashboard and clarify dashboard usage

Open. jiangsanyin opened this pull request.

Signed-off-by: jiangsanyin [email protected]

What type of PR is this? /kind bug

What this PR does / why we need it: Fixes the Grafana dashboard and clarifies how to use it. Thanks to fangfenghuang (https://github.com/fangfenghuang) for the help.

Which issue(s) this PR fixes: Fixes #498 #468

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

jiangsanyin, Oct 10 '24

@fangfenghuang Can you help review this PR?

wawa0210, Oct 24 '24

Codecov Report

All modified and coverable lines are covered by tests ✅

Flag Coverage Δ
unittests 27.09% <ø> (ø)

Flags with carried forward coverage won't be shown.

codecov[bot], Oct 24 '24

@jiangsanyin I followed the installation instructions in the documentation but ran into a minor issue, which I also mentioned earlier in Issue #498. By default, dcgm-exporter only includes the Hostname label. To match the current Grafana dashboard configuration, a relabeling that adds node_name has to be configured when installing dcgm-exporter:

https://github.com/NVIDIA/dcgm-exporter/blob/b97b7633e3f39f7a537bd77561cc0ec0c2dca3f5/deployment/values.yaml#L117C3-L117C18

This relabeling should be consistent with the configurations for hami-device-plugin-svc-monitor and hami-scheduler-svc-monitor.

It would be helpful to include this in the documentation, since users unfamiliar with the Prometheus stack may struggle to configure everything correctly on the first attempt.
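For reference, a minimal sketch of dcgm-exporter Helm values that would add node_name. It assumes that the chart's serviceMonitor.relabelings list (the field at the values.yaml line linked above) is copied unchanged into the generated ServiceMonitor endpoint, so it uses the Prometheus Operator relabeling fields. The single rule mirrors the node_name rule in hami-device-plugin-svc-monitor; check the field names against the chart version you actually deploy.

```yaml
# Sketch of a values override for the dcgm-exporter Helm chart.
# Assumption: serviceMonitor.relabelings is passed through to the ServiceMonitor endpoint.
serviceMonitor:
  enabled: true
  relabelings:
    # Copy the Kubernetes node name of the scraped pod into node_name,
    # the label the HAMi ServiceMonitors and the Grafana dashboard rely on.
    - sourceLabels: [__meta_kubernetes_pod_node_name]
      regex: (.*)
      targetLabel: node_name
      replacement: ${1}
      action: replace
```

The override can be passed with -f when installing or upgrading the chart; after Prometheus reloads the ServiceMonitor, dcgm-exporter series should carry node_name alongside Hostname.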

Nimbus318, Nov 25 '24


Have you created and applied the ServiceMonitor as described in dashboard.md or dashboard_cn.md? The node_name label is added once this is done:

    # Create the file hami-device-plugin-svc-monitor.yaml
    root@controller01:~# cat hami-device-plugin-svc-monitor.yaml
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: hami-device-plugin-svc-monitor
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/component: hami-device-plugin
      namespaceSelector:
        matchNames:
          - "kube-system"
      endpoints:
        - path: /metrics
          port: monitorport
          interval: "15s"
          honorLabels: false
          relabelings:
            - sourceLabels: [__meta_kubernetes_endpoints_name]
              regex: hami-.*
              replacement: $1
              action: keep
            - sourceLabels: [__meta_kubernetes_pod_node_name]
              regex: (.*)
              targetLabel: node_name
              replacement: ${1}
              action: replace
            - sourceLabels: [__meta_kubernetes_pod_host_ip]
              regex: (.*)
              targetLabel: ip
              replacement: $1
              action: replace

    # Apply the file hami-device-plugin-svc-monitor.yaml
    root@controller01:~# kubectl apply -f hami-device-plugin-svc-monitor.yaml

jiangsanyin, Nov 25 '24

@jiangsanyin Both are correct. What I meant is that you may have left out the explanation of the relabel configuration for dcgm-exporter. By default, dcgm-exporter only includes the Hostname label.

It's important to document this configuration so that it aligns with the relabeling setup in hami-device-plugin-svc-monitor. Without this explanation, users may not add the necessary node_name relabeling when setting up dcgm-exporter.

Nimbus318, Nov 25 '24


OK, thanks for your review. The relabeling configuration in the ServiceMonitor for dcgm-exporter has been added to dashboard_cn.md and dashboard.md; please check.
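For deployments that do not use the dcgm-exporter Helm chart, a minimal sketch of a standalone ServiceMonitor carrying the same node_name rule. This is not the exact manifest added to dashboard.md (that manifest is not reproduced in this thread). The names dcgm-exporter-svc-monitor and gpu-monitoring, the selector label app.kubernetes.io/name: dcgm-exporter, and the port name metrics are hypothetical and must be adjusted to match your dcgm-exporter Service.

```yaml
# Hypothetical standalone ServiceMonitor for dcgm-exporter (sketch).
# Assumptions: namespace gpu-monitoring, Service label app.kubernetes.io/name: dcgm-exporter,
# and a Service port named "metrics".
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter-svc-monitor
  namespace: gpu-monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  namespaceSelector:
    matchNames:
      - "gpu-monitoring"
  endpoints:
    - path: /metrics
      port: metrics
      interval: "15s"
      honorLabels: false
      relabelings:
        # Same node_name rule as hami-device-plugin-svc-monitor, so dashboard
        # queries can match dcgm-exporter and HAMi series on node_name.
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          regex: (.*)
          targetLabel: node_name
          replacement: ${1}
          action: replace
```

Keeping the rule identical to the HAMi ServiceMonitors is the point of the design: the dashboard only works if every exporter names the node the same way.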

jiangsanyin, Nov 28 '24

/lgtm

Nimbus318, Nov 28 '24

/lgtm

wawa0210, Dec 19 '24