cube-studio: Grafana module monitoring error

[Screenshot: 0180ed606c1b58161fb392a7fdbd5ee]

Jun 09 '22 01:06 gilearn

The logs are as follows:

t=2022-06-09T14:16:26+0800 lvl=eror msg="Request Completed" logger=context userId=0 orgId=1 uname= method=GET path=/api/datasources/proxy/1/api/v1/query_range status=502 remote_addr=10.168.2.252 time_ms=30002 size=0 referer="http://10.168.2.252/grafana/d/all-node/all-node?orgId=1&refresh=5s"
t=2022-06-09T14:16:26+0800 lvl=eror msg="Data proxy error" logger=data-proxy-log userId=0 orgId=1 uname= path=/api/datasources/proxy/1/api/v1/query_range remote_addr=10.168.2.252 referer="http://10.168.2.252/grafana/d/all-node/all-node?orgId=1&refresh=5s" error="http: proxy error: dial tcp 10.43.83.231:9090: i/o timeout"
t=2022-06-09T14:16:26+0800 lvl=eror msg="Request Completed" logger=context userId=0 orgId=1 uname= method=GET path=/api/datasources/proxy/1/api/v1/query_range status=502 remote_addr=10.168.2.252 time_ms=30001 size=0 referer="http://10.168.2.252/grafana/d/all-node/all-node?orgId=1&refresh=5s"
t=2022-06-09T14:16:31+0800 lvl=eror msg="Alert Rule Result Error" logger=alerting.evalContext ruleId=1 name="System load alert" error="alert execution exceeded the timeout" changing state to=alerting
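The lines above are in a logfmt-style `key=value` layout, so the failing field can be pulled out programmatically. The following is a minimal Python sketch, assuming values are either bare or double-quoted; `parse_logfmt` is a hypothetical helper written for illustration, not part of Grafana.

```python
import shlex

def parse_logfmt(line: str) -> dict:
    """Split a logfmt-style line into a dict of key -> value."""
    fields = {}
    # shlex keeps double-quoted values together and strips the quotes.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore bare words that carry no '='
            fields[key] = value
    return fields

# The "Data proxy error" line from the log above, copied verbatim.
line = (
    't=2022-06-09T14:16:26+0800 lvl=eror msg="Data proxy error" '
    'logger=data-proxy-log userId=0 orgId=1 uname= '
    'path=/api/datasources/proxy/1/api/v1/query_range '
    'remote_addr=10.168.2.252 '
    'referer="http://10.168.2.252/grafana/d/all-node/all-node?orgId=1&refresh=5s" '
    'error="http: proxy error: dial tcp 10.43.83.231:9090: i/o timeout"'
)

entry = parse_logfmt(line)
print(entry["msg"], "->", entry["error"])
```

The `error` field shows that the request never reached a Prometheus HTTP handler: the TCP dial to 10.43.83.231:9090 itself timed out.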

I found this post: https://grafana.docs.cern.ch/4._Troubleshooting/2-proxy-error-context-canceled/ but I still don't quite understand it. Which JS code is that post telling me to change?

Jun 09 '22 06:06 gilearn

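For the Grafana dashboards to show data, several pieces have to work:
1. The collectors on each machine are healthy, including node-exporter and dcgm-exporter (GPU).
2. Prometheus is healthy; you can check in Prometheus itself whether the data looks right.
3. Grafana is healthy, mainly the dashboard configuration.
4. Finally, the gateway, which provides URL access to Grafana.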
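Items 1 and 2 of the checklist above can be checked together by asking Prometheus for the `up` metric, which is 1 for every scrape target that answered. Below is a minimal Python sketch that filters such a response for targets that are down. The sample data, job names, and instance addresses are made up for illustration; `down_targets` is a hypothetical helper.

```python
def down_targets(response: dict) -> list:
    """Return (job, instance) pairs whose `up` sample is not "1"."""
    down = []
    for sample in response["data"]["result"]:
        # Instant-vector samples are [unix_timestamp, "<value as string>"].
        _, value = sample["value"]
        if value != "1":
            metric = sample["metric"]
            down.append((metric.get("job", ""), metric.get("instance", "")))
    return down

# Made-up data shaped like the JSON body of GET /api/v1/query?query=up.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "node-exporter",
                        "instance": "10.0.0.1:9100"},
             "value": [1655189875.407, "1"]},
            {"metric": {"__name__": "up", "job": "dcgm-exporter",
                        "instance": "10.0.0.2:9400"},
             "value": [1655189875.407, "0"]},
        ],
    },
}

print(down_targets(sample_response))
```

An empty list would suggest the collectors are being scraped, which shifts suspicion to Grafana's data source or the network path.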

Judging from your screenshot, Grafana apparently cannot connect to Prometheus, so I suspect your Prometheus did not start properly.
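One way to tell "Prometheus is not running" apart from "Prometheus is unreachable" is a plain TCP connection test from the pod that reports the error. This is a minimal Python sketch; `can_connect` is a hypothetical helper, and the 3-second timeout is an arbitrary choice.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Calling `can_connect("10.43.83.231", 9090)` from inside the Grafana pod targets the address from the Grafana log. If it returns False while the Prometheus pod is running, the problem is more likely in the Service or cluster network than in Prometheus itself.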

Jun 10 '22 05:06 data-infra

Prometheus is up and running normally. There are some warnings in the log; I don't know whether they matter:
level=warn ts=2022-06-14T04:58:58.12469297Z caller=klog.go:86 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:262: watch of *v1.Endpoints ended with: too old resource version: 3201509 (3202538)"
level=info ts=2022-06-14T05:00:03.954078526Z caller=compact.go:496 component=tsdb msg="write block" mint=1655172000000 maxt=1655179200000 ulid=01G5G9ADRC0DSEC1K1KEM5TZ81 duration=3.301850879s
level=info ts=2022-06-14T05:00:04.202635464Z caller=head.go:536 component=tsdb msg="head GC completed" duration=108.122048ms
level=info ts=2022-06-14T05:00:09.252410301Z caller=head.go:583 component=tsdb msg="WAL checkpoint complete" first=173 last=175 duration=5.049629249s

Jun 14 '22 06:06 gilearn

kubeflow-prometheus-adapter is running normally.

Its log reports this error:
E0614 06:59:25.408045 1 provider.go:229] unable to update list of all metrics: unable to fetch metrics for query "{__name__="DCGM_FI_DEV_GPU_UTIL",exported_pod!="",exported_namespace="service"}": Get "http://prometheus-k8s.monitoring.svc:9090/api/v1/series?match%5B%5D=%7B__name__%3D%22DCGM_FI_DEV_GPU_UTIL%22%2Cexported_pod%21%3D%22%22%2Cexported_namespace%3D%22service%22%7D&start=1655189875.407": dial tcp 10.43.83.231:9090: i/o timeout
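The percent-encoded URL in the error above is easier to read once decoded. This is a small Python sketch using only the standard library; the URL is copied from the log line.

```python
from urllib.parse import urlsplit, parse_qs

# Request URL taken verbatim from the adapter error.
url = (
    "http://prometheus-k8s.monitoring.svc:9090/api/v1/series"
    "?match%5B%5D=%7B__name__%3D%22DCGM_FI_DEV_GPU_UTIL%22%2C"
    "exported_pod%21%3D%22%22%2Cexported_namespace%3D%22service%22%7D"
    "&start=1655189875.407"
)

parts = urlsplit(url)
params = parse_qs(parts.query)   # decodes both keys and values
selector = params["match[]"][0]  # the series selector sent to Prometheus

print(parts.hostname, parts.port)
print(selector)
```

The adapter addresses the same service port that Grafana's data proxy failed to reach (10.43.83.231:9090), and it fails with the same `i/o timeout`, so both components are hitting one connectivity problem rather than two separate ones.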

Jun 14 '22 07:06 gilearn

Opening a shell in the kubeflow-prometheus-adapter module reports an error:

Jun 14 '22 07:06 gilearn

Commit 0b6b2aa03dadbf90e42cba02f853a69d93d5c40e fixed this issue.

Jun 15 '22 04:06 data-infra
