Monitoring for GraphScope

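Prometheus is an open-source system monitoring and alerting solution in the Kubernetes ecosystem. In this task, you will combine Prometheus with GraphScope to provide a unified UI for monitoring and tracing the running state of a GraphScope cluster. The subtasks are as follows: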

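Producing monitoring data

[Python] Expose an HTTP service in the GraphScope Coordinator that produces data in a format Prometheus can recognize
[Python] Implement a Prometheus Exporter for GraphScope Store. The Exporter can be a Python script that parses the service logs and produces data in a format Prometheus can recognize
[Integration] Hook up NodeExporter to monitor cluster node status, such as CPU, memory, and disk space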
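
As a rough illustration of the first subtask, the sketch below serves a `/metrics` endpoint in the Prometheus text exposition format using only the Python standard library. The metric names, the `METRICS` store, and the port are hypothetical and are not taken from the GraphScope code base; a real implementation would more likely use a Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory metric store: name -> (help text, type, value).
METRICS = {
    "gs_session_connected": ("1 if a client session is connected", "gauge", 0.0),
    "gs_analytical_ops_total": ("Analytical ops completed", "counter", 0.0),
}


def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, mtype, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9968):
    """Block and serve /metrics; the port number is an arbitrary choice."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` in a background thread of the process that owns the metrics would let Prometheus scrape `http://<host>:9968/metrics`.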

Monitoring UI

Configure Prometheus and use Grafana to visualize the generated data.

In addition, you can further integrate AlertManager to round out the user experience.

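GraphScope Coordinator metrics

GraphScope Session status
Graph loading time
Graph data size
Number of completed analytical tasks
Number of completed interactive tasks
Time per iteration of analytical tasks
Number of messages passed per iteration of analytical tasks
Query time of interactive tasks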

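GraphScope GIE service metrics

Number of vertex/edge labels
Number of vertices/edges
Current graph data size
Service latency
Service QPS
Service RT
Service throughput
Number of successful/failed requests
Service failure rate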

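Project deliverables

A Prometheus Exporter, which can be a Python script, that parses the GraphScope Store service logs and produces data in a format Prometheus can recognize
An HTTP service exposed in the GraphScope Coordinator that produces data in a format Prometheus can recognize
A visualization dashboard built with Prometheus and Grafana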

Help

vineyard image: registry.cn-hongkong.aliyuncs.com/graphscope/graphscope-vineyard:latest

May 11 '21 09:05 yecol

Integration with Prometheus is also a possible approach for monitoring GraphScope clusters.

May 14 '21 05:05 sighingnow

Prometheus is an open-source systems monitoring and alerting toolkit in the Kubernetes ecosystem. In this task, you will be asked to provide a unified monitoring UI for GraphScope. At least the following features should be included:

Data producer

[Python] Expose an HTTP service in the GraphScope Coordinator that produces Prometheus-recognizable data.
[Python] Implement a Prometheus exporter for the GraphScope Store Service. This exporter can be a Python script that parses the service log and produces Prometheus-recognizable data.
[Integration] Monitor node stats, e.g., CPU, memory, and disk space, with Node Exporter.

Monitoring UI

Configure Prometheus and visualize the GraphScope data with Grafana.

Additionally, you can further integrate with AlertManager for a more complete and easy-to-use experience.

Metrics for GraphScope Coordinator

status of the GraphScope Session
graph loading time
size of the existing graph data
number of analytical tasks that have been completed
performance (time per iteration) of analytical tasks
number of messages passed by analytical tasks
performance (query time) of interactive tasks

Metrics for GraphScope Store Service

number of vertex/edge labels
number of vertices/edges
size of current graph
service latency
service QPS
service RT
service throughput
number of failed/successful requests
failure rate

May 05 '22 15:05 lidongze0629

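Some problems encountered with Coordinator monitoring:

Graph loading time: each graph has its own name, and the loading times of different graphs need to be distinguished. Where can this name be obtained?
Analytical task time:
- The coordinator sends task requests to the computing engine over gRPC, so the time from the gRPC client sending the request to the response arriving can be measured.
- However, what gRPC sends can be a list of requests containing multiple tasks. Getting the execution time of each individual task in the coordinator is difficult; this information has to come from the engine server. Alternatively, could it be obtained from fetch logs?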

Jul 11 '22 04:07 VincentFF

@VincentFF

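1. [GraphScope Coordinator metrics] Graph loading time:

What the Coordinator sees is a DAG made up of individual OPs, so from the Coordinator's point of view it is hard to pin down the notion of "a graph". For example, OP types such as CREATE_GRAPH, ADD_LABELS, and SUBGRAPH all return a graph. We therefore do not measure graph loading time separately; instead we record the execution time of each OP type (which in fact also covers metrics such as graph loading time and interactive engine query time), and finally show the time of every op executed by the coordinator in a time-series chart similar to the one below (different ops in different colors).

2. Analytical task time

The per-iteration time and message count of analytical tasks can be obtained from the coordinator's log; the exact location is here. However, the log output currently does not include the time or the message count. We can add them on the Engine side in whatever format we expect.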

def write(self, line):
    line = self._filter_progress(line)
    if line is None:
        return
    # The next two lines are enough to write the engine's execution log to a file.
    with open("/tmp/coordinator_stdstream.log", "a") as f:
        f.write(line)
    self._stream_backup.write(line)
    self._stream_backup.flush()
    line = line.encode("utf-8", "ignore").decode("utf-8")
    if not self._drop:
        self._lines.put(line)

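Also, for the GraphScope Coordinator metrics part, could you take some time to first sketch a design of the final visualization? For example, which Grafana chart shows which metric and what the overall layout looks like. Once that is settled, we will also know what log information needs to be generated. A simple design is enough, and a hand drawing is fine too.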

Jul 12 '22 09:07 lidongze0629

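@lidongze0629 That was a very clear explanation, thank you very much. I will look into the current content and format of the log to see what still needs to be added or changed, and then get back to you. As for the "design of the final visualization", I will think about it over the next couple of days and post here once it is ready.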

Jul 13 '22 12:07 VincentFF

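July 19: Coordinator monitoring metrics

1. Graph loading time 2. Query time of interactive tasks

These can be shown by measuring OP execution time, all placed in one time-series chart with a different color for each OP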
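
One way to collect per-OP execution times is a small timing context manager wrapped around whatever runs the op. This is only a sketch: `timed_op` and the `op_seconds` store are hypothetical names, and the op type strings are borrowed from the discussion above rather than from the coordinator's actual dispatch code.

```python
import time
from contextlib import contextmanager

# Last observed duration in seconds per op type (hypothetical store).
op_seconds = {}


@contextmanager
def timed_op(op_type, clock=time.perf_counter):
    """Record how long the wrapped block takes under its op type."""
    start = clock()
    try:
        yield
    finally:
        # Recorded even if the op raises, so failed ops are timed too.
        op_seconds[op_type] = clock() - start


# Usage: wrap the code that executes the op.
with timed_op("CREATE_GRAPH"):
    time.sleep(0.01)  # stand-in for the real work
```

The values in `op_seconds` could then be exposed as one gauge per op type, with the op type as a label.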

GraphScope Session status

For example, if a client is connected, the status is Connected https://github.com/alibaba/GraphScope/blob/main/coordinator/gscoordinator/coordinator.py#L259
If the client disconnects, the status is Closed

https://github.com/alibaba/GraphScope/blob/main/coordinator/gscoordinator/coordinator.py#L803

If a client is connected but no heartbeat has been received for some time, the status is DisConnected

https://github.com/alibaba/GraphScope/blob/main/coordinator/gscoordinator/coordinator.py#L334
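
The three states described above could be derived from connect, close, and heartbeat events roughly as follows. The `SessionTracker` class, its method names, and the 60-second timeout are assumptions made for illustration; they do not mirror the code at the linked lines.

```python
import time
from enum import Enum


class SessionState(Enum):
    CLOSED = 0
    CONNECTED = 1
    DISCONNECTED = 2


class SessionTracker:
    """Track session state from connect/close/heartbeat events."""

    def __init__(self, heartbeat_timeout=60.0, clock=time.monotonic):
        self._timeout = heartbeat_timeout  # assumed value, in seconds
        self._clock = clock
        self._connected = False
        self._last_heartbeat = None

    def on_connect(self):
        self._connected = True
        self._last_heartbeat = self._clock()

    def on_heartbeat(self):
        self._last_heartbeat = self._clock()

    def on_close(self):
        self._connected = False

    def state(self):
        if not self._connected:
            return SessionState.CLOSED
        # Connected, but the client has been silent for too long.
        if self._clock() - self._last_heartbeat > self._timeout:
            return SessionState.DISCONNECTED
        return SessionState.CONNECTED
```

Exporting `tracker.state().value` as a gauge would give Grafana a single number to map onto the status names.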

Number of completed analytical tasks
Number of completed interactive tasks

https://github.com/alibaba/GraphScope/blob/main/coordinator/gscoordinator/coordinator.py#L360 https://github.com/alibaba/GraphScope/blob/main/coordinator/gscoordinator/coordinator.py#L501

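Graph data size
Number of messages passed per iteration of analytical tasks

Leave these two metrics aside for now

Time per iteration of analytical tasks

See whether this can be shown as a bar chart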

Jul 20 '22 02:07 lidongze0629

Screenshot from 2022-07-20 06-41-38

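July 20: Coordinator monitoring metrics

Session status
A Session can be in one of three states: Connected, Closed, DisConnected.
The DisConnected state calls cleanup to clear the Session, so in the chart I collapsed the Session into two states: On/Off.
Counts of completed analytical/interactive tasks
The chart shows five values: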

Total requests = Analytical requests + Interactive requests
Analytical requests
Interactive requests
Last 5 mins analytical requests
Last 5 mins interactive requests
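
A minimal sketch of how these five values might be tracked in process, assuming each completed request calls `record()` on the matching counter. `RequestCounter` and `snapshot` are hypothetical names, and the 300-second window simply encodes "last 5 mins".

```python
import time
from collections import deque


class RequestCounter:
    """Total count plus a count over a trailing time window."""

    def __init__(self, window_seconds=300.0, clock=time.monotonic):
        self.total = 0
        self._window = window_seconds
        self._clock = clock
        self._recent = deque()  # timestamps of requests inside the window

    def record(self):
        self.total += 1
        self._recent.append(self._clock())

    def recent(self):
        # Drop timestamps that have fallen out of the window, then count.
        cutoff = self._clock() - self._window
        while self._recent and self._recent[0] < cutoff:
            self._recent.popleft()
        return len(self._recent)


analytical = RequestCounter()
interactive = RequestCounter()


def snapshot():
    """The five values shown on the dashboard panel."""
    return {
        "total_requests": analytical.total + interactive.total,
        "analytical_requests": analytical.total,
        "interactive_requests": interactive.total,
        "analytical_requests_last_5m": analytical.recent(),
        "interactive_requests_last_5m": interactive.recent(),
    }
```

With Prometheus, the two "last 5 mins" figures could alternatively be derived at query time by applying `increase()` over a `5m` range to the totals, in which case only the two counters would need to be exported.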

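Op time / graph loading time
Since graph loading consists of a series of OPs, I put them in one chart. Analytical ops and interactive ops are distinguished, and each op is drawn as a line in its own color.
Time per iteration of analytical tasks
Not done yet; I will start once the logs pass the relevant information to the coordinator.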

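Some discussion

Whether to use session_id as a metric label
Using session_id as a label would make it possible to tell which session issued each op. But since the coordinator only allows one session connection at a time, I think labeling with session_id is pointless and just redundant.
Choice of metric type for op time
For monitoring op time, I chose a Gauge rather than a Summary/Histogram.

For this kind of timing, the standard approach is usually a Summary/Histogram; for example, the request time of a web endpoint is typically measured with these two metric types. But both are aggregate statistics, suited to calculations such as the average request time over the last x minutes.

Given that the same op in GraphScope can take vastly different amounts of time on different graphs, these should be treated as different operations, so I chose a Gauge to record the instantaneous value of the op time.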
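
To make the trade-off concrete, here is a toy model of the two metric types. These classes are simplified stand-ins written for this illustration, not the Prometheus client library: the gauge keeps only the latest value, while the histogram folds every observation into bucket counts, a sum, and a count.

```python
import bisect


class Gauge:
    """Holds only the most recent value."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Accumulates observations into buckets plus a running sum and count."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)  # last is +Inf
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Index of the first bucket whose upper bound is >= value.
        idx = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[idx] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Bucket counts in cumulative form: each bucket includes smaller ones."""
        total, out = 0, []
        for c in self.bucket_counts:
            total += c
            out.append(total)
        return out


gauge, hist = Gauge(), Histogram([1.0, 10.0])
for seconds in (0.5, 8.0, 120.0):  # same op type on very different graphs
    gauge.set(seconds)
    hist.observe(seconds)
```

After the loop the gauge reports only the last run (120.0 seconds), whereas the histogram can answer distribution questions but no longer knows which run took how long.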

These two points are my own thinking. I am not sure whether they match the actual requirements, and they can be changed if needed. @lidongze0629

Jul 20 '22 11:07 VincentFF

@VincentFF

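session id does not need to be a label
As for the choice of metric for op time, it would be best to draw the different ops on a single line (i.e., with the same vertical coordinate, distinguished by color), because ops execute serially. The effect is shown below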

Jul 21 '22 13:07 lidongze0629

@VincentFF

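session id does not need to be a label

As for the choice of metric for op time, it would be best to draw the different ops on a single line (i.e., with the same vertical coordinate, distinguished by color), because ops execute serially. The effect is shown below

@lidongze0629 I gave it a try, and this kind of chart is not achievable for now.
The reason is:
In Prometheus, the timestamp is not the time at which I write the metrics to the web endpoint; it is the time at which Prometheus scrapes the metrics and writes them. I compute the required metrics and expose them on the web endpoint; Prometheus scrapes that endpoint at a configured interval (one batch per scrape), writes the data into its own database, and assigns the timestamp of the write. So if one algorithm executes several ops, Prometheus scrapes the data of all those ops at once into its time-series database, and their timestamps are almost identical.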
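
The effect described above can be reproduced with a toy model of the pull flow. Nothing here talks to a real Prometheus server; `run_op`, `scrape`, and the numbers are made up purely to show why ops that ran at different moments end up with the same timestamp.

```python
# Toy model: the target only holds current values, and the scraper stamps
# every sample it collects with the time of the scrape.
current = {}  # op name -> last duration in seconds


def run_op(name, seconds):
    """The application finishes an op and updates its exposed value."""
    current[name] = seconds


def scrape(scrape_time, stored):
    """Copy everything currently exposed into storage under one timestamp."""
    for name, value in current.items():
        stored.append((scrape_time, name, value))


tsdb = []
# Three ops run one after another, some time before the scrape ...
run_op("CREATE_GRAPH", 4.2)
run_op("ADD_LABELS", 0.7)
run_op("SUBGRAPH", 1.3)
# ... but a single scrape at t=15 records all of them together.
scrape(15, tsdb)
```

Every sample in `tsdb` carries timestamp 15, so the order and spacing of the ops are lost unless the application exports its own start and end times as values.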

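My current proposal is shown in the figure below: Screenshot from 2022-07-22 15-35-26

The left panel shows the running time of each op in the most recent algorithm run and how they compare; the right panel shows the historical trend of each op's time (click an op to view it on its own).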

Jul 22 '22 07:07 VincentFF

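GraphScope GIE service metrics

Service QPS (queries per second)
Number of successful/failed requests
Service failure rate

@VincentFF For each gremlin query, the store side can print one log record. To visualize the metrics above, could you define the data format?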

Aug 16 '22 03:08 lidongze0629

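GraphScope GIE service metrics

Service QPS (queries per second)
Number of successful/failed requests
Service failure rate

@VincentFF For each gremlin query, the store side can print one log record. To visualize the metrics above, could you define the data format?

Log format: Query Report: JSON_STRING

# JSON_STRING is defined as follows
{
  "query": "g.V().count()",
  "success": true, # true/false
  "execTime": 30,  # in milliseconds
  "timestamp": "xxxxxx"  # unix timestamp
}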
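
An exporter could pick such records out of a log stream along these lines. This is a sketch that assumes the `Query Report: ` prefix and a single-line JSON payload without the explanatory `#` comments; the sample line and its timestamp value are invented for the example.

```python
import json

PREFIX = "Query Report: "


def parse_query_report(line):
    """Return the report as a dict, or None if the line is not a report."""
    start = line.find(PREFIX)
    if start == -1:
        return None
    try:
        report = json.loads(line[start + len(PREFIX):])
    except json.JSONDecodeError:
        return None
    return report if isinstance(report, dict) else None


# Hypothetical sample line; the timestamp value is made up.
sample = (
    'Query Report: {"query": "g.V().count()", "success": true, '
    '"execTime": 30, "timestamp": "1661415726804"}'
)
report = parse_query_report(sample)
```

Lines that do not carry the prefix or whose payload is not valid JSON are skipped instead of raising, so ordinary log output can be mixed into the same file.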

Aug 16 '22 08:08 lidongze0629

Steps for viewing the store logs:

Pull the image: registry.cn-hongkong.aliyuncs.com/graphscope/graphscope-vineyard:v0.6.0
Start the container (your code repository must be at ${HOME}/GraphScope):

docker run --shm-size 102400m --name summercode -it -v ${HOME}/GraphScope:/work registry.cn-hongkong.aliyuncs.com/graphscope/graphscope-vineyard:v0.6.0 /bin/bash

Build: inside the container, run cd /work && make install
Run GIE gremlin queries: start ipython under /work/python, then enter the following in the interpreter:

import graphscope
graphscope.set_option(show_log=True)
sess = graphscope.session(cluster_type="hosts")
from graphscope.dataset import load_modern_graph
g = load_modern_graph(sess)
interactive = sess.gremlin(g)
interactive.execute("g.V().count()").all()
interactive.execute("g.V(1)").all()
interactive.execute("g.V(1).valueMap()").all()

Log location: /var/log/graphscope/1910021072443001/frontend/metric.log

[graphscope@7114d7774e6e frontend]$ cat metric.log
1 | g.V().limit(1) | true | 707.33826 | 1661415705968
2 | g.V().count() | true | 19.30906 | 1661415726804
3 | g.V().count() | true | 6.834841 | 1661415733562
4 | g.V(1) | true | 13.802731 | 1661415743342
5 | g.V(1).valueMap() | true | 14.364406 | 1661415747511
6 | g.V(1).valueMap() | true | 6.674943 | 1661415823324

Aug 25 '22 08:08 lidongze0629

GraphScope GIE service metrics
Split into two parts:

Total Query: the overall data
API Query: detailed data for each query interface @lidongze0629

99E006DA-154E-45a8-8A9D-232A027ADE09

Sep 12 '22 10:09 VincentFF

Has this task been completed? We currently run our queries on GIE, and I want to build a GIE monitoring platform. Is there any information available, such as:

1. The exposed ip:port/metrics URL
2. The Grafana dashboard JSON file

Dec 01 '23 06:12 JackyYangPassion

Found the README: https://github.com/alibaba/GraphScope/tree/main/k8s/prometheus

Dec 01 '23 06:12 JackyYangPassion
