[BUG] With a multi-node DeepFlow server, load becomes unbalanced after one node fails or restarts and then recovers

Open monokoo opened this issue 11 months ago • 11 comments

Search before asking

  • [X] I had searched in the issues and found no similar feature requirement.

DeepFlow Component

Server

What you expected to happen

With a multi-node DeepFlow server, after one node fails or restarts and then recovers, the agents should be rebalanced across the nodes.

How to reproduce

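In a multi-node server environment, restart one of the servers: all agents register with the other surviving server pods. After the failed server node comes back up, however, the agents are not rebalanced and re-registered, so all traffic stays concentrated on the other servers. Their resource consumption becomes excessive, which leads to a vicious cycle of repeated restarts. On the server side, memory is consumed continuously until the pod is OOM-killed.

analyzer table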

id state ha_state name description ip nat_ip agg cpu_num memory_size arch os kernel_version tsdb_shard_id tsdb_replica_ip tsdb_data_mount_path pcap_data_mount_path vtap_max synced_at nat_ip_enabled pod_ip pod_name ca_md5 lcuuid
59 2 1 cn-hangzhou.10.183.35.79 10.183.35.79 1 16 66161647616 x86_64 alpine 3.15.0 4.19.91 null null null 200 2024-03-04 18:33:10 0 10.183.62.96 deepflow-server-68b4b6878c-smrrs 960b90e29e4f04af5e9d9608e7a50df6 8788cf15-0759-4437-8604-24469e9d3ffa
61 2 1 cn-hangzhou.10.183.35.92 10.183.35.92 1 8 66163224576 x86_64 alpine 3.15.0 4.19.91 null null null 200 2024-03-04 18:33:08 0 10.183.62.99 deepflow-server-7b9db96bd9-24tdh 960b90e29e4f04af5e9d9608e7a50df6 957e3ab4-8853-4a30-8b3e-4c9c611a7b4a
62 2 1 cn-hangzhou.10.183.35.80 10.183.35.80 1 16 66161651712 x86_64 alpine 3.15.0 4.19.91 null null null 200 2024-03-04 18:33:29 0 10.183.62.95 deepflow-server-697c56d794-6wkdr 960b90e29e4f04af5e9d9608e7a50df6 994e801c-d23e-416d-9c94-01dd8a54619b

controller table

id state name description ip nat_ip cpu_num memory_size arch os kernel_version vtap_max synced_at nat_ip_enabled node_type region_domain_prefix node_name pod_ip pod_name ca_md5 lcuuid
59 2 cn-hangzhou.10.183.35.79 10.183.35.79 16 66161647616 x86_64 alpine 3.15.0 4.19.91 2000 2024-02-20 10:49:05 0 1 cn-hangzhou.10.183.35.79 10.183.62.96 deepflow-server-c769b5ddd-45mkh 960b90e29e4f04af5e9d9608e7a50df6 332e932f-5414-4294-847f-70a27f97b7b8
61 2 cn-hangzhou.10.183.35.92 10.183.35.92 8 66163224576 x86_64 alpine 3.15.0 4.19.91 2000 2024-02-28 14:50:34 0 1 cn-hangzhou.10.183.35.92 10.183.62.99 deepflow-server-c769b5ddd-zh5j5 960b90e29e4f04af5e9d9608e7a50df6 799d5ddc-8482-4d4e-b5c9-b240ed465a5d
62 2 cn-hangzhou.10.183.35.80 10.183.35.80 16 66161651712 x86_64 alpine 3.15.0 4.19.91 2000 2024-03-01 13:11:19 0 1 cn-hangzhou.10.183.35.80 10.183.62.95 deepflow-server-c769b5ddd-d7k9j 960b90e29e4f04af5e9d9608e7a50df6 52d61376-eabd-4bf1-9716-81f4f1a8bdc8

image image image

DeepFlow version

Built from the latest code on the v6.4 branch

DeepFlow agent list

No response

Kubernetes CNI

No response

Operation-System/Kernel version

No response

Anything else

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

monokoo avatar Mar 05 '24 02:03 monokoo

Please check the value of algorithm in the Server ConfigMap. Also provide the Grafana deepflow-server logs, filtered by: "need rebalance vtap for analyzer"

roryye avatar Mar 05 '24 04:03 roryye

algorithm: the default, by-ingested-data, is in use. image

2024-03-05 12:58:08.912 [INFO] [service.rebalance] traffic.go:539 agent(cn-hangzhou.10.183.35.14-V28) register counter: {"Weight":0.93,"IsAnalyzerChanged":0,"Name":"cn-hangzhou.10.183.35.14-V28"}
2024-03-05 12:58:08.912 [INFO] [service.rebalance] traffic.go:539 agent(cn-hangzhou.10.183.33.213-V32) register counter: {"Weight":1.19,"IsAnalyzerChanged":0,"Name":"cn-hangzhou.10.183.33.213-V32"}
2024-03-05 12:58:08.912 [INFO] [service.rebalance] traffic.go:539 agent(cn-hangzhou.10.183.34.77-V23) register counter: {"Weight":0,"IsAnalyzerChanged":0,"Name":"cn-hangzhou.10.183.34.77-V23"}
2024-03-05 12:58:08.912 [INFO] [service.rebalance] traffic.go:539 agent(cn-hangzhou.10.183.32.9-V18) register counter: {"Weight":0,"IsAnalyzerChanged":0,"Name":"cn-hangzhou.10.183.32.9-V18"}
2024-03-05 12:58:08.912 [INFO] [service.rebalance] traffic.go:539 agent(cn-hangzhou.10.183.35.12-V30) register counter: {"Weight":1.24,"IsAnalyzerChanged":0,"Name":"cn-hangzhou.10.183.35.12-V30"}
2024-03-05 12:58:08.912 [INFO] [service.rebalance] traffic.go:539 agent(cn-hangzhou.10.183.34.232-V10) register counter: {"Weight":1.19,"IsAnalyzerChanged":0,"Name":"cn-hangzhou.10.183.34.232-V10"}
2024-03-05 12:58:08.912 [INFO] [service.rebalance] traffic.go:539 agent(cn-hangzhou.10.183.34.28-V40) register counter: {"Weight":0.94,"IsAnalyzerChanged":0,"Name":"cn-hangzhou.10.183.34.28-V40"}
2024-03-05 12:58:08.912 [INFO] [service.rebalance] traffic.go:539 agent(cn-hangzhou.10.183.34.187-V16) register counter: {"Weight":1.51,"IsAnalyzerChanged":0,"Name":"cn-hangzhou.10.183.34.187-V16"}
2024-03-05 12:58:08.912 [INFO] [service.rebalance] traffic.go:539 agent(cn-hangzhou.10.183.33.212-V38) register counter: {"Weight":1.38,"IsAnalyzerChanged":0,"Name":"cn-hangzhou.10.183.33.212-V38"}
2024-03-05 12:58:08.912 [INFO] [service.rebalance] traffic.go:131 vtap rebalance result switch_total_num(2)
2024-03-05 12:58:08.912 [INFO] [service.rebalance] traffic.go:133 vtap rebalance result az(075157f4-b9a6-54e1-9687-f430f91dd4ff) ip(10.183.35.79) state(2) before_vtap_num(2) after_vtap_num(1), switch_vtap_num(1) before_vtap_weight(1.05) after_vtap_weight(0.96)
2024-03-05 12:58:08.912 [INFO] [service.rebalance] traffic.go:133 vtap rebalance result az(075157f4-b9a6-54e1-9687-f430f91dd4ff) ip(10.183.35.92) state(2) before_vtap_num(3) after_vtap_num(4), switch_vtap_num(1) before_vtap_weight(0.99) after_vtap_weight(1.05)
2024-03-05 12:58:08.912 [INFO] [service.rebalance] traffic.go:133 vtap rebalance result az(075157f4-b9a6-54e1-9687-f430f91dd4ff) ip(10.183.35.80) state(2) before_vtap_num(3) after_vtap_num(3), switch_vtap_num(0) before_vtap_weight(0.96) after_vtap_weight(0.99)
2024-03-05 12:58:08.912 [INFO] [service.rebalance] traffic.go:133 vtap rebalance result az(ba068715-5521-56c2-be22-2fbb09fe54ff) ip(10.183.35.79) state(2) before_vtap_num(7) after_vtap_num(7), switch_vtap_num(0) before_vtap_weight(1.02) after_vtap_weight(1.01)
2024-03-05 12:58:08.912 [INFO] [service.rebalance] traffic.go:133 vtap rebalance result az(ba068715-5521-56c2-be22-2fbb09fe54ff) ip(10.183.35.92) state(2) before_vtap_num(10) after_vtap_num(10), switch_vtap_num(0) before_vtap_weight(0.99) after_vtap_weight(1.01)
2024-03-05 12:58:08.912 [INFO] [service.rebalance] traffic.go:133 vtap rebalance result az(ba068715-5521-56c2-be22-2fbb09fe54ff) ip(10.183.35.80) state(2) before_vtap_num(14) after_vtap_num(14), switch_vtap_num(0) before_vtap_weight(0.99) after_vtap_weight(0.98)
2024-03-05 12:58:08.913 [INFO] [monitor/vtap] rebalance.go:145 need rebalance, total switch vtap num(2)
2024-03-05 12:58:08.922 [INFO] [service.rebalance] traffic.go:123 az(075157f4-b9a6-54e1-9687-f430f91dd4ff) vtap(77) analyzer ip changed: 10.183.35.79 -> 10.183.35.92
2024-03-05 12:58:08.922 [INFO] [service.rebalance] traffic.go:545 agent(cn-hangzhou.10.183.32.35-V6) update weight: 0.29 -> 0.29
2024-03-05 12:58:08.922 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.32.35-V6) update is_analyzer_changed: 0 -> 0
2024-03-05 12:58:08.922 [INFO] [service.rebalance] traffic.go:545 agent(cn-hangzhou.10.183.35.92-V47) update weight: 0.24 -> 0.24
2024-03-05 12:58:08.922 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.35.92-V47) update is_analyzer_changed: 1 -> 1
2024-03-05 12:58:08.922 [INFO] [service.rebalance] traffic.go:545 agent(cn-hangzhou.10.183.34.94-V4) update weight: 1.87 -> 1.87
2024-03-05 12:58:08.922 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.34.94-V4) update is_analyzer_changed: 0 -> 0
2024-03-05 12:58:08.922 [INFO] [service.rebalance] traffic.go:545 agent(cn-hangzhou.10.183.35.80-V1) update weight: 0.39 -> 0.39
2024-03-05 12:58:08.922 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.35.80-V1) update is_analyzer_changed: 0 -> 0
2024-03-05 12:58:08.922 [INFO] [service.rebalance] traffic.go:545 agent(cn-hangzhou.10.183.35.79-V8) update weight: 0.35 -> 0.35
2024-03-05 12:58:08.922 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.35.79-V8) update is_analyzer_changed: 0 -> 0
2024-03-05 12:58:08.922 [INFO] [service.rebalance] traffic.go:545 agent(cn-hangzhou.10.183.32.46-V9) update weight: 1.57 -> 1.57
2024-03-05 12:58:08.922 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.32.46-V9) update is_analyzer_changed: 0 -> 0
2024-03-05 12:58:08.923 [INFO] [service.rebalance] traffic.go:545 agent(cn-hangzhou.10.183.34.145-V3) update weight: 2.58 -> 2.58
2024-03-05 12:58:08.923 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.34.145-V3) update is_analyzer_changed: 0 -> 0
2024-03-05 12:58:08.923 [INFO] [service.rebalance] traffic.go:545 agent(cn-hangzhou.10.183.32.6-V5) update weight: 0.72 -> 0.72
2024-03-05 12:58:08.923 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.32.6-V5) update is_analyzer_changed: 0 -> 0
2024-03-05 12:58:08.927 [INFO] [ckwriter] ckwriter.go:127 New CK writer: Addrs=[10.183.48.49:9000], user=deepflow_rw, database=flow_tag, table=flow_log_custom_field_local, queueCount=10, queueSize=300000, batchSize=128000, flushTimeout=10s, counterName=l7_log-flow_log_custom_field-4, timeZone=Asia/Shanghai

2024-03-05 12:58:08.940 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.34.201-V24) update is_analyzer_changed: 0 -> 0
2024-03-05 12:58:08.940 [INFO] [service.rebalance] traffic.go:545 agent(cn-hangzhou.10.183.34.232-V10) update weight: 1.19 -> 1.19
2024-03-05 12:58:08.940 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.34.232-V10) update is_analyzer_changed: 0 -> 0
2024-03-05 12:58:08.940 [INFO] [service.rebalance] traffic.go:545 agent(cn-hangzhou.10.183.33.216-V41) update weight: 0.86 -> 0.86
2024-03-05 12:58:08.940 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.33.216-V41) update is_analyzer_changed: 0 -> 0
2024-03-05 12:58:08.940 [INFO] [service.rebalance] traffic.go:545 agent(cn-hangzhou.10.183.33.215-V33) update weight: 1.07 -> 1.07
2024-03-05 12:58:08.940 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.33.215-V33) update is_analyzer_changed: 0 -> 0
2024-03-05 12:58:08.940 [INFO] [service.rebalance] traffic.go:545 agent(cn-hangzhou.10.183.34.29-V14) update weight: 1.43 -> 1.43
2024-03-05 12:58:08.940 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.34.29-V14) update is_analyzer_changed: 0 -> 0
2024-03-05 12:58:08.940 [INFO] [service.rebalance] traffic.go:545 agent(cn-hangzhou.10.183.35.12-V30) update weight: 1.24 -> 1.24
2024-03-05 12:58:08.940 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.35.12-V30) update is_analyzer_changed: 0 -> 0
2024-03-05 12:58:08.940 [INFO] [service.rebalance] traffic.go:545 agent(cn-hangzhou.10.183.34.28-V40) update weight: 0.94 -> 0.94
2024-03-05 12:58:08.940 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.34.28-V40) update is_analyzer_changed: 0 -> 0
2024-03-05 12:58:08.940 [INFO] [service.rebalance] traffic.go:545 agent(cn-hangzhou.10.183.34.23-V17) update weight: 1.14 -> 1.14
2024-03-05 12:58:08.940 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.34.23-V17) update is_analyzer_changed: 0 -> 0
2024-03-05 12:58:08.940 [INFO] [service.rebalance] traffic.go:545 agent(cn-hangzhou.10.183.33.208-V12) update weight: 1.47 -> 1.47
2024-03-05 12:58:08.940 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.33.208-V12) update is_analyzer_changed: 0 -> 0
2024-03-05 12:58:08.941 [INFO] [service.rebalance] traffic.go:545 agent(cn-hangzhou.10.183.34.70-V34) update weight: 0 -> 0
2024-03-05 12:58:08.941 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.34.70-V34) update is_analyzer_changed: 0 -> 0
2024-03-05 12:58:08.941 [INFO] [service.rebalance] traffic.go:545 agent(cn-hangzhou.10.183.35.17-V35) update weight: 0.66 -> 0.66
2024-03-05 12:58:08.941 [INFO] [service.rebalance] traffic.go:546 agent(cn-hangzhou.10.183.35.17-V35) update is_analyzer_changed: 0 -> 0
2024-03-05 12:58:08.941 [INFO] [service.rebalance] traffic.go:131 vtap rebalance result switch_total_num(2)
2024-03-05 12:58:08.941 [INFO] [service.rebalance] traffic.go:133 vtap rebalance result az(075157f4-b9a6-54e1-9687-f430f91dd4ff) ip(10.183.35.79) state(2) before_vtap_num(2) after_vtap_num(1), switch_vtap_num(1) before_vtap_weight(1.05) after_vtap_weight(0.96)
2024-03-05 12:58:08.941 [INFO] [service.rebalance] traffic.go:133 vtap rebalance result az(075157f4-b9a6-54e1-9687-f430f91dd4ff) ip(10.183.35.92) state(2) before_vtap_num(3) after_vtap_num(4), switch_vtap_num(1) before_vtap_weight(0.99) after_vtap_weight(1.05)
2024-03-05 12:58:08.941 [INFO] [service.rebalance] traffic.go:133 vtap rebalance result az(075157f4-b9a6-54e1-9687-f430f91dd4ff) ip(10.183.35.80) state(2) before_vtap_num(3) after_vtap_num(3), switch_vtap_num(0) before_vtap_weight(0.96) after_vtap_weight(0.99)
2024-03-05 12:58:08.941 [INFO] [service.rebalance] traffic.go:133 vtap rebalance result az(ba068715-5521-56c2-be22-2fbb09fe54ff) ip(10.183.35.79) state(2) before_vtap_num(7) after_vtap_num(7), switch_vtap_num(0) before_vtap_weight(1.02) after_vtap_weight(1.01)
2024-03-05 12:58:08.941 [INFO] [service.rebalance] traffic.go:133 vtap rebalance result az(ba068715-5521-56c2-be22-2fbb09fe54ff) ip(10.183.35.92) state(2) before_vtap_num(10) after_vtap_num(10), switch_vtap_num(0) before_vtap_weight(0.99) after_vtap_weight(1.01)
2024-03-05 12:58:08.941 [INFO] [service.rebalance] traffic.go:133 vtap rebalance result az(ba068715-5521-56c2-be22-2fbb09fe54ff) ip(10.183.35.80) state(2) before_vtap_num(14) after_vtap_num(14), switch_vtap_num(0) before_vtap_weight(0.99) after_vtap_weight(0.98)
2024-03-05 12:58:08.941 [ERRO] [monitor/vtap] rebalance.go:147 fail to rebalance analyzer by data(if check: false):
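For anyone comparing several rebalance runs, the "vtap rebalance result" lines above can be summarised with a small parser. This is only a sketch: the pattern is derived from the log lines in this comment, the Result type and parseResult function are hypothetical names (not part of DeepFlow), and it assumes each log entry sits on one line without terminal wrapping.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// resultRe matches the fields of a "vtap rebalance result" line as it
// appears in the traffic.go:133 log output quoted in this issue.
var resultRe = regexp.MustCompile(
	`az\(([^)]+)\) ip\(([^)]+)\) state\((\d+)\) ` +
		`before_vtap_num\((\d+)\) after_vtap_num\((\d+)\), switch_vtap_num\((\d+)\) ` +
		`before_vtap_weight\(([\d.]+)\) after_vtap_weight\(([\d.]+)\)`)

// Result is a hypothetical holder for one parsed line.
type Result struct {
	AZ, IP                         string
	BeforeNum, AfterNum, SwitchNum int
	BeforeWeight, AfterWeight      float64
}

// parseResult extracts the per-data-node summary from one log line.
// It returns false when the line is not a rebalance result line.
func parseResult(line string) (Result, bool) {
	m := resultRe.FindStringSubmatch(line)
	if m == nil {
		return Result{}, false
	}
	// Conversion errors are ignored in this sketch; a malformed number becomes 0.
	atoi := func(s string) int { n, _ := strconv.Atoi(s); return n }
	atof := func(s string) float64 { f, _ := strconv.ParseFloat(s, 64); return f }
	return Result{
		AZ: m[1], IP: m[2],
		BeforeNum: atoi(m[4]), AfterNum: atoi(m[5]), SwitchNum: atoi(m[6]),
		BeforeWeight: atof(m[7]), AfterWeight: atof(m[8]),
	}, true
}

func main() {
	line := "traffic.go:133 vtap rebalance result az(075157f4-b9a6-54e1-9687-f430f91dd4ff) " +
		"ip(10.183.35.92) state(2) before_vtap_num(3) after_vtap_num(4), switch_vtap_num(1) " +
		"before_vtap_weight(0.99) after_vtap_weight(1.05)"
	if r, ok := parseResult(line); ok {
		fmt.Printf("%s: agents %d -> %d, weight %.2f -> %.2f\n",
			r.IP, r.BeforeNum, r.AfterNum, r.BeforeWeight, r.AfterWeight)
	}
}
```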

monokoo avatar Mar 05 '24 05:03 monokoo

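Looking at the 12:58 logs, agents have been assigned to 10.183.35.92 (the server restarted this morning), and the agent weight of every data node looks close to 1, which is fairly balanced. Has it recovered now?

Could you share the traffic.go logs covering the period from this morning's restart until the other servers crashed?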

roryye avatar Mar 05 '24 06:03 roryye

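Judging by the logs, this rebalance action is rarely triggered. The deepflow-ctl agent rebalance command cannot trigger a manual rebalance either: after running it, nothing happens on the server side.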

monokoo avatar Mar 05 '24 06:03 monokoo

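The default interval for scheduled rebalancing is 1h; the corresponding setting is rebalance-interval.

If the deepflow-ctl agent rebalance -t analyzer command prints "no balance required", the agents are already evenly distributed across the data nodes and no rebalance is needed; otherwise it prints the operation log from before and after the rebalance.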

roryye avatar Mar 05 '24 07:03 roryye

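rebalance.go:147 fail to rebalance analyzer by data(if check: false): Doesn't this log line say that the rebalance failed? Traffic is currently unevenly distributed across the nodes; the server detected that a rebalance was needed, but it failed.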

image

monokoo avatar Mar 05 '24 08:03 monokoo

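That log line is wrong: the err check was forgotten. A PR is welcome.

The log output above shows that the rebalance has already completed. Is traffic across the data nodes still unbalanced after that point in time?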
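As an illustration of the missing err check, here is a minimal Go sketch. It is not the actual rebalance.go code: rebalanceByData and runRebalance are hypothetical stand-ins, and the success message is invented for the example. The point is only that the "fail to rebalance" line should be emitted when the returned error is non-nil, not unconditionally.

```go
package main

import (
	"errors"
	"fmt"
)

// rebalanceByData is a hypothetical stand-in for the traffic-based
// rebalance call. It returns an error only when the rebalance really fails.
func rebalanceByData(fail bool) error {
	if fail {
		return errors.New("query ingested data: timeout")
	}
	return nil
}

// runRebalance returns the log line a caller would emit. The error-level
// message is produced only when err != nil, which is the check the
// maintainer says was forgotten.
func runRebalance(ifCheck, fail bool) string {
	if err := rebalanceByData(fail); err != nil {
		return fmt.Sprintf("[ERRO] fail to rebalance analyzer by data(if check: %t): %v", ifCheck, err)
	}
	return fmt.Sprintf("[INFO] rebalance analyzer by data finished(if check: %t)", ifCheck)
}

func main() {
	fmt.Println(runRebalance(false, false)) // success path: no error line
	fmt.Println(runRebalance(false, true))  // failure path: error line with cause
}
```

With a check like this, an empty error after the trailing colon (as in the rebalance.go:147 line quoted earlier) would no longer appear for a successful run.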

roryye avatar Mar 05 '24 09:03 roryye

Yes, traffic across the data nodes is still unbalanced after that point in time, as shown below: image image

monokoo avatar Mar 05 '24 10:03 monokoo

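Setting algorithm to by-ingested-data distributes agents according to each agent's traffic, not according to the number of agents. To distribute by count, set it to by-agent-count.

Per-agent ingested traffic statistics can be viewed as shown below: image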
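The difference between the two settings can be sketched in Go. This is a simplified illustration, not DeepFlow's implementation: the Agent and Load types, the loadPerAnalyzer function, and the sample IPs and weights are all made up for the example. It shows why an assignment can be balanced by traffic weight while the agent counts per data node differ, as in the log earlier in this thread where data nodes hold 7, 10, and 14 agents with weights all close to 1.

```go
package main

import "fmt"

// Agent is a hypothetical, simplified view of an agent (vtap).
type Agent struct {
	Name     string
	Analyzer string  // IP of the data node the agent sends to
	Weight   float64 // relative ingested traffic, like the "Weight" log field
}

// Load summarises one data node.
type Load struct {
	Count  int     // what by-agent-count would try to even out
	Weight float64 // what by-ingested-data would try to even out
}

// loadPerAnalyzer sums agent count and traffic weight per data node.
func loadPerAnalyzer(agents []Agent) map[string]Load {
	out := make(map[string]Load)
	for _, a := range agents {
		l := out[a.Analyzer]
		l.Count++
		l.Weight += a.Weight
		out[a.Analyzer] = l
	}
	return out
}

func main() {
	// Illustrative values only: one heavy agent versus two light ones.
	agents := []Agent{
		{Name: "a1", Analyzer: "10.0.0.1", Weight: 2.0},
		{Name: "a2", Analyzer: "10.0.0.2", Weight: 1.0},
		{Name: "a3", Analyzer: "10.0.0.2", Weight: 1.0},
	}
	for ip, l := range loadPerAnalyzer(agents) {
		fmt.Printf("%s agents=%d weight=%.2f\n", ip, l.Count, l.Weight)
	}
}
```

In this sample the two data nodes carry equal weight but hold one and two agents respectively, so the assignment is even from the by-ingested-data point of view and uneven from the by-agent-count point of view.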

roryye avatar Mar 07 '24 02:03 roryye

After the defect in https://github.com/deepflowio/deepflow/issues/5738 was fixed, load balancing is now working normally as well.

monokoo avatar Mar 20 '24 02:03 monokoo

@roryye A new node was added to the server, but the load was not distributed the way the logs indicated it would be. image image

monokoo avatar Mar 21 '24 01:03 monokoo