[BUG] Server start too slow

Open jiekun opened this issue 2 years ago • 9 comments

Search before asking

  • [X] I have searched the issues and found no similar feature requirement.

DeepFlow Component

Server

What you expected to happen

The DeepFlow server runs a lot of preparation logic at startup. It takes several minutes before it can actually process messages from agents.

We have 20+ server pods. Whenever they restart (reboot), crash, or go through a rolling update, there is a long period during which "no one is actually working":

  1. The health check turns green as soon as the querier is ready. It returns 200, so the Grafana dashboard is available, but the server itself is not yet ready to process new messages.
  2. Once the health check returns 200, Kubernetes considers the pod ready and destroys the previous pod. Another new pod is then scheduled and starts trying to become ready.
  3. So within a minute or two, all pods have been restarted and report 200, but none of them is actually working.

Essentially we expect:

  1. The server should be ready within a minute.
  2. The server should not report itself healthy before everything is ready.
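
To make the second expectation concrete, here is a minimal sketch of a readiness gate that answers 200 only after every component has reported ready. It is not DeepFlow's actual implementation: the `Readiness` class and the component names passed to it are assumptions chosen for illustration.

```python
import threading


class Readiness:
    """Hypothetical aggregator: one readiness flag per server component."""

    def __init__(self, components):
        self._lock = threading.Lock()
        self._ready = {name: False for name in components}

    def mark_ready(self, name):
        # Each component calls this once its own startup work is finished.
        with self._lock:
            if name not in self._ready:
                raise KeyError(f"unknown component: {name}")
            self._ready[name] = True

    def status_code(self):
        # Status a readiness probe would see: 200 only if ALL components are ready.
        with self._lock:
            return 200 if all(self._ready.values()) else 503


# Component names are illustrative assumptions.
readiness = Readiness(["querier", "ingester", "controller"])
readiness.mark_ready("querier")
before = readiness.status_code()  # the querier alone is not enough
readiness.mark_ready("ingester")
readiness.mark_ready("controller")
after = readiness.status_code()
```

With a gate like this, Kubernetes would keep the previous pod alive until the new one can really process agent messages, instead of replacing it as soon as the querier answers.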

How to reproduce

Restart the DeepFlow server.

DeepFlow version

Any

DeepFlow agent list

Not relevant

Kubernetes CNI

No response

Operation-System/Kernel version

Not relevant

Anything else

Not relevant

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

jiekun avatar Sep 23 '23 02:09 jiekun

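@lzf575 We may need to consider hooking the ingester into the health check endpoint as well, so that the health check returns 200 only after the ingester is ready.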

Nick-0314 avatar Sep 26 '23 02:09 Nick-0314

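Actually, the main question is how to reduce the time startup takes. I'm not yet familiar with the process (leader election, executing SQL, and so on), but these steps seem to take far too long.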

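Consider the following scenario:

  1. There are 10 server pods;
  2. A release contains an incompatible SQL change.

During the rolling upgrade:

  1. server_1 is brought up and executes the SQL, changing the table schema;
  2. server_2 ~ server_10 can no longer process anything because the tables are incompatible;
  3. Each server takes roughly 3-5 min to roll, so:
    • Updating the whole Deployment takes 3 min * 10 = 30 min;
    • After the first 3 min, only 1 server is usable, and the other 9 are still waiting to be updated;
    • Over the remaining 27 minutes, the number of usable servers gradually grows until everything has recovered.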

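I'm not giving this example to suggest that forward-incompatible SQL changes will actually happen; I trust developers are careful when altering tables. But an overly long startup time will always cause other problems of this kind, so it is a latent risk.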
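
The arithmetic in the scenario above can be sketched as follows. The sketch assumes pods are replaced strictly one at a time and uses the lower bound of 3 minutes per pod; the function name is hypothetical.

```python
def rolling_update_timeline(replicas, minutes_per_pod):
    """Return (elapsed_minutes, updated_pods, pending_pods) after each step.

    Assumes pods are updated strictly one after another.
    """
    timeline = []
    for updated in range(1, replicas + 1):
        elapsed = updated * minutes_per_pod
        timeline.append((elapsed, updated, replicas - updated))
    return timeline


timeline = rolling_update_timeline(replicas=10, minutes_per_pod=3)
first_step = timeline[0]        # state after the first pod has been replaced
total_minutes = timeline[-1][0]  # time until the last pod has been replaced
```

With 5 minutes per pod instead of 3, the same calculation gives 50 minutes for the full Deployment.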

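The purpose of this issue is to discuss the startup process and work out whether anything in it can be optimized, so that the maintainers or the community can pick it up.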

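Some additional background: the server currently has no way to adjust its configuration dynamically (a configuration center or deepflow-ctl are both possible future solutions). If an exporter's configuration needs to change (for example, manually suppressing the export of some protocol data during peak hours), we have to modify the configMap and run kubectl rollout restart. Frequent restarts in production are fairly risky and hard to control, and the health check does not reflect the real availability state, which is why this issue was opened.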

jiekun avatar Sep 26 '23 02:09 jiekun

OK. Also, I personally recommend updating with Recreate directly rather than using a rolling update; that will be faster:

spec:
  strategy:
    type: Recreate

@lzf575 The ingester seems to start rather slowly. Is there anything that can be optimized?

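@jiekun There is one place in the server that may need optimizing. We call this feature k8s leader-election: the server needs to elect a leader to handle certain tasks such as database operations. Currently, every time the server is recreated, the new server needs 60 seconds before the election succeeds. Those 60 seconds are mainly spent confirming that the previous leader has completely stopped renewing its leadership, that is, that the previous leader is really dead. I haven't yet thought of a good way to optimize this.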
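
A toy model of why that wait exists, under the assumption that leadership is held as a lease that must be renewed within 60 seconds. The `Lease` class and the server names are hypothetical and are not the Kubernetes or DeepFlow implementation.

```python
class Lease:
    """Toy lease: the holder must renew within `duration` seconds or lose it."""

    def __init__(self, duration):
        self.duration = duration
        self.holder = None
        self.renewed_at = None

    def try_acquire(self, candidate, now):
        # The lease is free if nobody holds it or the last renewal is too old.
        expired = self.holder is None or now - self.renewed_at >= self.duration
        if expired or self.holder == candidate:
            self.holder = candidate
            self.renewed_at = now
            return True
        return False


lease = Lease(duration=60)
lease.try_acquire("server-old", now=0)  # old leader takes the lease at t=0
# server-old is killed right after t=0 and never renews again.
early = lease.try_acquire("server-new", now=30)  # lease not yet expired
late = lease.try_acquire("server-new", now=60)   # full duration has elapsed
```

A new candidate cannot tell a dead leader from a slow one, so it can only take over once a full lease duration has passed without a renewal.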

Nick-0314 avatar Sep 26 '23 02:09 Nick-0314

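@dundun9 The ingester does not write through a ClickHouse (CK) cluster interface; it writes directly to the CK endpoints. The ingester is slow because it first has to wait for the controller to obtain information about all servers, then compute the mapping between servers and CK, and only then can it determine which CK instances to write to. Only after that does the ingester start.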
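
To illustrate why the ingester depends on the full server list, here is a sketch of one possible mapping step. The thread does not describe the actual mapping rule; the round-robin assignment, the function name, and the server and endpoint names below are all assumptions.

```python
def assign_clickhouse(servers, ck_endpoints):
    """Map each server to one CK endpoint, round-robin over sorted names.

    Hypothetical rule for illustration only; the real rule is not given
    in this thread.
    """
    if not ck_endpoints:
        raise ValueError("no ClickHouse endpoints available")
    ordered_ck = sorted(ck_endpoints)
    return {
        server: ordered_ck[i % len(ordered_ck)]
        for i, server in enumerate(sorted(servers))
    }


mapping = assign_clickhouse(
    servers=["server-2", "server-0", "server-1"],
    ck_endpoints=["ck-1", "ck-0"],
)
```

Any rule of this kind gives a different result when a server is missing from the input, which is consistent with the ingester having to wait until the controller knows about all servers.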

lzf575 avatar Sep 26 '23 06:09 lzf575

Is there room for optimization? Fetching the endpoints should be fast; is it fetching the full set of servers that is slow?

Nick-0314 avatar Sep 26 '23 06:09 Nick-0314

@SongZhen0704 Hi, any update on this issue?

jiekun avatar Oct 09 '23 10:10 jiekun

@lzf575 @SongZhen0704 Hi, any update?

jiekun avatar Oct 11 '23 00:10 jiekun

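@dundun9 After optimization, ingester startup has gone from roughly one and a half minutes down to about 30 seconds: https://github.com/deepflowio/deepflow/pull/4604. What remains is mainly optimizing controller startup. @SongZhen0704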

lzf575 avatar Nov 16 '23 08:11 lzf575

For this issue, two remaining tasks are:

  1. The deepflow-server needs to ensure it's ready before responding to the readiness probe.
  2. Optimization of the controller's startup process, especially in scenarios with a large number of deepflow-server replicas.

@SongZhen0704 we need to review the startup process and time consumption of the controller in scenarios with numerous deepflow-servers, to clearly identify the direction for optimization.

Regarding the requirement for hot updating of server configuration information, @jiekun can submit a separate Feature Request to follow up.

sharang avatar Dec 20 '23 05:12 sharang