[BUG] Server starts too slowly
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
DeepFlow Component
Server
What you expected to happen
The DeepFlow server runs a lot of preparation logic at startup; it takes several minutes before it can actually process messages from agents.
We have 20+ server pods. Once they restart (reboot), crash, or go through a rolling update, there is a significantly long period during which "no one is actually working":
- The health check turns green once the querier is ready. It returns `200`, so the dashboards in Grafana become available, but the server itself is not yet ready to process new messages.
- When the health check returns `200`, Kubernetes considers the pod ready and destroys the previous pod; another new pod is then scheduled and starts trying to become ready.
- So within a minute or two, all pods have restarted and reached `200` status, but none of them is actually working.
Essentially we expect:
- The server should be ready within a minute.
- The server should not claim itself `healthy` before everything is ready (see the sketch below).
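To make the expectation concrete, here is a minimal sketch of the kind of gating we have in mind, assuming hypothetical per-component readiness flags and a /readyz path (the real deepflow-server wiring will differ): the readiness endpoint only reports ready once every subsystem is ready, so Kubernetes keeps the previous pod around until the new one can actually handle traffic.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// Hypothetical readiness flags; in the real server these would be flipped by
// the controller, querier and ingester once their own initialization finishes.
var controllerReady, querierReady, ingesterReady atomic.Bool

func readyzHandler(w http.ResponseWriter, r *http.Request) {
	// Only report ready when *all* components are ready, so Kubernetes does
	// not terminate the previous pod while this one still cannot do any work.
	if controllerReady.Load() && querierReady.Load() && ingesterReady.Load() {
		w.WriteHeader(http.StatusOK)
		return
	}
	http.Error(w, "not ready", http.StatusServiceUnavailable)
}

func main() {
	http.HandleFunc("/readyz", readyzHandler) // endpoint path is an assumption
	_ = http.ListenAndServe(":8080", nil)
}
```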
How to reproduce
Restart the DeepFlow server.
DeepFlow version
Any
DeepFlow agent list
Not relevant
Kubernetes CNI
No response
Operation-System/Kernel version
Not relevant
Anything else
Not relevant
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
@lzf575 We may need to consider hooking the ingester into the health check endpoint as well, so that the health check only returns 200 once the ingester is ready.
Actually, the main problem is how to reduce the startup time. I am not yet familiar with the flow (leader election, executing SQL, and so on), but these steps seem to take far too long.
Consider the following scenario:
- there are 10 server pods;
- a release contains a backward-incompatible SQL change.
During the rolling upgrade:
- server_1 is brought up, executes the SQL, and the table schema changes;
- server_2 ~ server_10 cannot process anything because their schema is now incompatible;
- each server takes about 3-5 min to roll over, so:
  - updating the whole Deployment takes 3 min * 10 = 30 min;
  - after the first 3 min there is only 1 usable server, with the other 9 still waiting to be updated;
  - over the remaining 27 minutes the number of usable servers gradually grows back until everything recovers.
The point of this example is not that backward-incompatible SQL will actually ship; I trust the developers are careful when altering tables. But an overly long startup time will keep causing problems of this kind, so it remains a hidden risk.
The purpose of this issue is to discuss the startup flow and work out whether there are places that can be optimized, either by the maintainers or by the community.
Some additional background:
The Server currently has no way to adjust its configuration dynamically (a configuration center or deepflow-ctl are both possible future solutions). If an exporter needs a configuration change (for example, manually disabling the export of some protocol data during peak hours), we have to modify the ConfigMap and run kubectl rollout restart. Frequent restarts in production are risky and hard to control, and the health check does not reflect the real availability, which is why this issue was opened.
By the way, I would personally recommend a Recreate update rather than a rolling update; it is faster:
```yaml
spec:
  strategy:
    type: Recreate
```
@lzf575 The ingester seems to start rather slowly; is there anything that could be optimized?
@jiekun There is one place in the server that may need optimization. We call this feature k8s leader-election: the server needs to elect a leader to handle operations such as writing to the database. At the moment, every time the server is recreated, the new server takes 60 seconds to win the election. Those 60 seconds are mostly spent confirming that the previous leader has completely stopped renewing the lease, i.e. that it has really gone away. I have not yet come up with a good way to optimize this.
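Roughly speaking, that 60-second wait corresponds to the lease duration used by the election: a new candidate has to wait out the full LeaseDuration before it can take over a lease whose holder disappeared without releasing it. Below is a minimal client-go sketch of the mechanism (the lock name, namespace and durations are assumptions, not DeepFlow's actual code); the usual knobs are shortening LeaseDuration, at the cost of more renew traffic, or releasing the lease on graceful shutdown so that only real crashes pay the full wait.

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func runLeaderElection(ctx context.Context, client kubernetes.Interface, id string) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "deepflow-server", // assumed lock name
			Namespace: "deepflow",        // assumed namespace
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock: lock,
		// A crashed leader is only treated as gone after LeaseDuration without
		// a renewal, which is where a ~60s takeover delay comes from.
		LeaseDuration: 60 * time.Second,
		RenewDeadline: 15 * time.Second,
		RetryPeriod:   5 * time.Second,
		// ReleaseOnCancel lets a gracefully terminating leader give up the
		// lease immediately, so planned restarts do not pay the full wait.
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* leader-only work, e.g. DB operations */ },
			OnStoppedLeading: func() { os.Exit(0) },
		},
	})
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	runLeaderElection(context.Background(), kubernetes.NewForConfigOrDie(cfg), os.Getenv("POD_NAME"))
}
```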
@dundun9 The ingester does not use the ClickHouse cluster interface; it writes directly to the ClickHouse endpoints. The ingester is slow because it first has to wait for the Controller to fetch the information of all servers, then compute the mapping between servers and ClickHouse nodes; only after it knows which ClickHouse nodes to write to does the ingester start.
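As a rough illustration of that dependency chain (the names, polling and assignment logic below are hypothetical, not the actual deepflow-server code): the ingester blocks until the complete server list is available and the server-to-ClickHouse mapping has been computed, and only then starts its writers.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForServers polls until the controller reports the expected number of
// server replicas; in the flow described above, the ingester cannot start
// before this completes.
func waitForServers(ctx context.Context, fetch func() ([]string, error), want int) ([]string, error) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		if servers, err := fetch(); err == nil && len(servers) >= want {
			return servers, nil
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-ticker.C:
		}
	}
}

// assignEndpoints decides which ClickHouse endpoints this server writes to,
// here simply by the server's position in the full server list. The real
// mapping is more involved; the point is only that it needs the complete
// server list before the ingester can be started.
func assignEndpoints(self string, servers, ckEndpoints []string) []string {
	idx := 0
	for i, s := range servers {
		if s == self {
			idx = i
			break
		}
	}
	return []string{ckEndpoints[idx%len(ckEndpoints)]}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// Placeholder standing in for the controller query.
	fetch := func() ([]string, error) { return []string{"server-0", "server-1"}, nil }

	servers, err := waitForServers(ctx, fetch, 2) // blocking wait on the controller
	if err != nil {
		panic(err)
	}
	targets := assignEndpoints("server-0", servers, []string{"ck-0:9000", "ck-1:9000"})
	fmt.Println("ingester starts, writing to", targets) // only now does the ingester start
}
```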
Is there room for optimization? Fetching the endpoints should be fast; is fetching the full list of servers the slow part?
@SongZhen0704 Hi, any update on this issue?
@lzf575 @SongZhen0704 Hi, any update?
@dundun9 The ingester startup has been optimized, dropping from roughly 1.5 minutes to about 30 seconds (https://github.com/deepflowio/deepflow/pull/4604); this is mainly an optimization of the controller startup. @SongZhen0704
For this issue, two remaining tasks are:
- The deepflow-server needs to ensure it's ready before responding to the readiness probe.
- Optimization of the controller's startup process, especially in scenarios with a large number of deepflow-server replicas.
@SongZhen0704 we need to review the startup process and time consumption of the controller in scenarios with numerous deepflow-servers, to clearly identify the direction for optimization.
Regarding the requirement for hot updating of server configuration information, @jiekun can submit a separate Feature Request to follow up.