
[Question] Questions about Metrics Collection Architecture

Open JuJinPark opened this issue 8 months ago • 8 comments

Question

I am curious about a few points related to the system architecture:

  • Collectors can be scaled out easily, but can the manager be scaled horizontally as well?

    • If not, is there any plan to improve or change the architecture to better support horizontal scaling in the future?
  • Is there a specific reason for using a persistent custom TCP protocol between the manager and collectors?

    • Was using HTTP APIs considered but rejected due to performance or efficiency concerns?

Just wanted to understand the reasoning behind these choices and whether any future improvements are being planned in these areas.

Thanks in advance!

JuJinPark avatar Apr 07 '25 13:04 JuJinPark

hi, the manager currently does not have clustering or high availability, which is something we should consider later. How about using Raft to implement manager clustering? We use Protobuf on top of TCP because the messages are small and transmission is efficient, and it lets us customize the heartbeat, task dispatch, and so on.
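For context on what a compact custom protocol over a persistent TCP connection looks like, here is a minimal, hypothetical sketch of length-prefixed framing with a message-type header in plain Java. It is not HertzBeat's actual wire format (which serializes payloads with Protobuf); the type constants and the UTF-8 string payload are stand-ins for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical length-prefixed framing over a persistent TCP stream.
 * HertzBeat serializes payloads with Protobuf; here a UTF-8 string
 * stands in for the payload to keep the sketch self-contained.
 */
public final class FrameCodec {

    // Assumed message types; the real protocol defines its own set.
    public static final byte TYPE_HEARTBEAT = 1;
    public static final byte TYPE_TASK_DISPATCH = 2;

    /** Writes one frame: [type: 1 byte][length: 4 bytes][payload: length bytes]. */
    public static void writeFrame(DataOutputStream out, byte type, byte[] payload) throws IOException {
        out.writeByte(type);
        out.writeInt(payload.length);
        out.write(payload);
        out.flush();
    }

    /** Reads one frame and returns its payload; the caller switches on the type. */
    public static byte[] readFrame(DataInputStream in) throws IOException {
        byte type = in.readByte();
        int length = in.readInt();
        byte[] payload = new byte[length];
        in.readFully(payload);
        System.out.println("received frame type=" + type + " bytes=" + length);
        return payload;
    }

    public static void main(String[] args) throws IOException {
        // Round-trip a heartbeat frame through in-memory streams for demonstration.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeFrame(new DataOutputStream(buffer), TYPE_HEARTBEAT,
                "collector-1 alive".getBytes(StandardCharsets.UTF_8));
        byte[] payload = readFrame(new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(new String(payload, StandardCharsets.UTF_8));
    }
}
```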

tomsun28 avatar Apr 13 '25 12:04 tomsun28

@tomsun28 This is one possible suggestion for a new architecture

[Architecture diagram attached in the original comment]

Brief Explanation

  1. Shared Storage (e.g., Redis): registers active collectors and their heartbeats, and maintains a consistent hash ring that is updated whenever the collector list changes (a minimal hash-ring sketch follows this list)
  2. Queues (e.g., Kafka or Redis) for Communication: a JobDispatchQueue distributes collection jobs (one topic/partition per collector), and a MetricJobQueue receives the collected metric results
  3. Manager Becomes Stateless: periodically reads the collector list and hash ring from shared storage, assigns jobs by publishing to the appropriate queue, then pulls results, runs the alarming logic, and writes to the history DB
  4. Collector Becomes an Async Worker: subscribes to its own topic/queue, pulls jobs, collects metrics, and sends the results to the MetricJobQueue
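To make the consistent hash ring in step 1 concrete, below is a minimal Java sketch using a TreeMap with virtual nodes. The class and method names are hypothetical, not HertzBeat APIs; it only shows how a stateless manager could map a job id to a collector and how the mapping shifts when collectors join or leave.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical consistent hash ring mapping job ids to collector ids. */
public final class CollectorHashRing {

    private static final int VIRTUAL_NODES = 100;       // smooths the distribution
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addCollector(String collectorId) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(collectorId + "#" + i), collectorId);
        }
    }

    public void removeCollector(String collectorId) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.remove(hash(collectorId + "#" + i));
        }
    }

    /** Returns the collector responsible for the given job id. */
    public String route(String jobId) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no collectors registered");
        }
        // First ring position clockwise from the job's hash, wrapping to the start.
        SortedMap<Long, String> tail = ring.tailMap(hash(jobId));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            // Use the first 8 bytes of the digest as the ring position.
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        CollectorHashRing ring = new CollectorHashRing();
        ring.addCollector("collector-a");
        ring.addCollector("collector-b");
        System.out.println("job-42 -> " + ring.route("job-42"));
    }
}
```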

Considerations

  • Ensure consistent hash ring state across Manager instances
  • Handle possible job duplication or misrouting during failover or rebalancing (a simple deduplication guard is sketched after this list)
  • Adds infrastructure complexity (Kafka, Redis need to be highly available)
  • Potential increase in end-to-end latency compared to current push model
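On the job-duplication consideration above, one common mitigation (purely illustrative, not an existing HertzBeat mechanism) is for each collector to ignore re-dispatched jobs whose dispatch version it has already accepted:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical guard that drops re-dispatched jobs a collector has already accepted. */
public final class DuplicateJobGuard {

    // jobId -> highest dispatch version accepted by this collector
    private final Map<String, Long> accepted = new ConcurrentHashMap<>();

    /** Returns true if the job should be executed, false if it is a stale duplicate. */
    public boolean shouldExecute(String jobId, long dispatchVersion) {
        final boolean[] fresh = {false};
        accepted.compute(jobId, (id, seen) -> {
            if (seen == null || dispatchVersion > seen) {
                fresh[0] = true;            // first time we see this version: run it
                return dispatchVersion;
            }
            return seen;                    // duplicate or stale re-dispatch: keep the highest seen version
        });
        return fresh[0];
    }

    public static void main(String[] args) {
        DuplicateJobGuard guard = new DuplicateJobGuard();
        System.out.println(guard.shouldExecute("job-42", 1)); // true: first dispatch
        System.out.println(guard.shouldExecute("job-42", 1)); // false: duplicate after a rebalance
        System.out.println(guard.shouldExecute("job-42", 2)); // true: genuinely newer dispatch
    }
}
```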

Please treat this as just one suggestion to start a discussion. 😄 I understand that adopting a new architecture would require significant changes, testing, and collaboration from many contributors.

Looking forward to hearing thoughts from your team and other contributors 🙌

JuJinPark avatar Apr 15 '25 11:04 JuJinPark

hi, very detailed and carefully designed. 👍 I have some questions about this. Is the Shared Storage a separate program, or is it Redis itself that controls the collector heartbeat and consistent-hash dispatch? It seems that if we want distributed multi-managers and high availability, we would need both a highly available Kafka cluster and a highly available Redis cluster to build this. I personally feel that this architecture is too complicated, which would make it difficult for users to maintain. Multiple external dependencies also expose more potential points of failure.

How about we just use multiple managers? Use the Raft protocol to implement a one-leader, multi-follower manager structure.

  • Regarding the consistent hash ring state, the followers can synchronize this data from the leader (a simplified sketch follows this list).
  • A collector can connect to any one manager, but collectors periodically synchronize metadata to know which manager is the current leader.
  • The leader controls all changes to shared data, and the other nodes only follow and replicate.
  • When the leader goes down, the Raft protocol elects a new leader node, continues operating, and notifies the collectors.
  • We also need to modify the manager to keep the rest of its data as stateless as possible.
  • and so on
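To illustrate the leader/follower split proposed above (this is only a toy illustration of the idea, not an existing HertzBeat component and not a real Raft implementation, which a library would provide), the leader could be the only node allowed to mutate shared cluster state while followers apply a replicated stream of committed changes:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch of leader-gated writes to shared cluster state.
 * A real implementation would sit on top of a Raft library that handles
 * elections, log replication, and commit acknowledgements.
 */
public final class ManagerClusterState {

    /** A single replicated change, e.g. "collector-3 registered". */
    public record StateChange(long index, String key, String value) {}

    private final boolean leader;                        // decided by Raft election in practice
    private final Map<String, String> state = new LinkedHashMap<>();
    private final List<StateChange> log = new ArrayList<>();

    public ManagerClusterState(boolean leader) {
        this.leader = leader;
    }

    /** Only the leader accepts writes; followers must redirect callers to the leader. */
    public StateChange propose(String key, String value) {
        if (!leader) {
            throw new IllegalStateException("not the leader: redirect this write");
        }
        StateChange change = new StateChange(log.size(), key, value);
        apply(change);
        return change;                                   // in Raft, returned only after a quorum commits it
    }

    /** Followers apply committed changes in log order to stay consistent with the leader. */
    public void apply(StateChange change) {
        log.add(change);
        state.put(change.key(), change.value());
    }

    public static void main(String[] args) {
        ManagerClusterState leader = new ManagerClusterState(true);
        ManagerClusterState follower = new ManagerClusterState(false);

        StateChange change = leader.propose("collector-3", "registered");
        follower.apply(change);                          // replication, normally pushed by Raft

        System.out.println("follower sees: " + follower.state);
    }
}
```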

This is just my rough idea. Maybe there are better solutions. Discussion is welcome.

tomsun28 avatar Apr 16 '25 15:04 tomsun28

hi 👋 Thanks for sharing the idea 👍 — I find the direction very interesting and agree with your point about reducing complexity and minimizing external dependencies.

I’d love to better understand how this Raft-based model would work in practice.

  • Could you explain more about the specific roles and responsibilities of the Leader and Follower Managers?
    • Is the Leader solely responsible for maintaining the consistent hash ring, assigning jobs, and tracking collectors?
    • Do Followers replicate this state and just wait for failover? Or can they serve other functions?

And could you walk through a few specific flows, such as:

  • How do collectors initially connect — do they always connect directly to the leader, or to any manager and get redirected?
  • How are collection jobs assigned and pushed to collectors?
  • How is metric data returned and handled in this model?

JuJinPark avatar Apr 18 '25 06:04 JuJinPark

hi 👍 This is just my initial idea. I will conduct in-depth research on it when I have time. Some manager data is stateful, e.g. the consistent hash ring and the cluster state info. This stateful data needs to be maintained by the Leader, and the Followers synchronously copy it from the Leader to ultimately keep the data consistent. If the Leader goes down, one of the Followers becomes the new Leader.

How do collectors initially connect — do they always connect directly to the leader, or to any manager and get redirected?

I think the collectors can connect to any one of the managers, since the consistent hash ring state is the same in every manager. Maybe we can learn here from how the Kafka client and broker cluster are designed.
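The "learn from the Kafka client" idea roughly means a collector only needs a bootstrap list of manager addresses: it asks whichever manager responds for the current cluster metadata, including who the leader is. A hypothetical sketch of that handshake follows; the record, method names, and addresses are all made up for illustration.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical bootstrap-style discovery, loosely modelled on how Kafka clients find brokers. */
public final class ManagerDiscovery {

    /** What a manager would return from a metadata request. */
    public record ClusterMetadata(String leaderAddress, List<String> allManagers) {}

    /** Stand-in for a real RPC; here it just pretends the queried manager answered. */
    static Optional<ClusterMetadata> requestMetadata(String managerAddress) {
        // In a real system this would be a TCP/Protobuf call that may fail or time out.
        System.out.println("asking " + managerAddress + " for cluster metadata");
        return Optional.of(new ClusterMetadata("manager-1:9000",
                List.of("manager-1:9000", "manager-2:9000", "manager-3:9000")));
    }

    /** Try each bootstrap address until one responds, then remember the whole cluster. */
    public static ClusterMetadata discover(List<String> bootstrapManagers) {
        for (String address : bootstrapManagers) {
            Optional<ClusterMetadata> metadata = requestMetadata(address);
            if (metadata.isPresent()) {
                return metadata.get();
            }
        }
        throw new IllegalStateException("no manager in the bootstrap list responded");
    }

    public static void main(String[] args) {
        ClusterMetadata metadata = discover(List.of("manager-2:9000", "manager-1:9000"));
        System.out.println("current leader: " + metadata.leaderAddress());
    }
}
```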

How are collection jobs assigned and pushed to collectors?

As above, the Leader decides how jobs are assigned to collectors, the Followers synchronously copy that data from the Leader, and then the actual allocation is carried out. But we still need to consider: if a collector can register with any manager, how does the Leader know which collector has registered?
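On the open question of how the Leader learns about a collector that registered with a Follower, one common pattern (again only a hypothetical sketch, not a settled design) is for the Follower that holds the connection to forward the registration event to the Leader, which then commits it into the replicated state:

```java
/** Hypothetical forwarding of a collector registration from a follower to the leader. */
public final class RegistrationForwarder {

    /** Minimal view of what the manager handling the connection needs to know. */
    interface ManagerNode {
        boolean isLeader();
        void commitRegistration(String collectorId);     // leader: write into the replicated state
        void forwardToLeader(String collectorId);        // follower: send the event to the leader
    }

    /** Whichever manager the collector connected to routes the registration appropriately. */
    static void onCollectorRegistered(ManagerNode self, String collectorId) {
        if (self.isLeader()) {
            self.commitRegistration(collectorId);
        } else {
            // The follower keeps the TCP connection but lets the leader own the state change.
            self.forwardToLeader(collectorId);
        }
    }

    public static void main(String[] args) {
        ManagerNode follower = new ManagerNode() {
            public boolean isLeader() { return false; }
            public void commitRegistration(String id) { System.out.println("commit " + id); }
            public void forwardToLeader(String id) { System.out.println("forward " + id + " to leader"); }
        };
        onCollectorRegistered(follower, "collector-7");
    }
}
```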

How is metric data returned and handled in this model?

Maybe it will be the same as now (the collector sends the data directly to the manager it is connected to), but we need to consider unified data processing and unified alarm calculation.

tomsun28 avatar Apr 19 '25 03:04 tomsun28

Thank you for taking the time to answer all my questions — I really appreciate it! 😄 I’m looking forward to hearing more about your in-depth research soon.

One concern I still have is about the core concept of the Leader–Follower model, where the leader handles all writes to maintain consistency, and followers only replicate state and serve reads.

But I think HertzBeat is a write-heavy system, at least from the Manager’s perspective. So the Leader–Follower model could become problematic. For example:

  • A large and growing number of collectors send frequent heartbeats
  • Collectors register and reconnect often (especially during scaling or failovers)
  • Metric collection results are high-frequency writes

Even if the metric collection results are pushed elsewhere, I'm concerned the leader could still suffer from too much traffic and become a bottleneck, similar to the original scalability challenge we're trying to solve.

I also have one more question: 👉 Is assigning a job to a specific collector a core design requirement? If not, maybe we could eliminate the consistent hashing entirely, which would let us remove the centralized state and make it much easier to support a stateless, horizontally scalable Manager cluster.

JuJinPark avatar Apr 19 '25 07:04 JuJinPark

Hi, assigning jobs to a specific collector is a requirement due to the Cloud-Edge mode.

tomsun28 avatar Apr 19 '25 13:04 tomsun28

Oh right — I totally forgot about the Cloud-Edge mode. Thanks for the reminder!

I’m very interested in exploring more scalable architecture options, so please let me know when this topic becomes active — whether your team or other contributors begin working on it or start discussing it more in depth.

I understand this might not be a high-priority task at the moment and that there are probably more important jobs to focus on first.

In the meantime, I'll also think about other possible approaches. If I come up with anything valuable, I'll be sure to share it here. Thanks!

JuJinPark avatar Apr 20 '25 12:04 JuJinPark