Implement Raft or clustering support in eKuiper
@ngjaying Hi eKuiper team I would like to propose the implementation of Raft or another form of clustering/replication in eKuiper.
Currently, eKuiper does not support any form of distributed replication—its state is either in-memory or persisted via local checkpoints. Adding a clustering mechanism would bring fault tolerance and strong, or weaker, consistency across multiple nodes, making the system more robust and suitable for distributed edge deployments.
If you agree, I would be happy to take on the implementation, draft a design proposal, and submit incremental PRs for review. I believe this could be a valuable addition to the project and help broaden its use cases.
Thank you for considering this suggestion, and I look forward to your feedback!
Best regards.
@Elia-Renzoni Thanks for the great suggestion. My concern is that a full, from-scratch Raft implementation would be a major undertaking and may not be the most practical approach. Could you please share the rough design proposal? Thanks.
Hi @ngjaying
My idea is to create a more comprehensive approach to fault tolerance than the one currently available in eKuiper, focusing specifically on improving the existing checkpointing functionality. Conceptually, this can be seen as a broadcast of the logs being checkpointed. This way, if a single machine fails at the hardware level, a backup or follower node can immediately take over and continue processing.
In this design, Raft would operate only when the user has configured checkpointing. In other words, Raft should be considered an optional feature that adds stronger fault tolerance to the checkpointing mechanism. Users could still perform checkpointing without Raft (and therefore without log replication) if they prefer.
It’s also worth noting that Raft should not introduce significant complexity or overhead to the codebase. Many Raft implementations are relatively compact — around 2–3K lines of code — and there are numerous well-tested, production-grade libraries available that can be integrated instead of writing everything from scratch.
The implementation I envision would run entirely in the background — input ingestion performance should not be affected by the log replication process with a majority quorum. It would simply replicate the logs in the background.
This feature should not interfere with the existing checkpointing interval configuration. In other words, the current user-defined intervals remain valid and untouched; during each configured checkpoint, the system would broadcast the logs and verify whether replication succeeded before committing.
Hi @ngjaying
I hope you’re doing well! I wanted to follow up on the design proposal I submitted. I completely understand that reviewing design proposals takes time, and I really appreciate all the work you and the team do on eKuiper.
Now that I have more time, I’m happy to dedicate effort to work on this proposal if you think the design is aligned with the project’s direction. If not, I can still work on it as a side project, always in accordance with the Apache 2.0 license guidelines. Any feedback, suggestions, or guidance you can provide whenever you have a chance would be greatly appreciated.
Thank you very much for your time and attention, and I look forward to hearing your thoughts!