ADR: HA failover
Overview
This PR proposes an approach to a high-availability (HA) failover system for the unlikely event that a sequencer goes down.
The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).
| Build | Format | Lint | Breaking | Updated (UTC) |
|---|---|---|---|---|
| ✅ passed | ⏩ skipped | ❌ failed (14) | ✅ passed | Aug 29, 2025, 8:44 AM |
Claude finished @tac0turtle's task
Pull Request Review
I've thoroughly reviewed this ADR for HA sequencer failover. Overall, this is a well-structured and comprehensive architectural decision record that addresses a critical production need. Here's my detailed feedback:
Todo List:
- [x] Read and analyze ADR-023-sequencer-recovery.md
- [x] Read and analyze control.proto file
- [x] Review architectural design decisions
- [x] Check for security considerations
- [x] Evaluate performance implications
- [x] Review test coverage approach
- [x] Provide comprehensive feedback
Code Quality & Best Practices
Strengths:
- Excellent documentation structure: The ADR follows proper format with clear context, alternatives, decision rationale, and consequences
- Well-defined API: The protobuf schema is clean, well-commented, and follows good naming conventions
- Comprehensive coverage: Addresses operational concerns, security, testing, and monitoring
- Industry alignment: references OP Stack's `op-conductor`, showing awareness of proven patterns
Areas for improvement:
- Missing API versioning strategy: while the proto package includes `v1`, there's no discussion of how breaking changes to the admin API will be handled in future versions
- Error handling specificity: the error semantics section could benefit from more specific error codes for common failure scenarios (e.g., different types of fencing violations)
Security Review
Strong points:
- mTLS + RBAC authentication properly specified
- Fencing mechanisms to prevent split-brain scenarios
- Audit logging for all critical operations
- Admin API properly isolated from public endpoints
Concerns:
- `lease_token` security (control.proto:45, 57, 68): the lease token is defined as `bytes` but lacks specification of:
  - token generation/validation algorithms
  - expiration mechanisms
  - cryptographic properties (should it be signed? encrypted?)
- Break-glass procedures (ADR-023-sequencer-recovery.md:204): mentioned but not detailed. Critical security operations need explicit procedures and access controls
- Handoff ticket security (control.proto:75): like lease tokens, handoff tickets need a cryptographic specification to prevent forgery
Performance Considerations
Positive aspects:
- Acknowledges minimal overhead for both designs
- Realistic SLO targets (p50 ≤ 15s, p95 ≤ 45s failover)
- Separate admin API prevents performance impact on public endpoints
Potential issues:
- Raft snapshot frequency: No specification of snapshot intervals which could impact memory usage and recovery time
- Lease renewal frequency: Design B lacks specification of renewal intervals vs. detection time trade-offs
Architectural Review
Design strengths:
- Clear separation between control plane (Raft/Lease) and data plane (ev-node)
- Flexible approach supporting both designs based on operational needs
- UnsafeHead tracking ensures proper state continuity
- Two-phase handoff (PrepareHandoff/CompleteHandoff) enables zero-downtime upgrades
Missing considerations:
- Observability gaps:
  - No specification for health check intervals
  - Missing metrics for handoff duration/success rates
  - No alerting thresholds defined
- Edge case handling:
  - What happens if both nodes fail simultaneously?
  - Recovery procedures when control plane data is corrupted
  - Network partition scenarios beyond what's briefly mentioned
- State synchronization: how does a standby node ensure it's caught up before promotion? The ADR mentions this briefly but lacks implementation details.
Test Coverage Assessment
Good coverage planned:
- Kill tests for failover validation
- Partition tests for split-brain prevention
- Blue/green handoff testing
- Long-duration outage drills
Missing test scenarios:
- Concurrent API calls: What happens if multiple StartSequencer calls happen simultaneously?
- Token/ticket replay attacks: Security testing for cryptographic components
- Performance degradation testing: How does the system behave under resource pressure?
- Configuration drift: Testing mismatched configurations between nodes
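The concurrent-`StartSequencer` scenario above is straightforward to guard against and to test; a sketch of one possible guard (a compare-and-swap on an "active" flag, with the RPC name borrowed from the ADR but the implementation purely hypothetical):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

var active atomic.Bool

// startSequencer: of N simultaneous callers, only the first may win;
// the rest must fail cleanly rather than start a second block producer.
func startSequencer() error {
	if !active.CompareAndSwap(false, true) {
		return errors.New("sequencer already active")
	}
	return nil
}

func main() {
	var wins atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if startSequencer() == nil {
				wins.Add(1)
			}
		}()
	}
	wg.Wait()
	fmt.Println(wins.Load()) // 1
}
```

A test along these lines, asserting exactly one winner out of N concurrent calls, would cover the first missing scenario directly.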
Technical Issues
- Proto field consistency: the `HealthResponse` message uses `uint64` for `l2_number`, matching the type in `UnsafeHead`; good consistency.
- Missing validation: no specification of input validation (e.g., `target_id` format in `PrepareHandoffRequest`)
- Idempotency implementation: while idempotency keys are included, there's no specification of how they're validated or stored
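For the idempotency gap, one conventional answer is to store the first result per key and replay it on duplicates. A minimal in-memory sketch under that assumption (a production store would also need TTL eviction and persistence across restarts):

```go
package main

import (
	"fmt"
	"sync"
)

// idemStore: remember the first result per idempotency key and replay it
// for duplicate requests instead of re-executing the operation.
type idemStore struct {
	mu      sync.Mutex
	results map[string]string
}

func (s *idemStore) do(key string, op func() string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if r, ok := s.results[key]; ok {
		return r // duplicate request: replay stored result, skip op
	}
	r := op()
	s.results[key] = r
	return r
}

func main() {
	s := &idemStore{results: map[string]string{}}
	calls := 0
	op := func() string { calls++; return fmt.Sprintf("started-%d", calls) }
	fmt.Println(s.do("req-1", op)) // started-1
	fmt.Println(s.do("req-1", op)) // started-1 (op not re-run)
	fmt.Println(calls)             // 1
}
```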
Recommendations
High Priority:
- Add cryptographic specifications for `lease_token` and `handoff_ticket`
- Define break-glass procedures and access controls
- Specify health check and lease renewal intervals
- Add input validation requirements
Medium Priority:
- Define API versioning and backward compatibility strategy
- Add more detailed error scenarios and recovery procedures
- Specify observability metrics thresholds and alerting rules
- Add edge case handling procedures
Low Priority:
- Consider adding request/response size limits
- Add performance benchmarking requirements to testing plan
Overall Assessment
This is a solid architectural foundation for HA sequencer failover. The two-design approach is pragmatic, and the separation of concerns is well thought out. The main gaps are in security specifications and operational details, but the core design is sound and implementable.
The ADR demonstrates good understanding of distributed systems challenges and provides a clear path forward for implementation.
Just to clarify: there will be n nodes from the same "operator".
They will all be running all the time; only one runs as the aggregator, the rest do not (but still run with the same priv node key).
If someone detects that the aggregator has a problem, one of the other n nodes is notified via the Raft network and takes over the aggregator role.
Questions
- Is it possible now to run two nodes with the same priv key even if one is not a validator?
- How does a node become the aggregator? By restarting it with the aggregator flag?
> Is it possible now to run two nodes with the same priv key even if one is not a validator?

Yes it is; there will be a new API that enables block production.

> How does a node become the aggregator? By restarting it with the aggregator flag?

Via the new API. It's missing from the ADR; I'll look at adding it to provide more information.
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 72.30%. Comparing base (f483445) to head (5e0e181).
:warning: Report is 149 commits behind head on main.
Additional details and impacted files
```
@@ Coverage Diff @@
## main #2598 +/- ##
==========================================
- Coverage 72.41% 72.30% -0.11%
==========================================
  Files        72     72
  Lines      7394   7406   +12
==========================================
+ Hits       5354   5355    +1
- Misses     1600   1611   +11
  Partials    440    440
```
| Flag | Coverage Δ |
|---|---|
| combined | 72.30% <ø> (-0.11% :arrow_down:) |
Flags with carried forward coverage won't be shown.
Code Review for PR #2598: ADR - HA Failover
Thank you for this well-structured ADR proposing a high-availability failover system for the sequencer. This is a critical component for production readiness.
✅ Strengths
Architecture & Design:
- Excellent decision to support both Rafted Conductor (for ≥3 nodes) and simple lease-based failover (for 2 nodes)
- Clear separation of concerns with the Admin Control API on a separate port with mTLS
- Well-defined fencing mechanisms to prevent split-brain scenarios
- Thoughtful state machine diagrams illustrating the handoff and failover flows
- Good alignment with industry patterns (citing OP Stack's similar approach)
Documentation Quality:
- Comprehensive ADR following the standard template
- Clear problem statement and alternative approaches considered
- Detailed security considerations including mTLS and RBAC requirements
- Well-structured protobuf definitions with appropriate error semantics
🔍 Suggestions & Considerations
1. Protobuf Improvements:
- Consider adding field validation annotations (e.g., `[(validate.rules).bytes.len = 32]` for `block_hash`)
- The `timestamp` field in `UnsafeHead` should consider using `google.protobuf.Timestamp` for better compatibility
- Add field documentation comments directly in the proto file for better IDE support
2. Error Handling & Edge Cases:
- The ADR mentions "Long-duration outage drills" but doesn't specify recovery behavior when all nodes have been down
- Consider adding explicit handling for clock skew between nodes (especially important for lease expiry)
- What happens if a node receives a `CompleteHandoff` but hasn't synced the referenced `UnsafeHead`?
3. Operational Concerns:
- The 15s p50 failover target seems reasonable, but consider documenting the breakdown (detection time + election + startup)
- Add metrics for handoff duration to validate the zero-downtime claim
- Consider a `DryRun` mode for `PrepareHandoff` to validate readiness without committing
4. Security Enhancements:
- The `lease_token` and `handoff_ticket` are marked as `bytes`; consider documenting the expected format (JWT? signed protobuf?)
- Add rate limiting considerations for the Admin API to prevent DoS
- Consider adding an audit log service or structured logging requirements for all admin operations
5. Testing Gaps:
- No mention of chaos engineering tests (network delays, clock skew, partial failures)
- Missing performance benchmarks for Raft consensus overhead
- Consider adding integration test specifications with the existing health service
6. Implementation Phasing:
- Phase 1 mentions "sidecar skeletons" - consider if these should be in separate repos or part of ev-node
- The ADR doesn't specify backward compatibility during the rollout phases
🚨 Potential Issues
- Resource Cleanup: the `StopSequencer` RPC doesn't specify cleanup behavior for in-flight transactions or DA submissions
- Observability Gap: no distributed tracing mentioned; this is critical for debugging failover latencies
- Configuration Management: no mention of how Raft cluster membership changes are handled (adding/removing nodes)
📈 Performance Considerations
- Raft heartbeat intervals will need tuning based on network latency
- Consider documenting the expected storage requirements for Raft snapshots
- The protobuf design is efficient with appropriate use of primitive types
🔒 Security Review
- ✅ Good: Separate Admin API with mTLS requirement
- ✅ Good: Fencing tokens to prevent split-brain
- ⚠️ Consider: Adding request signing/HMAC for replay protection
- ⚠️ Consider: Encrypted storage for Raft logs if they contain sensitive metadata
🧪 Test Coverage Recommendations
Since this is an ADR without implementation, I recommend creating comprehensive test specifications for:
- Unit tests for state machine transitions
- Integration tests with mock DA and executor
- E2E tests simulating various failure scenarios
- Load tests to validate no performance regression
- Security tests for authentication bypass attempts
Overall Assessment
This is a solid architectural proposal that demonstrates good understanding of distributed systems challenges. The dual-mode approach (Raft vs Lease) provides flexibility, and the detailed sequence diagrams show careful consideration of the handoff flows. With the suggested improvements, particularly around operational observability and comprehensive testing, this will provide a robust foundation for HA sequencing.
Recommendation: Approved with suggestions. Consider addressing the operational and testing gaps before moving to implementation Phase 2.
Closing this, as alpe has taken over the design and implementation.