rollmint ADR: HA failover

Overview

This PR proposes an approach to having a HA failover system in the off chance a sequencer goes down

Aug 21 '25 14:08 tac0turtle

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

Build	Format	Lint	Breaking	Updated (UTC)
`✅ passed`	`⏩ skipped`	`❌ failed (14)`	`✅ passed`	Aug 29, 2025, 8:44 AM

Aug 21 '25 14:08 github-actions[bot]

PR Preview Action v1.6.2 :---: Preview removed because the pull request was closed. 2025-11-12 09:57 UTC

Aug 21 '25 14:08 github-actions[bot]

Claude finished @tac0turtle's task —— View job

Pull Request Review

I've thoroughly reviewed this ADR for HA sequencer failover. Overall, this is a well-structured and comprehensive architectural decision record that addresses a critical production need. Here's my detailed feedback:

Todo List:

[x] Read and analyze ADR-023-sequencer-recovery.md
[x] Read and analyze control.proto file
[x] Review architectural design decisions
[x] Check for security considerations
[x] Evaluate performance implications
[x] Review test coverage approach
[x] Provide comprehensive feedback

Code Quality & Best Practices

Strengths:

Excellent documentation structure: The ADR follows proper format with clear context, alternatives, decision rationale, and consequences
Well-defined API: The protobuf schema is clean, well-commented, and follows good naming conventions
Comprehensive coverage: Addresses operational concerns, security, testing, and monitoring
Industry alignment: References OP Stack's op-conductor showing awareness of proven patterns

Areas for improvement:

Missing API versioning strategy: While the proto package includes v1, there's no discussion of how breaking changes to the admin API will be handled in future versions
Error handling specificity: The error semantics section could benefit from more specific error codes for common failure scenarios (e.g., different types of fencing violations)

Security Review

Strong points:

mTLS + RBAC authentication properly specified
Fencing mechanisms to prevent split-brain scenarios
Audit logging for all critical operations
Admin API properly isolated from public endpoints

Concerns:

lease_token security (control.proto:45, 57, 68): The lease token is defined as bytes but lacks specification of:
- Token generation/validation algorithms
- Expiration mechanisms
- Cryptographic properties (should it be signed? encrypted?)
Break-glass procedures (ADR-023-sequencer-recovery.md:204): Mentioned but not detailed. Critical security operations need explicit procedures and access controls
Handoff ticket security (control.proto:75): Similar to lease tokens, handoff tickets need cryptographic specification to prevent forgery

Performance Considerations

Positive aspects:

Acknowledges minimal overhead for both designs
Realistic SLO targets (p50 ≤ 15s, p95 ≤ 45s failover)
Separate admin API prevents performance impact on public endpoints

Potential issues:

Raft snapshot frequency: No specification of snapshot intervals which could impact memory usage and recovery time
Lease renewal frequency: Design B lacks specification of renewal intervals vs. detection time trade-offs

Architectural Review

Design strengths:

Clear separation between control plane (Raft/Lease) and data plane (ev-node)
Flexible approach supporting both designs based on operational needs
UnsafeHead tracking ensures proper state continuity
Two-phase handoff (PrepareHandoff/CompleteHandoff) enables zero-downtime upgrades

Missing considerations:

Observability gaps:
- No specification for health check intervals
- Missing metrics for handoff duration/success rates
- No alerting thresholds defined
Edge case handling:
- What happens if both nodes fail simultaneously?
- Recovery procedures when control plane data is corrupted
- Network partition scenarios beyond what's briefly mentioned
State synchronization: How does a standby node ensure it's caught up before promotion? The ADR mentions this briefly but lacks implementation details.

Test Coverage Assessment

Good coverage planned:

Kill tests for failover validation
Partition tests for split-brain prevention
Blue/green handoff testing
Long-duration outage drills

Missing test scenarios:

Concurrent API calls: What happens if multiple StartSequencer calls happen simultaneously?
Token/ticket replay attacks: Security testing for cryptographic components
Performance degradation testing: How does the system behave under resource pressure?
Configuration drift: Testing mismatched configurations between nodes

Technical Issues

Proto field consistency: The HealthResponse message uses uint64 for l2_number while UnsafeHead uses the same type - good consistency.
Missing validation: No specification of input validation (e.g., target_id format in PrepareHandoffRequest)
Idempotency implementation: While idempotency keys are included, there's no specification of how they're validated or stored

Recommendations

High Priority:

Add cryptographic specifications for lease_token and handoff_ticket
Define break-glass procedures and access controls
Specify health check and lease renewal intervals
Add input validation requirements

Medium Priority:

Define API versioning and backward compatibility strategy
Add more detailed error scenarios and recovery procedures
Specify observability metrics thresholds and alerting rules
Add edge case handling procedures

Low Priority:

Consider adding request/response size limits
Add performance benchmarking requirements to testing plan

Overall Assessment

This is a solid architectural foundation for HA sequencer failover. The two-design approach is pragmatic, and the separation of concerns is well thought out. The main gaps are in security specifications and operational details, but the core design is sound and implementable.

The ADR demonstrates good understanding of distributed systems challenges and provides a clear path forward for implementation.

Aug 25 '25 09:08 claude[bot]

Just to clarify, there will be n nodes from the same "operator".

They will be running all the time, only one will run as aggregator, the rest are not (but still run with the same priv node key).

In case someone detects that the aggregator has a problem, another of those n nodes that is not an aggregator will be notified via RAFT network, and will take the role of aggregator.

Questions

Is it possible now to run two nodes with same priv key even if one is not validator?
How a node becomes aggregator, by restarting it with the aggregator flag?

Aug 25 '25 09:08 randygrok

Is it possible now to run two nodes with same priv key even if one is not validator?

yes it is, there will be a new api that enables the block production

How a node becomes aggregator, by restarting it with the aggregator flag?

the new api, its missing from the ADR, ill look at adding it into the ADR to provide more information

Aug 25 '25 12:08 tac0turtle

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 72.30%. Comparing base (f483445) to head (5e0e181). :warning: Report is 149 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2598      +/-   ##
==========================================
- Coverage   72.41%   72.30%   -0.11%     
==========================================
  Files          72       72              
  Lines        7394     7406      +12     
==========================================
+ Hits         5354     5355       +1     
- Misses       1600     1611      +11     
  Partials      440      440

Flag	Coverage Δ
combined	`72.30% <ø> (-0.11%)`	:arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:

:snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Aug 29 '25 08:08 codecov[bot]

Code Review for PR #2598: ADR - HA Failover

Thank you for this well-structured ADR proposing a high-availability failover system for the sequencer. This is a critical component for production readiness.

✅ Strengths

Architecture & Design:

Excellent decision to support both Rafted Conductor (for ≥3 nodes) and simple lease-based failover (for 2 nodes)
Clear separation of concerns with the Admin Control API on a separate port with mTLS
Well-defined fencing mechanisms to prevent split-brain scenarios
Thoughtful state machine diagrams illustrating the handoff and failover flows
Good alignment with industry patterns (citing OP Stack's similar approach)

Documentation Quality:

Comprehensive ADR following the standard template
Clear problem statement and alternative approaches considered
Detailed security considerations including mTLS and RBAC requirements
Well-structured protobuf definitions with appropriate error semantics

🔍 Suggestions & Considerations

1. Protobuf Improvements:

Consider adding field validation annotations (e.g., [(validate.rules).bytes.len = 32] for block_hash)
The timestamp field in UnsafeHead should consider using google.protobuf.Timestamp for better compatibility
Add field documentation comments directly in the proto file for better IDE support

2. Error Handling & Edge Cases:

The ADR mentions "Long-duration outage drills" but doesn't specify recovery behavior when all nodes have been down
Consider adding explicit handling for clock skew between nodes (especially important for lease expiry)
What happens if a node receives a CompleteHandoff but hasn't synced the referenced UnsafeHead?

3. Operational Concerns:

The 15s p50 failover target seems reasonable, but consider documenting the breakdown (detection time + election + startup)
Add metrics for handoff duration to validate zero-downtime claim
Consider a DryRun mode for PrepareHandoff to validate readiness without committing

4. Security Enhancements:

The lease_token and handoff_ticket are marked as bytes - consider documenting the expected format (JWT? signed protobuf?)
Add rate limiting considerations for the Admin API to prevent DoS
Consider adding an audit log service or structured logging requirements for all admin operations

5. Testing Gaps:

No mention of chaos engineering tests (network delays, clock skew, partial failures)
Missing performance benchmarks for Raft consensus overhead
Consider adding integration test specifications with the existing health service

6. Implementation Phasing:

Phase 1 mentions "sidecar skeletons" - consider if these should be in separate repos or part of ev-node
The ADR doesn't specify backward compatibility during the rollout phases

🚨 Potential Issues

Resource Cleanup: The StopSequencer RPC doesn't specify cleanup behavior for in-flight transactions or DA submissions
Observability Gap: No distributed tracing mentioned - critical for debugging failover latencies
Configuration Management: No mention of how Raft cluster membership changes are handled (adding/removing nodes)

📈 Performance Considerations

Raft heartbeat intervals will need tuning based on network latency
Consider documenting the expected storage requirements for Raft snapshots
The protobuf design is efficient with appropriate use of primitive types

🔒 Security Review

✅ Good: Separate Admin API with mTLS requirement
✅ Good: Fencing tokens to prevent split-brain
⚠️ Consider: Adding request signing/HMAC for replay protection
⚠️ Consider: Encrypted storage for Raft logs if they contain sensitive metadata

🧪 Test Coverage Recommendations

Since this is an ADR without implementation, I recommend creating comprehensive test specifications for:

Unit tests for state machine transitions
Integration tests with mock DA and executor
E2E tests simulating various failure scenarios
Load tests to validate no performance regression
Security tests for authentication bypass attempts

Overall Assessment

This is a solid architectural proposal that demonstrates good understanding of distributed systems challenges. The dual-mode approach (Raft vs Lease) provides flexibility, and the detailed sequence diagrams show careful consideration of the handoff flows. With the suggested improvements, particularly around operational observability and comprehensive testing, this will provide a robust foundation for HA sequencing.

Recommendation: Approved with suggestions. Consider addressing the operational and testing gaps before moving to implementation Phase 2.

Sep 25 '25 09:09 claude[bot]

closing this as alpe has taken over the design and implementation

Nov 12 '25 09:11 tac0turtle