rollmint icon indicating copy to clipboard operation
rollmint copied to clipboard

ADR: HA failover

Open tac0turtle opened this issue 4 months ago • 7 comments

Overview

This PR proposes an approach to having a HA failover system in the off chance a sequencer goes down

tac0turtle avatar Aug 21 '25 14:08 tac0turtle

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped❌ failed (14)✅ passedAug 29, 2025, 8:44 AM

github-actions[bot] avatar Aug 21 '25 14:08 github-actions[bot]

PR Preview Action v1.6.2 :---: Preview removed because the pull request was closed. 2025-11-12 09:57 UTC

github-actions[bot] avatar Aug 21 '25 14:08 github-actions[bot]

Claude finished @tac0turtle's task —— View job


Pull Request Review

I've thoroughly reviewed this ADR for HA sequencer failover. Overall, this is a well-structured and comprehensive architectural decision record that addresses a critical production need. Here's my detailed feedback:

Todo List:

  • [x] Read and analyze ADR-023-sequencer-recovery.md
  • [x] Read and analyze control.proto file
  • [x] Review architectural design decisions
  • [x] Check for security considerations
  • [x] Evaluate performance implications
  • [x] Review test coverage approach
  • [x] Provide comprehensive feedback

Code Quality & Best Practices

Strengths:

  • Excellent documentation structure: The ADR follows proper format with clear context, alternatives, decision rationale, and consequences
  • Well-defined API: The protobuf schema is clean, well-commented, and follows good naming conventions
  • Comprehensive coverage: Addresses operational concerns, security, testing, and monitoring
  • Industry alignment: References OP Stack's op-conductor showing awareness of proven patterns

Areas for improvement:

  1. Missing API versioning strategy: While the proto package includes v1, there's no discussion of how breaking changes to the admin API will be handled in future versions

  2. Error handling specificity: The error semantics section could benefit from more specific error codes for common failure scenarios (e.g., different types of fencing violations)

Security Review

Strong points:

  • mTLS + RBAC authentication properly specified
  • Fencing mechanisms to prevent split-brain scenarios
  • Audit logging for all critical operations
  • Admin API properly isolated from public endpoints

Concerns:

  1. lease_token security (control.proto:45, 57, 68): The lease token is defined as bytes but lacks specification of:

    • Token generation/validation algorithms
    • Expiration mechanisms
    • Cryptographic properties (should it be signed? encrypted?)
  2. Break-glass procedures (ADR-023-sequencer-recovery.md:204): Mentioned but not detailed. Critical security operations need explicit procedures and access controls

  3. Handoff ticket security (control.proto:75): Similar to lease tokens, handoff tickets need cryptographic specification to prevent forgery

Performance Considerations

Positive aspects:

  • Acknowledges minimal overhead for both designs
  • Realistic SLO targets (p50 ≤ 15s, p95 ≤ 45s failover)
  • Separate admin API prevents performance impact on public endpoints

Potential issues:

  1. Raft snapshot frequency: No specification of snapshot intervals which could impact memory usage and recovery time
  2. Lease renewal frequency: Design B lacks specification of renewal intervals vs. detection time trade-offs

Architectural Review

Design strengths:

  • Clear separation between control plane (Raft/Lease) and data plane (ev-node)
  • Flexible approach supporting both designs based on operational needs
  • UnsafeHead tracking ensures proper state continuity
  • Two-phase handoff (PrepareHandoff/CompleteHandoff) enables zero-downtime upgrades

Missing considerations:

  1. Observability gaps:

    • No specification for health check intervals
    • Missing metrics for handoff duration/success rates
    • No alerting thresholds defined
  2. Edge case handling:

    • What happens if both nodes fail simultaneously?
    • Recovery procedures when control plane data is corrupted
    • Network partition scenarios beyond what's briefly mentioned
  3. State synchronization: How does a standby node ensure it's caught up before promotion? The ADR mentions this briefly but lacks implementation details.

Test Coverage Assessment

Good coverage planned:

  • Kill tests for failover validation
  • Partition tests for split-brain prevention
  • Blue/green handoff testing
  • Long-duration outage drills

Missing test scenarios:

  1. Concurrent API calls: What happens if multiple StartSequencer calls happen simultaneously?
  2. Token/ticket replay attacks: Security testing for cryptographic components
  3. Performance degradation testing: How does the system behave under resource pressure?
  4. Configuration drift: Testing mismatched configurations between nodes

Technical Issues

  1. Proto field consistency: The HealthResponse message uses uint64 for l2_number while UnsafeHead uses the same type - good consistency.

  2. Missing validation: No specification of input validation (e.g., target_id format in PrepareHandoffRequest)

  3. Idempotency implementation: While idempotency keys are included, there's no specification of how they're validated or stored

Recommendations

High Priority:

  1. Add cryptographic specifications for lease_token and handoff_ticket
  2. Define break-glass procedures and access controls
  3. Specify health check and lease renewal intervals
  4. Add input validation requirements

Medium Priority:

  1. Define API versioning and backward compatibility strategy
  2. Add more detailed error scenarios and recovery procedures
  3. Specify observability metrics thresholds and alerting rules
  4. Add edge case handling procedures

Low Priority:

  1. Consider adding request/response size limits
  2. Add performance benchmarking requirements to testing plan

Overall Assessment

This is a solid architectural foundation for HA sequencer failover. The two-design approach is pragmatic, and the separation of concerns is well thought out. The main gaps are in security specifications and operational details, but the core design is sound and implementable.

The ADR demonstrates good understanding of distributed systems challenges and provides a clear path forward for implementation.

claude[bot] avatar Aug 25 '25 09:08 claude[bot]

Just to clarify, there will be n nodes from the same "operator".

They will be running all the time, only one will run as aggregator, the rest are not (but still run with the same priv node key).

In case someone detects that the aggregator has a problem, another of those n nodes that is not an aggregator will be notified via RAFT network, and will take the role of aggregator.

Questions

  • Is it possible now to run two nodes with same priv key even if one is not validator?
  • How a node becomes aggregator, by restarting it with the aggregator flag?

randygrok avatar Aug 25 '25 09:08 randygrok

  • Is it possible now to run two nodes with same priv key even if one is not validator?

yes it is, there will be a new api that enables the block production

  • How a node becomes aggregator, by restarting it with the aggregator flag?

the new api, its missing from the ADR, ill look at adding it into the ADR to provide more information

tac0turtle avatar Aug 25 '25 12:08 tac0turtle

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 72.30%. Comparing base (f483445) to head (5e0e181). :warning: Report is 149 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2598      +/-   ##
==========================================
- Coverage   72.41%   72.30%   -0.11%     
==========================================
  Files          72       72              
  Lines        7394     7406      +12     
==========================================
+ Hits         5354     5355       +1     
- Misses       1600     1611      +11     
  Partials      440      440              
Flag Coverage Δ
combined 72.30% <ø> (-0.11%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Aug 29 '25 08:08 codecov[bot]

Code Review for PR #2598: ADR - HA Failover

Thank you for this well-structured ADR proposing a high-availability failover system for the sequencer. This is a critical component for production readiness.

✅ Strengths

Architecture & Design:

  • Excellent decision to support both Rafted Conductor (for ≥3 nodes) and simple lease-based failover (for 2 nodes)
  • Clear separation of concerns with the Admin Control API on a separate port with mTLS
  • Well-defined fencing mechanisms to prevent split-brain scenarios
  • Thoughtful state machine diagrams illustrating the handoff and failover flows
  • Good alignment with industry patterns (citing OP Stack's similar approach)

Documentation Quality:

  • Comprehensive ADR following the standard template
  • Clear problem statement and alternative approaches considered
  • Detailed security considerations including mTLS and RBAC requirements
  • Well-structured protobuf definitions with appropriate error semantics

🔍 Suggestions & Considerations

1. Protobuf Improvements:

  • Consider adding field validation annotations (e.g., [(validate.rules).bytes.len = 32] for block_hash)
  • The timestamp field in UnsafeHead should consider using google.protobuf.Timestamp for better compatibility
  • Add field documentation comments directly in the proto file for better IDE support

2. Error Handling & Edge Cases:

  • The ADR mentions "Long-duration outage drills" but doesn't specify recovery behavior when all nodes have been down
  • Consider adding explicit handling for clock skew between nodes (especially important for lease expiry)
  • What happens if a node receives a CompleteHandoff but hasn't synced the referenced UnsafeHead?

3. Operational Concerns:

  • The 15s p50 failover target seems reasonable, but consider documenting the breakdown (detection time + election + startup)
  • Add metrics for handoff duration to validate zero-downtime claim
  • Consider a DryRun mode for PrepareHandoff to validate readiness without committing

4. Security Enhancements:

  • The lease_token and handoff_ticket are marked as bytes - consider documenting the expected format (JWT? signed protobuf?)
  • Add rate limiting considerations for the Admin API to prevent DoS
  • Consider adding an audit log service or structured logging requirements for all admin operations

5. Testing Gaps:

  • No mention of chaos engineering tests (network delays, clock skew, partial failures)
  • Missing performance benchmarks for Raft consensus overhead
  • Consider adding integration test specifications with the existing health service

6. Implementation Phasing:

  • Phase 1 mentions "sidecar skeletons" - consider if these should be in separate repos or part of ev-node
  • The ADR doesn't specify backward compatibility during the rollout phases

🚨 Potential Issues

  1. Resource Cleanup: The StopSequencer RPC doesn't specify cleanup behavior for in-flight transactions or DA submissions
  2. Observability Gap: No distributed tracing mentioned - critical for debugging failover latencies
  3. Configuration Management: No mention of how Raft cluster membership changes are handled (adding/removing nodes)

📈 Performance Considerations

  • Raft heartbeat intervals will need tuning based on network latency
  • Consider documenting the expected storage requirements for Raft snapshots
  • The protobuf design is efficient with appropriate use of primitive types

🔒 Security Review

  • ✅ Good: Separate Admin API with mTLS requirement
  • ✅ Good: Fencing tokens to prevent split-brain
  • ⚠️ Consider: Adding request signing/HMAC for replay protection
  • ⚠️ Consider: Encrypted storage for Raft logs if they contain sensitive metadata

🧪 Test Coverage Recommendations

Since this is an ADR without implementation, I recommend creating comprehensive test specifications for:

  • Unit tests for state machine transitions
  • Integration tests with mock DA and executor
  • E2E tests simulating various failure scenarios
  • Load tests to validate no performance regression
  • Security tests for authentication bypass attempts

Overall Assessment

This is a solid architectural proposal that demonstrates good understanding of distributed systems challenges. The dual-mode approach (Raft vs Lease) provides flexibility, and the detailed sequence diagrams show careful consideration of the handoff flows. With the suggested improvements, particularly around operational observability and comprehensive testing, this will provide a robust foundation for HA sequencing.

Recommendation: Approved with suggestions. Consider addressing the operational and testing gaps before moving to implementation Phase 2.

claude[bot] avatar Sep 25 '25 09:09 claude[bot]

closing this as alpe has taken over the design and implementation

tac0turtle avatar Nov 12 '25 09:11 tac0turtle