feat: a sequential batch TSS keysign scheduler for EVM chain

Open ws4charlie opened this issue 2 months ago • 3 comments

Description

Remaining work:

  • [ ] unit tests

This PR implements a sequential batch TSS keysign scheduler for EVM chain, improving outbound speed by 4~5X.

  1. Decouple the CCTX processing goroutine scheduleCCTX from the TSS keysign scheduler goroutine scheduleKeysign.

  2. Use an artificial (deterministic) height instead of the real ZetaChain height to create TSS keysign requests. This improves outbound performance by 2X.

  3. Schedule TSS keysign for batched digests instead of a single digest. This reduces the number of keysign requests from multiple to one (per chain).

  4. Schedule TSS keysign by nonce (batched) sequentially without waiting intervals, replacing the existing interval-based logic zeta_height % interval == cctx_nonce % interval. This improves outbound performance by 2~3X.
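
To make the difference between the two scheduling rules concrete, here is a minimal sketch. Only the expression zeta_height % interval == cctx_nonce % interval comes from the description above; the helper names blocksToWait and nextBatch, and the idea of a maxBatch cap, are assumptions for illustration and are not taken from the PR.

```go
package main

import "fmt"

// intervalReady mirrors the interval-based rule quoted above: a CCTX nonce
// is only eligible for keysign at heights that share its remainder.
func intervalReady(zetaHeight, nonce, interval uint64) bool {
	return zetaHeight%interval == nonce%interval
}

// blocksToWait is a hypothetical helper: the number of blocks after
// zetaHeight until intervalReady first holds for the given nonce.
func blocksToWait(zetaHeight, nonce, interval uint64) uint64 {
	return (nonce%interval + interval - zetaHeight%interval) % interval
}

// nextBatch is a hypothetical sequential selector: starting at nextNonce it
// collects consecutive pending nonces (at most maxBatch) with no height gating.
func nextBatch(nextNonce uint64, pending map[uint64]bool, maxBatch int) []uint64 {
	batch := make([]uint64, 0, maxBatch)
	for n := nextNonce; len(batch) < maxBatch && pending[n]; n++ {
		batch = append(batch, n)
	}
	return batch
}

func main() {
	// Interval rule: nonce 7 with interval 10 must wait until height 107.
	fmt.Println(intervalReady(100, 7, 10), blocksToWait(100, 7, 10))

	// Sequential rule: nonces 5, 6 and 7 are signed together right away;
	// nonce 9 waits only because nonce 8 is not pending yet.
	pending := map[uint64]bool{5: true, 6: true, 7: true, 9: true}
	fmt.Println(nextBatch(5, pending, 10))
}
```

Under the interval rule a nonce can sit idle for up to interval - 1 blocks, while the sequential selector only stops at a gap in the nonce sequence.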

The eth withdraw stress test result before: image

The result after: image

Closes https://github.com/zeta-chain/node/issues/4436

How Has This Been Tested?

  • [x] Tested CCTX in localnet
  • [ ] Tested in development environment
  • [ ] Go unit tests
  • [ ] Go integration tests
  • [ ] Tested via GitHub Actions

ws4charlie avatar Nov 12 '25 23:11 ws4charlie

[!IMPORTANT]

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This pull request introduces a batched TSS keysign system to improve EVM chain outbound performance. It refactors signing workflows to use per-nonce digest caching, batch multiple keysigns into a single TSS operation, and eliminates height-based scheduling in favor of a nonce-driven approach with stale-block detection.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Cantor Pairing Utilities: pkg/math/pairing.go, pkg/math/pairing_test.go | Introduces Cantor pairing functions (CantorPair, CantorUnpair) and MaxPairValue constant for mapping uint32 pairs to uint64 values, with comprehensive round-trip testing. |
| Base Signer Batch Infrastructure: zetaclient/chains/base/signer.go, zetaclient/chains/base/signer_batch_info.go, zetaclient/chains/base/signer_batch_sign.go | Adds per-nonce TSS tracking (tssKeysignInfoMap, nextTSSNonce), introduces TSSKeysignInfo and TSSKeysignBatch structures, implements batch accumulation logic, readiness checks, signing workflows, and nonce-to-batch mapping utilities. Changes mu from Mutex to RWMutex for concurrent access. |
| EVM Signer Refactoring: zetaclient/chains/evm/signer/outbound_data.go, zetaclient/chains/evm/signer/outbound_data_test.go, zetaclient/chains/evm/signer/sign.go, zetaclient/chains/evm/signer/sign_test.go, zetaclient/chains/evm/signer/signer.go, zetaclient/chains/evm/signer/signer_admin.go, zetaclient/chains/evm/signer/signer_admin_test.go, zetaclient/chains/evm/signer/signer_test.go, zetaclient/chains/evm/signer/v2_sign.go | Removes height parameter from NewOutboundData and Sign operations. Replaces TSS-based signing with GetSignatureOrAddDigest flow, introducing ErrWaitForSignature for async keysign awaiting. Adds NextTSSNonce method. Updates test infrastructure with digest-based mocking and signature preloading for all signing paths. |
| EVM Chain Scheduler Refactoring: zetaclient/chains/evm/evm.go | Introduces scheduleKeysign method with batch preparation, readiness checks, and sequential batch signing. Refactors scheduleCCTX to use NextTSSNonce instead of tracker-based nonce heuristics, adds stale-block-event skipping logic, simplifies conflict checking. Removes getTrackerSet helper and tracker-based gating. |
| Client and Repository Interface Extensions: zetaclient/chains/tssrepo/client.go, zetaclient/chains/zrepo/client.go, zetaclient/chains/zrepo/zrepo.go, zetaclient/dry/dry.go, zetaclient/testutils/mocks/tss.go, zetaclient/testutils/mocks/zetacore.go, zetaclient/mode/chaos/generated.go, zetaclient/tss/service.go | Adds IsSignatureCached method to TSSClient interface and implementations (TSSService, mocks, chaos). Adds GetBlockHeight method to ZetacoreReaderClient interface and implementations (ZetaRepo, mocks, chaos). Updates mock parameter naming for clarity. |
| Metrics and Monitoring: zetaclient/metrics/metrics.go | Introduces NextTSSNonce gauge metric (per-chain) for observability of TSS account nonce state. |
| Test Performance and Configuration: cmd/zetae2e/local/performance.go, zetaclient/mode/chaos/generate/sample.json | Moves timer start in withdraw performance test to after deposit step, measuring only keysign execution time. Adds GetBlockHeight configuration entry to chaos generator. |
| Documentation: changelog.md | Documents new batch keysign feature for EVM performance improvement. |
| Observer Comments: zetaclient/chains/evm/observer/outbound.go | Adds explanatory comment about batch keysign usage and deprecation of continueKeysign flag. |
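
For readers unfamiliar with the Cantor pairing mentioned in the first row, the sketch below shows the standard formula, pair(x, y) = (x+y)(x+y+1)/2 + y, and its inverse. It is an illustration only: the lower-case names, the maxSum bound and the error value are assumptions, and the PR's CantorPair, CantorUnpair and MaxPairValue may have different signatures and limits.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// maxSum is a hypothetical bound chosen so that every intermediate value
// fits in uint64; the PR's MaxPairValue constant may be defined differently.
const maxSum = uint64(math.MaxUint32)

var errOutOfRange = errors.New("x + y exceeds the supported range")

// triangular returns w*(w+1)/2, halving the even factor first so the
// product never overflows for the range used here.
func triangular(w uint64) uint64 {
	if w%2 == 0 {
		return (w / 2) * (w + 1)
	}
	return w * ((w + 1) / 2)
}

// cantorPair maps (x, y) to a unique uint64 with the Cantor pairing function.
func cantorPair(x, y uint32) (uint64, error) {
	s := uint64(x) + uint64(y)
	if s > maxSum {
		return 0, errOutOfRange
	}
	return triangular(s) + uint64(y), nil
}

// cantorUnpair inverts cantorPair for values that cantorPair produced.
func cantorUnpair(z uint64) (x, y uint32) {
	// Estimate w from z ~ w*w/2, then correct any floating-point error.
	w := uint64(math.Sqrt(2 * float64(z)))
	for triangular(w) > z {
		w--
	}
	for triangular(w+1) <= z {
		w++
	}
	yy := z - triangular(w)
	return uint32(w - yy), uint32(yy)
}

func main() {
	z, err := cantorPair(2, 3)
	fmt.Println(z, err) // 18 <nil>

	x, y := cantorUnpair(z)
	fmt.Println(x, y) // 2 3
}
```

The range check matters because pairing two arbitrary uint32 values can exceed 64 bits, which is presumably why the PR defines a maximum pair value.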

Sequence Diagram(s)

sequenceDiagram
    participant Scheduler as EVM Scheduler
    participant Signer as Base Signer
    participant Batch as Batch Manager
    participant TSS as TSS Service
    participant Cache as Signature Cache
    
    Scheduler->>Signer: PrepareForKeysign(zetaHeight, nextNonce)
    Signer->>Signer: Check stale blocks
    Signer->>Signer: Clean stale keysign info
    Signer-->>Scheduler: Ready: bool
    
    alt Batch ready to sign
        Scheduler->>Signer: GetKeysignBatch(batchNumber)
        Signer->>Batch: Collect digests in nonce range
        Batch-->>Signer: TSSKeysignBatch
        Signer->>Signer: SignBatch(batch)
        Signer->>Signer: Compute keysignHeight via Cantor pairing
        Signer->>TSS: SignBatch(digests, height)
        TSS->>Cache: Store signatures
        Signer->>Signer: AddBatchSignatures(batch, sigs)
    else Waiting for signatures
        Signer->>Cache: GetSignatureOrAddDigest(nonce, digest)
        Cache-->>Signer: (sig [65]byte, found bool)
        alt Found in cache
            Signer-->>Scheduler: Success
        else Not found
            Signer-->>Scheduler: ErrWaitForSignature
        end
    end

sequenceDiagram
    participant Client as EVM Client
    participant Outbound as OutboundData
    participant TSS as Signer (TSS)
    participant Batch as Batch Signing
    
    rect rgb(200, 200, 255)
    note over Client,TSS: Old Flow: Height-based scheduling
    Client->>Outbound: NewOutboundData(ctx, cctx, height, logger)
    Outbound-->>Client: OutboundData with height field
    Client->>TSS: Sign(ctx, data, ..., height)
    TSS->>TSS: Per-nonce TSS keysign
    end
    
    rect rgb(200, 255, 200)
    note over Client,Batch: New Flow: Batch and digest-based
    Client->>Outbound: NewOutboundData(ctx, cctx, logger)
    Outbound-->>Client: OutboundData (height from ObservedExternalHeight)
    Client->>TSS: GetSignatureOrAddDigest(nonce, digest)
    alt Signature cached
        TSS-->>Client: (sig, true)
    else Waiting on batch
        TSS-->>Client: (empty, false)
        Client->>Batch: PrepareForKeysign()
        Batch->>Batch: Accumulate nonces
        Batch->>TSS: SignBatch(digests[])
        TSS->>TSS: Single TSS keysign for batch
        Batch->>Batch: Cache all signatures
    end
    end
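
The "get signature or add digest" step in the diagrams can be sketched as a small per-nonce cache. This is a simplified illustration under assumptions: the type and method names below are hypothetical stand-ins, the digest is assumed to be 32 bytes, and the real Signer in the PR may track more state and handle stale entries differently.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errWaitForSignature stands in for the ErrWaitForSignature signal named in
// the walkthrough; the real error value lives in the PR, not here.
var errWaitForSignature = errors.New("signature not ready, waiting for batch keysign")

// keysignInfo is a hypothetical per-nonce record: the digest to sign and,
// once a batch keysign completes, its 65-byte signature.
type keysignInfo struct {
	digest    [32]byte
	signature [65]byte
	signed    bool
}

type digestCache struct {
	mu    sync.RWMutex
	infos map[uint64]*keysignInfo
}

func newDigestCache() *digestCache {
	return &digestCache{infos: make(map[uint64]*keysignInfo)}
}

// getSignatureOrAddDigest returns the cached signature for the nonce if the
// same digest was already signed; otherwise it records the digest so that a
// later batch keysign can pick it up.
func (c *digestCache) getSignatureOrAddDigest(nonce uint64, digest [32]byte) ([65]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if info, ok := c.infos[nonce]; ok && info.digest == digest {
		return info.signature, info.signed
	}
	// New nonce or changed digest: (re)register it as unsigned.
	c.infos[nonce] = &keysignInfo{digest: digest}
	return [65]byte{}, false
}

// addSignature stores a signature produced by a batch keysign.
func (c *digestCache) addSignature(nonce uint64, sig [65]byte) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	info, ok := c.infos[nonce]
	if !ok {
		return false
	}
	info.signature, info.signed = sig, true
	return true
}

// pendingCount counts digests still waiting for a signature (read lock only).
func (c *digestCache) pendingCount() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	n := 0
	for _, info := range c.infos {
		if !info.signed {
			n++
		}
	}
	return n
}

// sign is what a caller on the CCTX side might do: ask for the signature and
// back off with errWaitForSignature until the batch scheduler has signed.
func sign(c *digestCache, nonce uint64, digest [32]byte) ([65]byte, error) {
	sig, found := c.getSignatureOrAddDigest(nonce, digest)
	if !found {
		return sig, errWaitForSignature
	}
	return sig, nil
}

func main() {
	cache := newDigestCache()
	digest := [32]byte{0x01}

	_, err := sign(cache, 7, digest)
	fmt.Println(err, cache.pendingCount())

	cache.addSignature(7, [65]byte{0xaa}) // simulated batch keysign result
	sig, err := sign(cache, 7, digest)
	fmt.Println(sig[0] == 0xaa, err)
}
```

In this sketch the caller never blocks on TSS: it either gets a cached signature or an error telling it to retry later, which is the decoupling the diagrams describe.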

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • Concurrency safeguards in batch accumulation: The new RWMutex usage in Signer and per-nonce map (tssKeysignInfoMap) require careful review to ensure lock ordering and prevent deadlocks during concurrent batch operations.
  • Cantor pairing correctness: The bidirectional mapping (NonceToBatchNumber, BatchNumberToRange) and KeysignHeight computation using Cantor pairing must be validated for round-trip correctness and absence of collisions.
  • Scheduler control flow changes: The replacement of tracker-based nonce heuristics with NextTSSNonce and introduction of stale-block-event detection fundamentally alters CCTX processing order and timing. Verify this does not break existing ordering guarantees or introduce race conditions.
  • Asynchronous signature awaiting: The new ErrWaitForSignature signal and GetSignatureOrAddDigest flow introduce eventual-consistency semantics. Verify retry logic, backpressure handling, and that outbounds are not silently dropped or duplicated.
  • Test infrastructure alignment: Digest-based mocking across multiple signing paths (sign_test.go, signer_admin_test.go, signer_test.go) must be consistent to avoid false negatives masking real signing failures.
  • Interface compliance: New interface methods (IsSignatureCached, GetBlockHeight) added to TSSClient and ZetacoreReaderClient require verification that all implementations (service, mocks, chaos, dry) are correctly updated.
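
As an aid for the round-trip check suggested in the Cantor pairing bullet, the sketch below shows one way a nonce-to-batch mapping and its inverse could be validated. It assumes a fixed batch width (batchSize) and plain integer division; the PR's NonceToBatchNumber and BatchNumberToRange may be defined differently, so treat the names and the constant here as hypothetical.

```go
package main

import "fmt"

// batchSize is a hypothetical fixed batch width; the PR may size batches
// differently.
const batchSize uint64 = 10

// nonceToBatchNumber maps a nonce to the batch that contains it.
func nonceToBatchNumber(nonce uint64) uint64 {
	return nonce / batchSize
}

// batchNumberToRange returns the inclusive nonce range covered by a batch.
func batchNumberToRange(batch uint64) (low, high uint64) {
	low = batch * batchSize
	return low, low + batchSize - 1
}

// roundTripOK reports whether every nonce in [0, limit) falls inside the
// range of its own batch and both range endpoints map back to that batch.
func roundTripOK(limit uint64) bool {
	for nonce := uint64(0); nonce < limit; nonce++ {
		b := nonceToBatchNumber(nonce)
		low, high := batchNumberToRange(b)
		if nonce < low || nonce > high {
			return false
		}
		if nonceToBatchNumber(low) != b || nonceToBatchNumber(high) != b {
			return false
		}
	}
	return true
}

func main() {
	low, high := batchNumberToRange(nonceToBatchNumber(23))
	fmt.Println(low, high) // 20 29
	fmt.Println(roundTripOK(1000))
}
```

A property check of this shape is cheap to run in a unit test and catches off-by-one errors at batch boundaries.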

Possibly related PRs

  • zeta-chain/node#2357: Involves base Signer and batch signing surface changes, including integration of signer fields and batch SignBatch method introduction with related refactoring.

Suggested reviewers

  • skosito
  • lumtis
  • brewmaster012
  • kingpinXD

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 29.41% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main architectural change: implementing a sequential batch TSS keysign scheduler for EVM chains, which is the primary objective of the changeset. |
| Description check | ✅ Passed | The description is comprehensive and covers the four key architectural changes, performance improvements, test results, and linked issue. However, testing coverage is incomplete: only localnet testing is checked while Go unit/integration tests and GitHub Actions are unchecked. |
| Linked Issues check | ✅ Passed | The changeset directly addresses issue #4436 by implementing batched keysign scheduling and deterministic height computation to reduce excessive TSS requests and improve throughput consistency. |
| Out of Scope Changes check | ✅ Passed | All changes remain within scope: EVM batch keysign implementation, supporting utilities (Cantor pairing, batch info structures), interface extensions (IsSignatureCached, GetBlockHeight), and related test updates. The changelog entry is documentation. |

coderabbitai[bot] avatar Nov 12 '25 23:11 coderabbitai[bot]

!!!WARNING!!! nosec detected in the following files: pkg/math/pairing.go, zetaclient/chains/base/signer_batch_info.go, zetaclient/chains/base/signer_batch_info_test.go, zetaclient/chains/base/signer_batch_sign.go, zetaclient/chains/base/signer_batch_sign_test.go, zetaclient/chains/evm/evm.go, zetaclient/tss/service.go

Be very careful about using #nosec in code. Although it can be a quick way to suppress security warnings and move forward with development, it should be employed with caution. Suppressing warnings with #nosec can hide potentially serious vulnerabilities. Only use #nosec when you're absolutely certain that the security issue is either a false positive or has been mitigated in another way.

Only suppress a single rule (or a specific set of rules) within a section of code, while continuing to scan for other problems. To do this, list the rule(s) to be suppressed within the #nosec annotation, e.g. /* #nosec G401 */ or //#nosec G201 G202 G203. Broad #nosec annotations should be avoided, as they can hide other vulnerabilities. The CI will block you from merging this PR until you remove #nosec annotations that do not target specific rules.

Pay extra attention to the way #nosec is being used in the files listed above.
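
As an illustration of a rule-scoped suppression, the sketch below annotates a single narrowing conversion after an explicit range check. The function toUint32 is hypothetical and not from the PR, and the rule ID G115 (integer overflow on conversion) is an assumption about the gosec version in use; substitute whichever rule the scanner actually reports.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// toUint32 narrows n to uint32 and returns an error instead of silently
// truncating when the value does not fit.
func toUint32(n uint64) (uint32, error) {
	if n > math.MaxUint32 {
		return 0, errors.New("value does not fit in uint32")
	}
	// The suppression names one rule only, so other findings on this line
	// would still be reported.
	return uint32(n), nil // #nosec G115 -- range checked above
}

func main() {
	v, err := toUint32(42)
	fmt.Println(v, err)

	_, err = toUint32(1 << 40)
	fmt.Println(err)
}
```

Keeping the check and the annotation next to each other makes it easy for a reviewer to confirm that the warning really is mitigated.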

github-actions[bot] avatar Nov 12 '25 23:11 github-actions[bot]

Codecov Report

❌ Patch coverage is 70.63712% with 106 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.85%. Comparing base (d8a6ec2) to head (afae2de).
⚠️ Report is 1 commit behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| zetaclient/chains/evm/evm.go | 27.27% | 35 Missing and 5 partials ⚠️ |
| zetaclient/chains/evm/signer/signer.go | 24.00% | 17 Missing and 2 partials ⚠️ |
| zetaclient/chains/base/signer.go | 10.00% | 9 Missing ⚠️ |
| zetaclient/chains/base/signer_batch_sign.go | 94.30% | 6 Missing and 3 partials ⚠️ |
| zetaclient/chains/evm/signer/v2_signer.go | 0.00% | 8 Missing ⚠️ |
| zetaclient/mode/chaos/generated.go | 0.00% | 8 Missing ⚠️ |
| zetaclient/chains/evm/signer/v2_sign.go | 0.00% | 4 Missing ⚠️ |
| zetaclient/chains/base/signer_batch_info.go | 93.61% | 1 Missing and 2 partials ⚠️ |
| zetaclient/chains/evm/signer/sign.go | 75.00% | 2 Missing ⚠️ |
| zetaclient/chains/zrepo/zrepo.go | 0.00% | 2 Missing ⚠️ |

... and 1 more

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #4427      +/-   ##
===========================================
+ Coverage    64.71%   64.85%   +0.14%     
===========================================
  Files          469      472       +3     
  Lines        28574    28797     +223     
===========================================
+ Hits         18491    18677     +186     
- Misses        9064     9096      +32     
- Partials      1019     1024       +5     

| Files with missing lines | Coverage Δ |
|---|---|
| pkg/math/pairing.go | 100.00% <100.00%> (ø) |
| pkg/scheduler/context.go | 22.72% <100.00%> (ø) |
| pkg/scheduler/tickers.go | 81.96% <100.00%> (ø) |
| zetaclient/chains/evm/observer/outbound.go | 63.28% <ø> (ø) |
| zetaclient/chains/evm/signer/outbound_data.go | 66.01% <100.00%> (-0.33%) ⬇️ |
| zetaclient/chains/evm/signer/signer_admin.go | 83.69% <100.00%> (-1.31%) ⬇️ |
| zetaclient/metrics/metrics.go | 68.08% <ø> (ø) |
| zetaclient/tss/service.go | 50.26% <100.00%> (+1.93%) ⬆️ |
| zetaclient/chains/evm/signer/sign.go | 65.04% <75.00%> (-0.41%) ⬇️ |
| zetaclient/chains/zrepo/zrepo.go | 42.01% <0.00%> (-0.51%) ⬇️ |

... and 9 more

codecov[bot] avatar Nov 22 '25 02:11 codecov[bot]