aptos-core icon indicating copy to clipboard operation
aptos-core copied to clipboard

[mempool] remove sequence_number_cache and instead store directly in transaction_store

Open bchocho opened this issue 3 years ago • 5 comments

Description

There were two issues previously:

  1. An account's sequence_number_cache could get GC'd before its transactions in mempool were sent to consensus. In this case because a cached sequence number does not exist, the transactions are never sent to consensus (get_batch checks the sequence number).
  2. The sequence_number_cache capacity is the same as the mempool, but it could have grown beyond this because sequence_number_cache entries are retained for accounts after transaction removal (until they expire). In this case old sequence_number_cache entries are removed and those transactions may never be sent to consensus (same reason as above).

There doesn't seem to be a need for a separate sequence_number_cache at all. When transactions are submitted to mempool, the latest account sequence number from state db is passed in. The argument is that this sequence number should never be stale compared to notifications coming in to mempool, as state db will be updated before notifications.

Thus we remove the problematic cache and simply store in the transaction_store. If there are no more transactions, the value is removed (same as the main transaction hashmap after #4041

Test Plan

Add test for case 2 that would fail in main. Otherwise rely on existing tests.


This change is Reviewable

bchocho avatar Sep 09 '22 22:09 bchocho

This is ready for review, but note it also contains changes in #4041 which is not yet merged.

@JoshLind @zekun000 I discussed the following with you offline:

There doesn't seem to be a need for a separate sequence_number_cache at all. When transactions are submitted to mempool, the latest account sequence number from state db is passed in. The argument is that this sequence number should never be stale compared to notifications coming in to mempool, as state db will be updated before notifications.

Would be great if you can think over whether this is really true (or any ideas how to test that)

bchocho avatar Sep 10 '22 01:09 bchocho

@davidiw

If mempool doesn't allow seq+1 then the user has to submit transactions one-by-one, waiting for E2E acknowledgement of commit in seq.

Every time there is an update to the account's sequence number (txn submission or commit), process_ready_transactions is called to update the priority index (where consensus pulls from) https://github.com/aptos-labs/aptos-core/blob/main/mempool/src/core_mempool/transaction_store.rs#L320-L329

bchocho avatar Sep 12 '22 17:09 bchocho

I meant should n+1 be allowed if n isn't in the current mempool.

davidiw avatar Sep 12 '22 20:09 davidiw

I think it makes sense in some scenarios. I honestly haven't decided whether it warrants the complexity.

  • Fullnodes behind a load balancer. If the load balancer routes consecutive sequence numbers to different fullnodes, allowing n+1 without n means the transactions should both succeed. (Without this feature, n+1 would fail. But, ideally the load balancer should be sticky.)
  • Transient API failures. If the client is sending a stream k of outstanding transactions, a single transient API failure will not require all k to be resent, only the failed one.

bchocho avatar Sep 12 '22 22:09 bchocho

Forge is running suite land_blocking on cce032024f7890952a389a89c613a6868e95d41d

Forge is running suite compat on testnet ==> cce032024f7890952a389a89c613a6868e95d41d

:white_check_mark: Forge suite land_blocking success on cce032024f7890952a389a89c613a6868e95d41d

performance benchmark with full nodes : 6732 TPS, 4429 ms latency, 6000 ms p99 latency,no expired txns
Test Ok

:white_check_mark: Forge suite compat success on testnet ==> cce032024f7890952a389a89c613a6868e95d41d

Compatibility test results for testnet ==> cce032024f7890952a389a89c613a6868e95d41d (PR)
1. Check liveness of validators at old version: testnet
compatibility::simple-validator-upgrade::liveness-check : 6991 TPS, 3362 ms latency, 4900 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: cce032024f7890952a389a89c613a6868e95d41d
compatibility::simple-validator-upgrade::single-validator-upgrade : 5271 TPS, 3588 ms latency, 4350 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: cce032024f7890952a389a89c613a6868e95d41d
compatibility::simple-validator-upgrade::half-validator-upgrade : 6337 TPS, 5282 ms latency, 9350 ms p99 latency,no expired txns
4. upgrading second batch to new version: cce032024f7890952a389a89c613a6868e95d41d
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6010 TPS, 3378 ms latency, 4200 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet ==> cce032024f7890952a389a89c613a6868e95d41d passed
Test Ok

Forge is running suite compat on testnet ==> 1cd2928fb9ad35b05fb7781b15170b25cbe4b6ff

Forge is running suite land_blocking on 1cd2928fb9ad35b05fb7781b15170b25cbe4b6ff

:white_check_mark: Forge suite land_blocking success on 1cd2928fb9ad35b05fb7781b15170b25cbe4b6ff

performance benchmark with full nodes : 7281 TPS, 4084 ms latency, 5400 ms p99 latency,no expired txns
Test Ok

:white_check_mark: Forge suite compat success on testnet ==> 1cd2928fb9ad35b05fb7781b15170b25cbe4b6ff

Compatibility test results for testnet ==> 1cd2928fb9ad35b05fb7781b15170b25cbe4b6ff (PR)
1. Check liveness of validators at old version: testnet
compatibility::simple-validator-upgrade::liveness-check : 6936 TPS, 3970 ms latency, 6200 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 1cd2928fb9ad35b05fb7781b15170b25cbe4b6ff
compatibility::simple-validator-upgrade::single-validator-upgrade : 5966 TPS, 4350 ms latency, 5600 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 1cd2928fb9ad35b05fb7781b15170b25cbe4b6ff
compatibility::simple-validator-upgrade::half-validator-upgrade : 5679 TPS, 4876 ms latency, 6100 ms p99 latency,no expired txns
4. upgrading second batch to new version: 1cd2928fb9ad35b05fb7781b15170b25cbe4b6ff
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6360 TPS, 3988 ms latency, 7900 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet ==> 1cd2928fb9ad35b05fb7781b15170b25cbe4b6ff passed
Test Ok

Forge is running suite compat on testnet ==> dce77a80c971a5210fedbcad5480d9a54c74b531

Forge is running suite land_blocking on dce77a80c971a5210fedbcad5480d9a54c74b531

:white_check_mark: Forge suite land_blocking success on dce77a80c971a5210fedbcad5480d9a54c74b531

performance benchmark with full nodes : 7637 TPS, 3892 ms latency, 5400 ms p99 latency,no expired txns
Test Ok

:white_check_mark: Forge suite compat success on testnet ==> dce77a80c971a5210fedbcad5480d9a54c74b531

Compatibility test results for testnet ==> dce77a80c971a5210fedbcad5480d9a54c74b531 (PR)
1. Check liveness of validators at old version: testnet
compatibility::simple-validator-upgrade::liveness-check : 6804 TPS, 3980 ms latency, 5600 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: dce77a80c971a5210fedbcad5480d9a54c74b531
compatibility::simple-validator-upgrade::single-validator-upgrade : 6303 TPS, 4278 ms latency, 5800 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: dce77a80c971a5210fedbcad5480d9a54c74b531
compatibility::simple-validator-upgrade::half-validator-upgrade : 5830 TPS, 4753 ms latency, 7400 ms p99 latency,no expired txns
4. upgrading second batch to new version: dce77a80c971a5210fedbcad5480d9a54c74b531
compatibility::simple-validator-upgrade::rest-validator-upgrade : 7181 TPS, 3757 ms latency, 5600 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet ==> dce77a80c971a5210fedbcad5480d9a54c74b531 passed
Test Ok

Forge is running suite land_blocking on 676f8cf020e19bdbd7ed174d254881e8b1ff87e1

Forge is running suite compat on testnet ==> 676f8cf020e19bdbd7ed174d254881e8b1ff87e1

:white_check_mark: Forge suite land_blocking success on 676f8cf020e19bdbd7ed174d254881e8b1ff87e1

performance benchmark with full nodes : 7533 TPS, 3945 ms latency, 6600 ms p99 latency,no expired txns
Test Ok

:white_check_mark: Forge suite compat success on testnet ==> 676f8cf020e19bdbd7ed174d254881e8b1ff87e1

Compatibility test results for testnet ==> 676f8cf020e19bdbd7ed174d254881e8b1ff87e1 (PR)
1. Check liveness of validators at old version: testnet
compatibility::simple-validator-upgrade::liveness-check : 7796 TPS, 3520 ms latency, 5400 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 676f8cf020e19bdbd7ed174d254881e8b1ff87e1
compatibility::simple-validator-upgrade::single-validator-upgrade : 5530 TPS, 4711 ms latency, 5900 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 676f8cf020e19bdbd7ed174d254881e8b1ff87e1
compatibility::simple-validator-upgrade::half-validator-upgrade : 5998 TPS, 4403 ms latency, 6900 ms p99 latency,no expired txns
4. upgrading second batch to new version: 676f8cf020e19bdbd7ed174d254881e8b1ff87e1
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6514 TPS, 4050 ms latency, 7600 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet ==> 676f8cf020e19bdbd7ed174d254881e8b1ff87e1 passed
Test Ok

github-actions[bot] avatar Sep 14 '22 23:09 github-actions[bot]