aptos-core icon indicating copy to clipboard operation
aptos-core copied to clipboard

Downstream mempools are unaware if a transaction is discarded upstream

Open zekun000 opened this issue 3 years ago • 5 comments

Failed txn (without enough gas) is stuck in mempool after it's been notified by consensus.

from logs, https://github.com/aptos-labs/aptos-core/blob/74e312ad2b19adfafb5606ee7066edab4ea053b1/mempool/src/core_mempool/transaction_store.rs#L400 is triggered.

However the txn is not cleared in the mempool. (mempool size is still 1) http://mon.dev.testnet.aptoslabs.com/grafana/d/mempool/mempool?orgId=1&from=1646460874257&to=1646461407666

zekun000 avatar Mar 05 '22 06:03 zekun000

So, through my testing. It looks like the mempool txn is cleared after some time. It's possible that it's resubmitted from other nodes into the mempool which keeps making it reappear (the metrics are periodic).

However, nothing clears the full nodes from these txns, so we'd need a response downstream to full nodes, but that could be messy considering it's another spot to get wrong.

gregnazario avatar Mar 10 '22 17:03 gregnazario

@zekun000 is this at the validator or fullnode? Is it out of gas specifically where the prologue would fail? Assuming the latter part, then I would like to propose the following:

I think that this might be something to initiate more on the client side and not necessarily on the validator, consider the following:

  • Client performs a check on a transaction
  • If there's a transaction and pending, re-run prologue
  • If prologue error, clean up txn
  • Send user either the no txn, pending, prologue error, etc

Similarly upon transaction submission, if there's an existing transaction in mempool, we can evaluate if the current one still would pass prologue and clean it up if not. We can then insert the new one if it would or we return the appropriate error. This way this new one should propagate upwards -- though you know, I don't think we ever tested the concept of a user replacing a txn even with the quirks of the current code.

, but I think one thing that may make sense is to make this a matter for the client / fullnode to figure out. Namely, if a user checks on the status of a "consensus-ready" transaction, it could be validated that it is still capable of being run on the current storage state. With eventual consistency in mind, this would seem to be much more performant means than trying to route information across the distributed mempool. The other thing that

davidiw avatar Mar 19 '22 20:03 davidiw

@davidiw the description is probably confusing since that's before I added the gas check in progloue.

the case we want to fix is that the txn passes prologue checks but fails in the actual execution so the mempool only gets notified by consensus and has no way to propagate this info to the network.

zekun000 avatar Mar 20 '22 21:03 zekun000

Per discussion offline, this is somewhat inclusive of just improving the end-to-end flows of transactions from insert at REST / mempool to the consensus validator network until it is either committed or discarded. At the same time, most of the reasons for a transaction being discarded are due to an expected narrow usage and other distributed systems will probably have less signal than ours.

davidiw avatar Mar 21 '22 16:03 davidiw

This issue is stale because it has been open 45 days with no activity. Remove the stale label or comment - otherwise this will be closed in 15 days.

github-actions[bot] avatar Jan 23 '23 02:01 github-actions[bot]

This issue is stale because it has been open 45 days with no activity. Remove the stale label or comment - otherwise this will be closed in 15 days.

github-actions[bot] avatar Apr 28 '23 01:04 github-actions[bot]