
Ability to reject requests if too many transactions pending

Open letmaik opened this issue 3 years ago • 1 comment

When too many transactions are PENDING, there is a risk of running out of memory, since those transactions are kept in memory. Getting into this state may be caused by a combination of factors, such as a spike in requests combined with temporarily slow I/O.

To avoid an out-of-memory situation (i.e. a node crash), a common solution is to return HTTP 503 (Service Unavailable) to clients until the service is less busy.
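For illustration, a rejection response could look like the following; the optional Retry-After header (defined by the HTTP spec for 503) hints to clients when to retry:

```http
HTTP/1.1 503 Service Unavailable
Retry-After: 2
Content-Length: 0
```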

This issue discusses which metric should determine the busy state.

  1. A generic metric would be to track the process memory currently in use and let the operator configure a threshold beyond which requests are rejected. This metric is more general and would apply to any source of high memory consumption (including the historical query cache and indexing). Obtaining the process memory usage might be possible via snmalloc statistics but would have to be validated in different environments (SGX, virtual). It also seems viable only if the allocator gives pages back to the OS.

  2. A more direct metric would be to track the number of pending transactions and let the operator configure a threshold on that count. This threshold would need to be higher than the ledger_signatures.tx_count configuration value. With this metric, other sources of memory consumption, like the historical query cache, indexing, and endpoint function code, have to be considered separately. A sketch of this approach follows the list.
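As a rough sketch of metric 2 (not actual CCF code; `BusyPolicy`, `max_pending_txs`, and `pending_tx_count` are hypothetical names used only to illustrate the idea):

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical sketch of metric 2: reject new write requests once the
// number of in-memory PENDING transactions exceeds an operator-configured
// threshold. None of these names exist in CCF.
struct BusyPolicy
{
  // Operator-configured limit; would need to be set higher than
  // ledger_signatures.tx_count so signature transactions can still be emitted.
  std::size_t max_pending_txs;

  // Assumed to be updated by the consensus/ledger layer as transactions
  // move from PENDING to COMMITTED.
  std::atomic<std::size_t> pending_tx_count{0};

  bool should_reject() const
  {
    return pending_tx_count.load(std::memory_order_relaxed) > max_pending_txs;
  }
};
```

Under metric 1, `should_reject()` would instead compare allocator-reported process memory against an operator-configured byte threshold.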

Requests to the operator (/node/) and governance (/gov/) endpoints should likely never be rejected, to ensure that the network can still be managed and observed.
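A sketch of such an exemption check (hypothetical helper, path prefixes hard-coded for illustration; assumes C++20 for `starts_with`):

```cpp
#include <string_view>

// Hypothetical helper: operator and governance endpoints are always served,
// even when the node is busy, so the network stays manageable and observable.
bool is_exempt_from_rejection(std::string_view path)
{
  return path.starts_with("/node/") || path.starts_with("/gov/");
}

// Usage sketch:
//   if (!is_exempt_from_rejection(path) && policy.should_reject())
//     respond with 503 Service Unavailable;
```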

Another question is whether read-only endpoints should be affected in the same way as write endpoints, since the former do not generate transactions. However, they might still influence memory consumption through the historical query cache, indexing, and anything else the endpoint function does. My feeling is that it is not worth treating them differently.

I think metric 2 is preferable for the time being because it is more portable and easier to implement. It also forces the developer/operator to think more carefully about the other components consuming memory. Metric 1 may lead to a situation where a node gets stuck at high memory usage because the historical query cache or indexing was badly configured.

letmaik avatar May 20 '22 13:05 letmaik

@letmaik let me know if you want more stats in snmalloc. It is a bit easier to track things now with 0.6.0. I can add more stuff if it would be helpful.

mjp41 avatar May 20 '22 19:05 mjp41