Investigate validator memory increase
Relevant discussions: https://mysten-labs.slack.com/archives/C02GD7J9HUM/p1656435768898279 https://mysten-labs.slack.com/archives/C03GKUEA5PF/p1656639974689169
Insights
From investigations in staging plus memory profiling, we have, at a high level, a few insights into a couple of suspected issues that amplify each other:
- a Narwhal node restart leads to a storm of requests for batches & certificates (A). That storm is overly large mostly because the Narwhal node doesn't have crash-recovery yet (A.1), and because of the speed at which the restarted node reissues its requests (A.2). This results in a large number of network messages in response. Those responses are implemented in a way that over-relies on the tokio executor and consumes memory at the responders (B); a sketch of one way to bound this follows the list.
- Historically, as well as now with gRPC, node-to-node connections are "1-way": node a -> node b is one connection and node b -> node a is a different one. When a restarts, this can lead to a situation where a can see b but b can't see a, and pending messages pile up at b (C).
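To illustrate (B): a minimal sketch, not Narwhal's actual code, of how a responder can bound the number of in-flight responses with a tokio Semaphore, so that a storm of requests no longer translates directly into memory growth on the responding node. The request/response types and the storage lookup are hypothetical stand-ins.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical stand-ins for Narwhal's batch request/response messages.
struct BatchRequest { digest: [u8; 32] }
struct BatchResponse { payload: Vec<u8> }

async fn fetch_batch(_req: &BatchRequest) -> BatchResponse {
    // Placeholder for the real storage lookup.
    BatchResponse { payload: Vec::new() }
}

/// Serve a stream of incoming batch requests, but never hold more than
/// `max_in_flight` responses in memory at once. Without the semaphore,
/// a restart-induced storm spawns one task (and one buffered response)
/// per request, so responder memory grows with the size of the storm.
async fn serve_requests(
    mut incoming: tokio::sync::mpsc::Receiver<BatchRequest>,
    max_in_flight: usize,
) {
    let limiter = Arc::new(Semaphore::new(max_in_flight));
    while let Some(req) = incoming.recv().await {
        // Wait for a free slot instead of spawning unconditionally.
        let permit = limiter.clone().acquire_owned().await.expect("semaphore closed");
        tokio::spawn(async move {
            let _resp = fetch_batch(&req).await;
            // ... send _resp back to the requester ...
            drop(permit); // frees the slot once the response is handed off
        });
    }
}
```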
Resources
Repro instructions for problem (1.): https://gist.github.com/mystenmark/1c923ae86665f666595f359336feedb4
Mitigations: https://github.com/MystenLabs/narwhal/pull/462 (A.2) https://github.com/MystenLabs/narwhal/pull/463 (B) https://github.com/MystenLabs/narwhal/pull/465 (A)
Better operations (logging, profiling, observability): https://github.com/MystenLabs/sui/pull/2984 https://github.com/MystenLabs/narwhal/pull/461 https://github.com/MystenLabs/narwhal/pull/426
Follow-up work
- confirm problem (C) through packet capture, and investigate shortening the TTL of connections in mysten-network to mitigate it (a rough sketch of the TTL idea follows this list),
- resolve (A.1) by addressing https://github.com/MystenLabs/sui/issues/5200,
- continue the line of work on memory profiling,
- run staging with the above mitigations and confirm they significantly relieve the issue.
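A rough sketch of the connection-TTL idea from the first follow-up item, assuming a tonic gRPC client (the wrapper and its names are hypothetical, not mysten-network's API): the sender re-dials a peer once its cached channel exceeds a maximum age, so a restarted peer gets a fresh connection instead of messages piling up behind a stale one.

```rust
use std::time::Duration;
use tokio::time::Instant;
use tonic::transport::{Channel, Endpoint};

/// Hypothetical wrapper that re-dials the peer once the underlying
/// channel is older than `ttl`, approximating a connection TTL.
struct TtlChannel {
    endpoint: Endpoint,
    ttl: Duration,
    current: Option<(Channel, Instant)>,
}

impl TtlChannel {
    fn new(addr: &'static str, ttl: Duration) -> Self {
        Self {
            endpoint: Endpoint::from_static(addr),
            ttl,
            current: None,
        }
    }

    /// Return a channel, reconnecting if the cached one has expired.
    async fn get(&mut self) -> Result<Channel, tonic::transport::Error> {
        let expired = match &self.current {
            Some((_, created)) => created.elapsed() > self.ttl,
            None => true,
        };
        if expired {
            let chan = self.endpoint.connect().await?;
            self.current = Some((chan, Instant::now()));
        }
        Ok(self.current.as_ref().unwrap().0.clone())
    }
}
```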
@huitseeker and @velvia can update later.
Update July 18:
- the validator memory increase seen when nodes are restarted is fixed as of 0.6.0 (A.2 and B are solved)
- problem (A.1) has a prototype fix slated for 0.6.1, undergoing testing.
- problem (C) is still in the same status (@tharbert to update)
- the gradual memory increase in validators (unrelated to restarts) is not reproducible without load. We will need to instrument the nodes to profile in production; the follow-up is https://github.com/MystenLabs/sui/issues/2974. @velvia to document status here.
The next step here is to generate load and try to reproduce the "gradual memory increase". We likely need to do this in a deployed environment (a rough sketch of such a load loop follows the steps below).
The goal is to shorten, as much as possible, the time it takes to reproduce running out of memory. This works if the memory increase is tied to transaction activity, since we can then inject more transactions; if it is purely a function of time, we would simply need more time to reproduce.
If it is instead tied to specific events such as restarts and failures, then we need to simulate those.
Steps:
- Check current status and how much load can be generated locally
- Look into remote load generator, make sure it can run for long periods of time
- Look into deploying a load generator + devnet setup, with telemetry and profiling
- Run load and try to simulate failures and memory issues (and see what other issues there are)
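A minimal sketch of the kind of load loop described above, assuming a hypothetical submit_transaction function in place of whatever client we end up using; the only point is to drive a steady, configurable transaction rate for long enough to see whether memory growth tracks transaction volume.

```rust
use std::time::Duration;
use tokio::time::interval;

// Hypothetical stand-in for submitting one transaction through the real client.
async fn submit_transaction(i: u64) {
    let _ = i; // ... build and send a transfer, wait for the effects ...
}

/// Drive `tx_per_sec` transactions per second until `total` have been sent.
/// Run this against a deployed network while watching the memory metrics.
async fn generate_load(tx_per_sec: u64, total: u64) {
    let mut ticker = interval(Duration::from_secs(1));
    let mut sent = 0u64;
    while sent < total {
        ticker.tick().await;
        // Fire this second's batch concurrently, then wait for it to finish.
        let batch: Vec<_> = (0..tx_per_sec)
            .map(|i| tokio::spawn(submit_transaction(sent + i)))
            .collect();
        for handle in batch {
            let _ = handle.await;
        }
        sent += tx_per_sec;
    }
}
```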
fullnode is worth looking into as well
I think a lot of the recent issues have been driven entirely by the handle_batch_streaming implementation, plus bugs in the fullnode that make it call that method too frequently.
Fixes for both issues should be ready this week.
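Not the actual fixes referenced above; just a sketch of the general throttling idea on the fullnode side, with hypothetical names: enforce a minimum gap between successive calls to a streaming RPC so a buggy reconnect loop can't hammer the validator.

```rust
use std::time::Duration;
use tokio::time::{sleep, Instant};

/// Hypothetical guard that enforces a minimum spacing between calls to a
/// streaming RPC (e.g. a follower/batch stream), so a reconnect loop on the
/// fullnode side cannot turn into a request storm against the validator.
struct StreamThrottle {
    min_gap: Duration,
    last_call: Option<Instant>,
}

impl StreamThrottle {
    fn new(min_gap: Duration) -> Self {
        Self { min_gap, last_call: None }
    }

    /// Wait until at least `min_gap` has passed since the previous call.
    async fn ready(&mut self) {
        if let Some(last) = self.last_call {
            let elapsed = last.elapsed();
            if elapsed < self.min_gap {
                sleep(self.min_gap - elapsed).await;
            }
        }
        self.last_call = Some(Instant::now());
    }
}
```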
Just to give an update after lots of work in this area:
- Bytehound memory profiling, combined with other metrics, showed that the problem area is related to networking, specifically the follower API and the high number of followers in the production devnet environment
- Mark made a change which reduced the batch size and helped decrease problems
- Some memory flare-ups are still observable when there is a jump in load, as well as at other odd times. There is still a steady memory increase, but it is much slower
- Bytehound won't work going forward because of incompatibilities with some of the latest dependencies, e.g. RocksDB (it needs the aligned_alloc() API call, which Bytehound doesn't support); a low-tech in-process alternative is sketched below
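Just a sketch of that alternative, not necessarily what we will use: a counting wrapper around the system allocator, exposed as a metric. It only sees Rust-side allocations (RocksDB's C++ heap, including the block cache, will not show up), but it is dependency-free and unaffected by allocator incompatibilities.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts live heap bytes allocated through the Rust global allocator.
struct CountingAlloc;

static LIVE_BYTES: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            LIVE_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        LIVE_BYTES.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Expose this through the metrics endpoint to watch live Rust heap usage.
pub fn rust_live_heap_bytes() -> usize {
    LIVE_BYTES.load(Ordering::Relaxed)
}
```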
The remaining work is lower priority, to investigate remaining memory leaks and determine where they come from.
I think we can close this one out. There has been a ton of progress on this ticket, including lots of new metrics (RocksDB among them), plus work on profilers, experiments, etc. We have a much better handle on memory usage.
The RocksDB block cache is a big source. There are many temporary sources of high memory usage, including network buffers and NodeSyncState.
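Since the block cache is called out as a big contributor, here is a sketch of capping it with a shared LRU cache so its footprint stays bounded, assuming the rust-rocksdb crate (exact signatures vary between crate versions, e.g. whether Cache::new_lru_cache returns a Result):

```rust
use rocksdb::{BlockBasedOptions, Cache, Options, DB};

fn open_with_bounded_cache(path: &str) -> Result<DB, rocksdb::Error> {
    // Cap the block cache at 1 GiB instead of letting each instance /
    // column family grow its own default cache.
    let cache = Cache::new_lru_cache(1 << 30);

    let mut table_opts = BlockBasedOptions::default();
    table_opts.set_block_cache(&cache);

    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.set_block_based_table_factory(&table_opts);

    DB::open(&opts, path)
}
```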