
Investigate validator memory increase

Open lxfind opened this issue 3 years ago • 4 comments

Relevant discussions: https://mysten-labs.slack.com/archives/C02GD7J9HUM/p1656435768898279 https://mysten-labs.slack.com/archives/C03GKUEA5PF/p1656639974689169

lxfind avatar Jul 01 '22 04:07 lxfind

Insights

From investigations in staging and memory profiling, we have, at a high level, some insights into two suspected issues that amplify each other:

  1. A Narwhal node restart leads to a storm of requests for batches & certificates (A). That storm is overly large mainly because the Narwhal node doesn't have crash-recovery yet (A.1), and because of the speed at which the restarted node reissues its requests (A.2). This produces a large number of network messages in response; those responses are implemented in a way that is over-reliant on the tokio executor and consumes memory at the responders (B). A backoff sketch for (A.2) follows this list.

  2. Historically, as well as now with gRPC, node-to-node connections are "1-way": node a -> node b is one connection and node b -> node a is a different one. When a restarts, this can lead to an issue where a can see b but b can't see a: there's a pile-up of pending messages at b (C).
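
As a rough sketch of the (A.2) direction only (not the actual Narwhal fix), a restarted node can space out its re-requests with capped exponential backoff instead of reissuing them as fast as possible; the retry_with_backoff helper and the example usage below are hypothetical:

```rust
use std::time::Duration;
use tokio::time::sleep;

/// Retry an async operation with capped exponential backoff, so a freshly
/// restarted node does not flood its peers with a burst of re-requests.
async fn retry_with_backoff<T, E, F, Fut>(mut op: F, max_attempts: u32) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut delay = Duration::from_millis(100);
    let max_delay = Duration::from_secs(5);
    let mut attempt = 0;
    loop {
        match op().await {
            Ok(value) => return Ok(value),
            Err(err) => {
                attempt += 1;
                if attempt >= max_attempts {
                    return Err(err);
                }
                // Wait before reissuing; double the delay up to a cap.
                sleep(delay).await;
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}

#[tokio::main]
async fn main() {
    // Hypothetical use: re-fetch a batch with spaced-out retries after restart.
    let result: Result<&str, &str> =
        retry_with_backoff(|| async { Err("peer not ready yet") }, 5).await;
    println!("{result:?}");
}
```

Adding jitter to the delay would further smooth the storm when many requests are reissued at once.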

Resources

Repro instructions for problem (1.): https://gist.github.com/mystenmark/1c923ae86665f666595f359336feedb4

Mitigations: https://github.com/MystenLabs/narwhal/pull/462 (A.2) https://github.com/MystenLabs/narwhal/pull/463 (B) https://github.com/MystenLabs/narwhal/pull/465 (A)

Better operations (logging, profiling, observability): https://github.com/MystenLabs/sui/pull/2984 https://github.com/MystenLabs/narwhal/pull/461 https://github.com/MystenLabs/narwhal/pull/426

Follow-up work

  • confirm problem (C) through packet capture, and investigate shortening the TTL of connections in mysten-network to mitigate it (see the keep-alive sketch after this list),
  • resolve (A.1) by resolving https://github.com/MystenLabs/sui/issues/5200,
  • continue the line of work on memory profiling,
  • run staging with the above mitigations and confirm they significantly relieve the issue.
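
On the (C) follow-up above, one sketch of the direction (assuming the node-to-node channels are tonic-based gRPC; mysten-network's real configuration surface may differ) is to rely on aggressive HTTP/2 keep-alives so that a stale one-way connection is detected and redialed quickly instead of accumulating pending messages:

```rust
use std::time::Duration;
use tonic::transport::Endpoint;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical peer address; substitute the real validator endpoint.
    let channel = Endpoint::from_static("http://validator-b.example:8080")
        // Send HTTP/2 PINGs so a dead peer is noticed quickly...
        .http2_keep_alive_interval(Duration::from_secs(5))
        // ...and drop the connection if the peer doesn't answer in time,
        // forcing a fresh dial instead of queueing messages on a stale link.
        .keep_alive_timeout(Duration::from_secs(10))
        .keep_alive_while_idle(true)
        .connect_timeout(Duration::from_secs(5))
        .connect()
        .await?;
    let _ = channel; // hand the channel to the generated gRPC client here
    Ok(())
}
```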

huitseeker avatar Jul 08 '22 23:07 huitseeker

@huitseeker and @velvia can update later.

lxfind avatar Jul 18 '22 15:07 lxfind

Update July 18:

  • the validator memory increase seen when nodes are restarted is fixed as of 0.6.0 (A.2 and B are solved)
  • problem (A.1) has a prototype fix slated for 0.6.1, undergoing testing.
  • problem (C) is still in the same status (@tharbert to update)
  • the gradual memory increase in validators (unrelated to restarts) is not reproducible without load. We will need to instrument the nodes to profile in production; the follow-up is https://github.com/MystenLabs/sui/issues/2974. @velvia to document status here. A minimal instrumentation sketch follows this list.
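
As a minimal sketch of what "instrument the nodes to profile in production" could look like (the metric name, registry wiring, and sampling interval are illustrative, not Sui's actual telemetry), a background task can export the process RSS as a Prometheus gauge:

```rust
use std::time::Duration;
use prometheus::{IntGauge, Registry};

/// Read resident set size in bytes from /proc/self/statm (Linux only).
fn resident_bytes() -> Option<i64> {
    let statm = std::fs::read_to_string("/proc/self/statm").ok()?;
    let pages: i64 = statm.split_whitespace().nth(1)?.parse().ok()?;
    Some(pages * 4096) // assume 4 KiB pages
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let registry = Registry::new();
    // Hypothetical metric name; register it alongside existing node metrics.
    let rss_gauge = IntGauge::new("process_resident_bytes", "Resident set size in bytes")?;
    registry.register(Box::new(rss_gauge.clone()))?;

    // Sample RSS every 10 seconds; an HTTP exporter would call
    // registry.gather() elsewhere to expose the values.
    let mut ticker = tokio::time::interval(Duration::from_secs(10));
    loop {
        ticker.tick().await;
        if let Some(bytes) = resident_bytes() {
            rss_gauge.set(bytes);
        }
    }
}
```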

huitseeker avatar Jul 18 '22 15:07 huitseeker

The next step here is to generate load and try to reproduce the "gradual memory increase". We likely need to do this in a deployed environment.

The goal is to shorten, as much as possible, the time it takes to reproduce running out of memory. This works if the memory increase is tied to transaction activity, since we can then inject more transactions. If it is purely time-dependent, we would simply need more time to reproduce.

If it is instead tied to specific events such as restarts and failures, then we need to simulate those.

Steps:

  1. Check current status and how much load can be generated locally
  2. Look into a remote load generator; make sure it can run for long periods of time
  3. Look into deploying a load generator + devnet setup, with telemetry and profiling
  4. Run load and try to simulate failures and memory issues (and see what other issues surface); a minimal load-loop sketch follows this list
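
For step 4, a minimal load-loop shape (submit_transaction and the target rate are placeholders, not an existing Sui client API): drive transactions at a fixed rate with a tokio interval so that turning the rate up shortens the time-to-reproduce:

```rust
use std::time::Duration;
use tokio::time::{interval, MissedTickBehavior};

/// Placeholder for whatever client call actually submits a transaction.
async fn submit_transaction(i: u64) {
    // e.g. build and send a transfer via the client SDK
    let _ = i;
}

#[tokio::main]
async fn main() {
    let target_tps: u64 = 200; // tune upward to shorten time-to-reproduce
    let mut ticker = interval(Duration::from_micros(1_000_000 / target_tps));
    // Don't try to "catch up" after stalls; that would turn a pause into a burst.
    ticker.set_missed_tick_behavior(MissedTickBehavior::Delay);

    for i in 0u64.. {
        ticker.tick().await;
        // Fire-and-forget so slow responses don't throttle the offered load.
        tokio::spawn(submit_transaction(i));
    }
}
```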

velvia avatar Jul 29 '22 17:07 velvia

fullnode is worth looking into as well

lxfind avatar Aug 01 '22 15:08 lxfind

I think a lot of the recent issues have been driven entirely by the handle_batch_streaming implementation, plus bugs in fullnode that make it call that method too frequently.

Fixes for both issues should be ready this week.
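
A hedged sketch of one server-side direction (not necessarily the actual handle_batch_streaming fix): bound how many batch streams are served concurrently so a burst of fullnode requests cannot balloon responder memory. The handler shape and constants below are illustrative:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

/// At most this many batch streams are served at once; extra callers wait
/// instead of each buffering batches in responder memory.
const MAX_CONCURRENT_STREAMS: usize = 16;

async fn handle_batch_stream(limiter: Arc<Semaphore>, request_id: u64) {
    // Held for the lifetime of the stream; released when the permit is dropped.
    let _permit = limiter
        .acquire_owned()
        .await
        .expect("semaphore is never closed");
    // ... serve the stream of batches for `request_id` here ...
    let _ = request_id;
}

#[tokio::main]
async fn main() {
    let limiter = Arc::new(Semaphore::new(MAX_CONCURRENT_STREAMS));
    for request_id in 0..100u64 {
        tokio::spawn(handle_batch_stream(limiter.clone(), request_id));
    }
    // Give the demo tasks a moment to run; in a real server the handlers
    // live inside the RPC service and the process keeps running.
    tokio::time::sleep(Duration::from_secs(1)).await;
}
```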

mystenmark avatar Aug 15 '22 15:08 mystenmark

Just to give an update after lots of work in this area:

  • Bytehound memory profiling, combined with other metrics, showed that the problem area is related to networking, specifically the follower API and the high number of followers in the production devnet environment
  • Mark made a change that reduced the batch size and helped decrease the problems
  • Some memory flare-ups are still observable when there is a jump in load, as well as at other odd times. There is still a steady memory increase, but it is much slower
  • Bytehound won't work going forward because of incompatibilities with some of the latest dependencies, such as RocksDB (it needs the aligned_alloc() API call, which Bytehound doesn't support)

The remaining work is lower priority: investigate the remaining memory leaks and determine where they come from.

velvia avatar Aug 15 '22 15:08 velvia

I think we can close this one out. There has been a ton of progress on this ticket, including many new metrics (RocksDB metrics among them), plus work on profilers, experiments, etc. We have a much better handle on memory use.

RocksDB block cache is a big source. There are many temporary sources of high memory usage, including network buffers and NodeSyncState.
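
For the block-cache point, a hedged rust-rocksdb sketch (exact method signatures vary across rocksdb crate versions, and this is not Sui's actual DB setup): cap the shared LRU block cache so this known-large consumer has an explicit upper bound.

```rust
use rocksdb::{BlockBasedOptions, Cache, Options, DB};

fn open_db_with_bounded_cache(path: &str) -> Result<DB, rocksdb::Error> {
    // Shared LRU block cache capped at 512 MiB (tune for the node's memory budget).
    let cache = Cache::new_lru_cache(512 * 1024 * 1024);

    let mut block_opts = BlockBasedOptions::default();
    block_opts.set_block_cache(&cache);

    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.set_block_based_table_factory(&block_opts);

    DB::open(&opts, path)
}
```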

velvia avatar Nov 02 '22 18:11 velvia