Validator consumes too many resources, causing the client to stall
I ran into a client stall while participating as a validator in wave 2. sui-node stalled for 2 minutes and then recovered automatically.
When this happened, CPU usage and disk writes were very high, as shown in the attachment below.

During this short window, the Prometheus agent did not collect the client's metrics, so the client was probably too busy running or handling other work to respond.
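One way to pin down the stall window from the metrics side is to look for gaps between consecutive scrape timestamps. The snippet below is only a minimal sketch and is not part of Sui or Prometheus: the function name `find_scrape_gaps`, the 15-second scrape interval, and the tolerance factor are all assumptions for illustration, and the timestamps are expected to be exported separately as Unix seconds.

```python
from typing import List, Tuple


def find_scrape_gaps(
    timestamps: List[float],
    scrape_interval: float = 15.0,  # assumed scrape interval in seconds
    tolerance: float = 2.0,         # assumed: flag gaps over 2x the interval
) -> List[Tuple[float, float]]:
    """Return (last_seen, next_seen) pairs where samples are missing.

    A gap is reported when two consecutive sample timestamps are further
    apart than tolerance * scrape_interval.
    """
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > tolerance * scrape_interval:
            gaps.append((prev, cur))
    return gaps
```

For example, samples at 0, 15, 30, 150, and 165 seconds would report a single gap between 30 and 150, which is roughly the shape of a 2-minute stall.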
Our operating environment:
- Ubuntu 22.04.1 LTS
- Using Docker
I have also attached logs. This issue occurred from 14:17 to 14:19 (UTC).
A similar problem occurred again, so I have attached additional logs below. It lasted from 00:50 to 01:00 UTC. I hope this helps with a fix.
Thanks for reporting the issue. After recovery, Sui was spending CPU and disk to catch up, specifically on checkpoint download and execution. The 2-minute stall itself was also interesting, but we did not find a root cause.