Validator consumes too many resources, which causes the client to stall

Open rootwarp opened this issue 2 years ago • 1 comments

I ran into a client stall while participating as a validator in wave 2.

sui-node stalled for 2 minutes and then recovered automatically. While this was happening, CPU usage and disk writes were very high, as shown in the screenshot attached below.

Screenshot 2023-02-07 at 23 20 34

During this short window, the Prometheus agent could not collect the client's metrics, so the client was probably too busy running or handling some jobs to respond.

Our operating environment is:
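One way to pin down the exact stall window is to look for gaps between consecutive metric scrape timestamps. Below is a minimal sketch, not something taken from our setup: the function name `find_scrape_gaps`, the 15-second scrape interval, and the sample timestamps are all hypothetical, and it assumes the scrape times have already been exported as Unix seconds.

```python
def find_scrape_gaps(timestamps, scrape_interval=15.0, tolerance=2.0):
    """Return (gap_start, gap_end, gap_seconds) for every pair of
    consecutive samples spaced more than tolerance * scrape_interval apart.

    timestamps: iterable of Unix timestamps (seconds) of successful scrapes.
    scrape_interval: expected spacing between scrapes (assumed value).
    tolerance: how many missed intervals count as a gap (assumed value).
    """
    ordered = sorted(timestamps)
    threshold = scrape_interval * tolerance
    gaps = []
    # Compare each sample with the one that follows it.
    for prev, cur in zip(ordered, ordered[1:]):
        delta = cur - prev
        if delta > threshold:
            gaps.append((prev, cur, delta))
    return gaps


if __name__ == "__main__":
    # Hypothetical scrape times: regular 15 s samples with one long hole.
    samples = [0, 15, 30, 150, 165]
    for start, end, seconds in find_scrape_gaps(samples):
        print(f"no samples between t={start} and t={end} ({seconds} s)")
```

A gap much longer than the scrape interval only shows that the metrics endpoint did not answer in time. It does not say what the node was busy with, so it should be read together with the logs.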

  • Ubuntu 22.04.1 LTS
  • Using Docker

I have also attached the logs. The issue occurred from 14:17 to 14:19 (UTC).

sui-node.2023020714_1.log.zip

rootwarp avatar Feb 07 '23 15:02 rootwarp

A similar problem happened again, so I have attached more logs below. It occurred from 00:50 to 01:00 UTC. I hope they help in finding a fix.

sui-node.2023021000_3.log.tar.gz

rootwarp avatar Feb 10 '23 02:02 rootwarp

Thanks for reporting the issue. After recovery, Sui was spending CPU and disk I/O to catch up, specifically on checkpoint download and execution. The 2-minute stall itself was also interesting, but we did not find a root cause.

mwtian avatar Mar 02 '23 02:03 mwtian