
OOM Crashes on Juno Pod After Restart During Heavy Load

Open wojciechos opened this issue 1 year ago • 0 comments

Increased traffic to the starknet_call method on our k8s pod pushed CPU usage to 100%, causing request failures and block sync issues. Every subsequent restart of the pod was OOM-killed immediately at startup. After we replaced the database with a fresh one, the pod started and synced properly without any OOM errors, which suggests the database may have been corrupted.

k8s logs:

terminated
Reason: OOMKilled - exit code: 137
Started at: 2024-04-19T15:14:04+05:30
Finished at: 2024-04-19T15:14:51+05:30

Possible Causes:

  • Potential database corruption during restarts combined with high CPU load.
  • Recent Pebble updates

//UPDATE - 06.05.2024

  • Incident: Pod unable to keep up with syncing, resulting in failed requests due to reaching the CPU limit.
  • Actions taken: Added more pods and restarted the pod, but no improvement.
  • Resolution: Removing and replacing the DB resolved the issue.
  • Next steps: Prioritize investigating and fixing the underlying cause.

06-05-2024-incident.pdf

wojciechos, Apr 19 '24 13:04