
CRI-O stuck in "Could not restore" for a long time when node load is high and CRI-O is restarted

Open lance5890 opened this issue 4 months ago • 0 comments

What happened?

On one node, when the system load is high and we restart CRI-O, CRI-O gets stuck in the restore process for a long time. The logs show the following:

Oct 15 13:32:42 master-lharm-2 crio[3480]: time="2024-10-15 13:32:42.903497855+08:00" level=warning msg="Could not restore sandbox e456015ab35e79331beb0071d58ff312747bee64056d7abd4f746358a7401712: failed to Statfs \"/var/run/netns/f6300a46-4b30-4a3d-8c06-5d2bb5c67905\": no such file or directory"
Oct 15 13:32:43 master-lharm-2 crio[3480]: time="2024-10-15 13:32:43.339161797+08:00" level=warning msg="Deleting all containers under sandbox e456015ab35e79331beb0071d58ff312747bee64056d7abd4f746358a7401712 since it could not be restored"
Oct 15 13:33:15 master-lharm-2 crio[3480]: time="2024-10-15 13:33:15.733919215+08:00" level=warning msg="Could not restore sandbox e85249f52bd82fab8b187e5e6ff0e7f9f5e9244a12523baa971be8ba5d36df00: failed to Statfs \"/var/run/netns/202e8db9-b93d-4576-9e39-5b6589aef158\": no such file or directory"
Oct 15 13:33:16 master-lharm-2 crio[3480]: time="2024-10-15 13:33:16.327566397+08:00" level=warning msg="Deleting all containers under sandbox e85249f52bd82fab8b187e5e6ff0e7f9f5e9244a12523baa971be8ba5d36df00 since it could not be restored"
Oct 15 13:33:53 master-lharm-2 crio[3480]: time="2024-10-15 13:33:53.088736014+08:00" level=warning msg="Could not restore sandbox ea5dffa43cc58888f331c4542f7fa02fd87ce6e8722c018701f34adb3bbf2e4c: failed to Statfs \"/var/run/netns/8cee8b58-faca-4159-bae6-44119e7bfb7c\": no such file or directory"
Oct 15 13:33:53 master-lharm-2 crio[3480]: time="2024-10-15 13:33:53.471422710+08:00" level=warning msg="Deleting all containers under sandbox ea5dffa43cc58888f331c4542f7fa02fd87ce6e8722c018701f34adb3bbf2e4c since it could not be restored"
Oct 15 13:34:16 master-lharm-2 crio[3480]: time="2024-10-15 13:34:16.940943535+08:00" level=warning msg="Could not restore sandbox 92d39c21bd77d349068f1f6f8379267c40e77e4ebd981f5828c1ddbdf2662162: failed to Statfs \"/var/run/netns/fa2aacd2-986b-458c-96ef-2a4e231a00d2\": no such file or directory"
Oct 15 13:34:17 master-lharm-2 crio[3480]: time="2024-10-15 13:34:17.518848482+08:00" level=warning msg="Deleting all containers under sandbox 92d39c21bd77d349068f1f6f8379267c40e77e4ebd981f5828c1ddbdf2662162 since it could not be restored"
Oct 15 13:34:47 master-lharm-2 crio[3480]: time="2024-10-15 13:34:47.982638318+08:00" level=warning msg="Could not restore sandbox b23e853d33f32b930dc718be396ab2a632647979a76dabed6322a1c59fe2104d: failed to Statfs \"/var/run/netns/af5f5567-28a0-43fd-9072-32b7db1697d2\": no such file or directory"
Oct 15 13:34:48 master-lharm-2 crio[3480]: time="2024-10-15 13:34:48.174033605+08:00" level=warning msg="Deleting all containers under sandbox b23e853d33f32b930dc718be396ab2a632647979a76dabed6322a1c59fe2104d since it could not be restored"
Oct 15 13:35:14 master-lharm-2 crio[3480]: time="2024-10-15 13:35:14.167490731+08:00" level=warning msg="Could not restore sandbox a71024afae081939f8ddd2f386240de5fb1827bfab1c20319fbb72fdeeef398d: failed to Statfs \"/var/run/netns/787418dc-88a8-4b4e-a319-166232202cf6\": no such file or directory"
Oct 15 13:35:14 master-lharm-2 crio[3480]: time="2024-10-15 13:35:14.606607516+08:00" level=warning msg="Deleting all containers under sandbox a71024afae081939f8ddd2f386240de5fb1827bfab1c20319fbb72fdeeef398d since it could not be restored"
$ ls /var/lib/containers/storage/overlay-containers | wc -l
5785
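
Each warning above refers to a path under /var/run/netns that no longer exists, and each failed restore is followed by deleting all containers under that sandbox. A rough way to watch this on the node (a hedged diagnostic sketch, not part of the original report; counts will differ per node):

# network namespaces still present vs. container directories CRI-O has to walk
$ ls /var/run/netns | wc -l
$ ls /var/lib/containers/storage/overlay-containers | wc -l

# follow the restore/cleanup warnings live
$ journalctl -u crio -f | grep -E "Could not restore|Deleting all containers"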

What did you expect to happen?

Even when the node is under high system load, CRI-O should not be stuck in the restore process for a long time.

How can we reproduce it (as minimally and precisely as possible)?

Under high system load, create many pods, then restart CRI-O; see the sketch below.
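
A more concrete reproduction sketch (the load generator, image, and replica count are illustrative assumptions, not from the original report):

# generate sustained CPU and I/O load on the node (stress-ng is one option; any load generator works)
$ stress-ng --cpu 0 --io 4 --timeout 30m &

# create many pods that land on the loaded node (replica count is illustrative;
# pin them to the node with a nodeSelector or by cordoning the other nodes)
$ kubectl create deployment restore-test --image=registry.k8s.io/pause:3.9 --replicas=500

# restart CRI-O while the load is still running and watch the restore warnings
$ systemctl restart crio
$ journalctl -u crio -f | grep "Could not restore"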

Anything else we need to know?

No response

CRI-O and Kubernetes version

$ crio --version
1.25.8

$ kubectl version --output=json
# paste output here

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
5.15.131-3

Additional environment details (AWS, VirtualBox, physical, etc.)

physical

lance5890 · Oct 15 '24 05:10