Accidental deletion of a container

Workload 35722: https://explorer.testnet.grid.tf/api/v1/reservations/workloads/35722
It seems the container already got deleted on the 29th of April around 17:20 CEST. The logs on the node show that around this time, some new reservations were deployed and the flists were not in cache. This includes another container using the same flist. Since the container was supposedly running at the time, the flist should have been there. For some reason, the container exited, got restarted by the container daemon, and then failed because zinit was not found in the path. This further supports that the flist was no longer there. We will need to investigate what exactly caused this.
About 2 minutes later, the daemons were restarting, though there does not seem to be an indication of an upgrade, which is possibly related.
Another one: https://github.com/threefoldtech/itenv_testnet/issues/4
I need to clarify something first: a node can initiate a delete if it failed to start a workload, even if that workload has been running for some time. So basically, an error that crashes the workload, or a node reboot after which the workload couldn't be brought back to its running state, will cause it to get deleted, since that is the only way to communicate an error to the owner. It's better than having it reported as deployed but not actually running.
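To illustrate the behavior described above (this is not the actual zos code; the names `ensureRunning` and `decommission` are hypothetical), a node that cannot bring a workload back to its running state falls back to deleting it so the failure becomes visible to the owner:

```go
package main

import (
	"errors"
	"fmt"
)

// Workload is a simplified stand-in for a deployed reservation.
type Workload struct {
	ID      string
	Running bool
}

// ensureRunning tries to (re)start the workload, e.g. after a node reboot.
// In this sketch it only reports whether the workload came up.
func ensureRunning(w *Workload) error {
	if !w.Running {
		return errors.New("failed to bring workload to running state")
	}
	return nil
}

// decommission marks the workload as deleted, which is the only way the
// node can signal the failure back to the owner.
func decommission(w *Workload, reason error) {
	fmt.Printf("workload %s deleted: %v\n", w.ID, reason)
}

func main() {
	w := &Workload{ID: "35722", Running: false}
	if err := ensureRunning(w); err != nil {
		// Deleting beats leaving the workload reported as deployed
		// while it is not actually running.
		decommission(w, err)
	}
}
```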
Also, this container got deleted on the 29th of April, so that is already a long time ago. Any reason why it was only reported recently?
So now, about what I think happened: the logs from this time period show that the machine was booting up and, for some reason, the node couldn't redeploy the container (it seems the flist mount was somehow corrupt), hence the container failed to start, which caused the deletion.
I will have to look deeper into the logs to see what exactly happened to the flist mount.
On the other hand, the bot should recover by redeploying another container on a different node if this node suddenly becomes unreachable.
After investigating the issue more and looking deeper at the state of the node, I found the cause of this issue:

The OOM killer decided to kill some of the running 0-fs processes, which caused the container mount to fail, and hence the container itself.
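For reference, an OOM kill like this leaves a trace in the kernel ring buffer; here is a minimal, hypothetical sketch (not part of zos) of scanning the dmesg output for kills that hit a 0-fs process, assuming the usual "Killed process <pid> (<name>)" message format:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// Read the kernel ring buffer; OOM kills are logged there as
	// "Out of memory: Killed process <pid> (<name>) ..." on most kernels.
	out, err := exec.Command("dmesg").Output()
	if err != nil {
		fmt.Println("failed to read kernel log:", err)
		return
	}

	for _, line := range strings.Split(string(out), "\n") {
		// Only report kills that mention a 0-fs process.
		if strings.Contains(line, "Killed process") && strings.Contains(line, "0-fs") {
			fmt.Println(line)
		}
	}
}
```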
Possible solutions to avoid this in the future:
- Increase the container overhead to account for the 0-fs process that runs for the container
- Protect the 0-fs processes against the OOM killer by setting the OOM priority of the process (see the sketch after this list)
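A minimal sketch of the second option, assuming the standard Linux mechanism of writing -1000 to /proc/<pid>/oom_score_adj; this is not the actual change from PR #1272, just an illustration of the idea:

```go
package main

import (
	"fmt"
	"os"
)

// protectFromOOM writes the lowest possible score (-1000) to the
// process' oom_score_adj file, which tells the kernel OOM killer to
// never select that process as a victim. Requires sufficient privileges.
func protectFromOOM(pid int) error {
	path := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
	return os.WriteFile(path, []byte("-1000"), 0644)
}

func main() {
	// Illustration only: protect the current process. In practice this
	// would target the 0-fs process spawned for a container's flist mount.
	if err := protectFromOOM(os.Getpid()); err != nil {
		fmt.Println("failed to adjust OOM score:", err)
	}
}
```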
Edit:
- NodeID 8zPYak76CXcoZxRoJBjdU69kVjo7XYU1SFE2NEK4UMqn: it seems the node has other hardware issues, so it might not be a memory capacity planning issue that triggered the OOM killer. The 2 containers that were randomly deleted were running on this exact node.
So this was caused by the following issues:
- The node itself had a physical problem with one of the disks; this has been replaced
- The OOM killer killed the 0-fs process for the container. This is completely random, so a PR has been opened to make sure the 0-fs process is never selected by the OOM killer (#1272)
@xmonader @muhamadazmy Although this is hard to verify, I saw behavior similar to the one described in this issue happen even after the fix was merged, see: https://github.com/threefoldtech/js-sdk/issues/3148. Feel free to close the issue if you believe it is a different issue.