
pre-allocate storage for metadata json files, see containers/podman#13967

Open chilikk opened this issue 2 years ago • 4 comments

This change seems to fix the scenario described in containers/podman#17198

I would appreciate advice on writing a test case, though. I imagine that the delete_container test case would be a good starting point, but I lack knowledge of the test environment and don't know whether it is feasible to fill the filesystem, or to initialize and mount a new one.

chilikk avatar Jan 25 '23 22:01 chilikk

@nalind @mtrmac @giuseppe @vrothberg @saschagrunert @flouthoc PTAL

rhatdan avatar Jan 26 '23 19:01 rhatdan

My initial impression is that this is way too much complexity to handle a desperate situation that could be handled much more reliably by something like deleting a file in /var/log. There may well be several other places that require extra space to free large objects, and without a specific feature test to ensure that this keeps working, we are quite likely to add more of them without ever noticing.

I suspect all of this code is more likely to break something in unexpected / untested situations than to help in the situations it is intended for. E.g. the non-Linux path is almost certainly not going to be sufficiently tested, and the behavior difference is likely to be missed by future readers. Admittedly, that is a very vague and completely unquantified feeling.

So this is not quite a “no” … but it is almost one.

I can totally see your point regarding the complexity.

However, I do think it is a useful feature to be able to remove specific containers/images once the disk is full, rather than the discussed alternative of performing a system reset that drops all containers with their state, volumes etc. It might not matter that much in a desktop environment (and could be worked around by freeing some less important files, as suggested, since the containers storage would often be located on the root partition), but in a production environment it is desirable to be able to limit the number of affected services.

From what I have been able to gather so far, Podman needs additional space on podman rm when (1) updating the Bolt database and (2) updating containers.json, layers.json and mountpoints.json. Assuming /var/lib/containers is already located on a dedicated partition, it is possible to mitigate (1) by setting up a dedicated partition for /var/lib/containers/storage/libpod. However, such a solution is not applicable to (2), because the *.json files are located in the same directory as the actual data they describe.
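For what it's worth, mitigation (1) could look roughly like the following /etc/fstab entry (the device path and filesystem type here are placeholders, not from the PR):

```
# hypothetical example: keep the BoltDB on its own filesystem, so that
# podman rm can still update it when the image/layer partition is full
/dev/vg0/libpod  /var/lib/containers/storage/libpod  xfs  defaults  0 2
```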

Moreover, an atomic write of the *.json files as currently implemented requires roughly as much free space as the size of the original file, which can easily reach hundreds of kilobytes even with a handful of containers. The space requirement for this operation is therefore much larger than for a Bolt DB update, so when the disk is nearly full, an out-of-space failure is much more likely on the metadata *.json atomic writes than on the BoltDB update.

By keeping a temporary file of the same size around, we can guarantee that an atomic write of a smaller file (which is the case with podman rm) will succeed with regard to disk space. We cannot be sure that a temporary file of the same size can always be allocated at write time, so this is not a bullet-proof solution, but it still greatly reduces the probability of ending up in a situation where the containers and all their data must be re-created from scratch.

So, while I agree that one would normally try to avoid the complexity, I think it may be justified if it solves the actual problem. Running out of disk space may not be a frequent scenario, but when it happens it can be very frustrating.

chilikk avatar Jan 26 '23 21:01 chilikk


I agree with @mtrmac. There are so many other places where this could go wrong that I am not sure it is worth adding all this complexity, which is difficult to test, just for one instance.

giuseppe avatar Jan 27 '23 14:01 giuseppe

Or maybe … if we are serious about this, there really needs to be an end-to-end Podman test that ensures the actually desired functionality (removing a container when completely out of disk space) keeps working.

(I realize that this is a bit circular — we need this PR, and maybe other changes, before that test can be added. So, that means either doing all that work, and then discussing whether it is worth maintaining it, or discussing that now, with some hypothetical guess at how much work that is.)

mtrmac avatar Jan 27 '23 15:01 mtrmac