gvisor
Docker's SizeRw does not get updated by runsc
Description
Docker's inspect output has a field called SizeRw (https://docs.docker.com/reference/cli/docker/inspect/#size) that tracks the number of bytes changed compared to the base image. runsc does not seem to handle this field properly. Inserting files into the container still updates it correctly, even under runsc, but writes to disk made by programs running inside the container do not seem to be tracked.
Steps to reproduce
Use this Python script:
import os

# Function to write 10MB of data to disk
def write_data_to_disk(filename):
    data = b'0' * 10 * 1024 * 1024  # 10MB of data
    try:
        with open(filename, 'wb') as file:
            file.write(data)
        print("Data written successfully.")
    except OSError as e:
        print("Error: There isn't enough space on disk to write the data.")
        exit(1)

if __name__ == "__main__":
    write_data_to_disk("test_data.bin")
In this container:
FROM python:3.12.2-alpine3.19
COPY write_to_disk.py /write_to_disk.py
CMD ["python", "/write_to_disk.py"]
After running the container, you can observe SizeRw using docker inspect --size <container_id>, where you can retrieve the container id from docker ps -a. The SizeRw field of the response should then be around 10MB, which it is not for runsc.
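The check described above can be scripted. This is a minimal sketch, assuming the docker CLI is on PATH and that docker inspect --size prints a JSON array whose first element carries a top-level SizeRw field; the helper names parse_size_rw and container_size_rw are made up for illustration.

```python
import json
import subprocess


def parse_size_rw(inspect_output: str) -> int:
    """Extract SizeRw (in bytes) from the JSON printed by `docker inspect --size`."""
    containers = json.loads(inspect_output)  # docker inspect prints a JSON array
    if not containers:
        raise ValueError("docker inspect returned no containers")
    size_rw = containers[0].get("SizeRw")
    if size_rw is None:
        raise KeyError("SizeRw is missing; was --size passed?")
    return int(size_rw)


def container_size_rw(container_id: str) -> int:
    """Run `docker inspect --size <container_id>` and return its SizeRw."""
    result = subprocess.run(
        ["docker", "inspect", "--size", container_id],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_size_rw(result.stdout)


# Example (requires Docker and an existing container):
#   print(container_size_rw("<container_id>"))
```

With the default runtime the returned value should be close to 10 * 1024 * 1024 for the reproduction above; with runsc in its default configuration it stays far below that.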
runsc version
runsc version release-20240401.0
spec: 1.1.0-rc.1
docker version (if using docker)
Docker version 26.0.0, build 2ae903e
uname
Linux 6.5.0-26-generic #26-Ubuntu SMP PREEMPT_DYNAMIC x86_64 x86_64 x86_64 GNU/Linux
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
No response
My guess: gVisor is using an internal overlay filesystem, so writing data modifies process memory instead of the host filesystem. When docker inspect runs, it gets SizeRw from the host (maybe the size of the container's mount namespace). You might be able to get an accurate size from within gVisor (not sure whether we track it), but docker inspect will unfortunately not know how to get that number anyway.
Might there be any way that we could write data in a way that docker inspect would see it? Like some overlay filesystem configuration?
I've walked through Docker's code that computes this field and it appears to eventually end up in this calcSize function, which adds up the size of files on the host filesystem. Since gVisor's "top" part of the overlay lives in gVisor memory only, it's not on the host filesystem anywhere, so there's no way for this Docker code to count it, short of changing gVisor to actually write the overlay contents to the host filesystem (which would reduce I/O performance, and use extra disk space for no reason other than accounting).
If you do want the top overlay layer to live on the host filesystem, you can set up the overlay outside of gVisor and Docker, and then expose it to the sandbox as a bind mount. (Of course Docker won't know about it either, but you can then track usage manually because you know where the top of the overlay is.)
> Since gVisor's "top" part of the overlay lives in gVisor memory only, it's not on the host filesystem anywhere, so there's no way for this Docker code to count it
The overlay upper layer (a gVisor-internal tmpfs) has a file backend (called the "filestore") which lives on the host and hence is scannable by Docker. See the "Self-Backed Overlay" section in https://gvisor.dev/blog/2023/05/08/rootfs-overlay/.
The overlay filestore is basically a very large file which holds all the pages used by the upper layer. It is a sparse file (it starts empty and is populated on demand). When the application creates a new file and writes to it, the size of the filestore file does not change, but its disk usage does. This is observable by looking at stat.Blocks; stat.Size remains the same. The filestore file is resized only when stat.Blocks == stat.Size and more file size is needed for further allocations.
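The difference between the two stat fields can be reproduced with any sparse file. The following is a minimal sketch, not gVisor's filestore code: it assumes a Unix-like host whose filesystem supports sparse files, and the file name filestore.bin and the sizes are arbitrary. In Python, stat.Size corresponds to st_size and stat.Blocks to st_blocks, which is counted in 512-byte units.

```python
import os
import tempfile

LOGICAL_SIZE = 64 * 1024 * 1024  # apparent size of the sparse file (64 MiB)
PAYLOAD = b"0" * (1024 * 1024)   # 1 MiB of data that is actually written


def sparse_file_stats(directory: str) -> tuple[int, int]:
    """Create a sparse file and return (apparent_bytes, allocated_bytes)."""
    path = os.path.join(directory, "filestore.bin")  # hypothetical name
    with open(path, "wb") as f:
        f.truncate(LOGICAL_SIZE)  # sets st_size without allocating data blocks
        f.write(PAYLOAD)          # allocates blocks only for the written range
        f.flush()
        os.fsync(f.fileno())      # ensure the allocation is visible to stat
    st = os.stat(path)
    # st_blocks is reported in 512-byte units, independent of the fs block size
    return st.st_size, st.st_blocks * 512


with tempfile.TemporaryDirectory() as tmp:
    apparent, allocated = sparse_file_stats(tmp)
    print(f"apparent size: {apparent} bytes, allocated: {allocated} bytes")
```

Writing more data into the file raises the allocated figure while the apparent size stays at 64 MiB, which mirrors the filestore behavior described above.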
I think the issue is that the calcSize function uses stat.Size to calculate disk usage. It should use stat.Blocks. This is the same issue that occurred in containerd and was fixed by using stat.Blocks: https://github.com/containerd/continuity/commit/bc5e3edd2b742c38c762d928f267ad82922a1b63. So this needs to be fixed in Moby.
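To make the two accounting strategies concrete, here is a rough sketch of a directory walker that computes both totals. It is not Docker's calcSize or containerd's code, only an illustration of summing st_size versus st_blocks; the function name tree_sizes is hypothetical.

```python
import os


def tree_sizes(root: str) -> tuple[int, int]:
    """Return (apparent_bytes, allocated_bytes) for non-directory entries under root.

    apparent_bytes sums st_size (what a Size-based walker reports);
    allocated_bytes sums st_blocks * 512 (what a Blocks-based walker reports).
    """
    apparent = 0
    allocated = 0
    seen = set()  # (device, inode) pairs, so hard links are counted once
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))  # do not follow symlinks
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue
            seen.add(key)
            apparent += st.st_size
            allocated += st.st_blocks * 512
    return apparent, allocated
```

For a directory that contains a large sparse file, the first number is dominated by the file's apparent size while the second reflects only the blocks that have been written.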
While the above are definitely good points, for anyone who runs into similar issues: setting --overlay2=none on runsc makes Docker track SizeRw properly again. Of course this has some performance caveats, but for our use case this was actually quite a fitting resolution, as we want our container disk performance to mimic real disk performance as much as possible.
Actually, https://docs.docker.com/reference/cli/docker/inspect/#size intends to track the size of files, not disk usage. So it is unclear whether a fix similar to the containerd one (https://github.com/containerd/continuity/commit/bc5e3edd2b742c38c762d928f267ad82922a1b63) should be applied in this case as well. In the containerd case, we wanted disk usage: as you can see, the functions that were updated were diskUsage() and diffUsage(), and the disk usage stats were being used to impose storage limits. But Docker does not document that it wants the disk usage.
Not sure how useful the size stats are in themselves (since, as described above, you can have sparse files which can make the container filesystem look really large). But assuming Docker wants that, the --overlay2= setting breaks those size stats. By default, --overlay2=root:self. As mentioned above, if docker inspect --size is important to you, --overlay2=none will turn off the overlay optimizations and restore correct behavior for size stats.
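When runsc is launched through Docker, runtime flags are normally passed via the runtimes entry in Docker's daemon.json. The sketch below only generates such an entry; it assumes runsc is installed at /usr/local/bin/runsc (adjust to your system), and the helper name runsc_runtime_entry is made up. The result still has to be merged into your existing daemon.json (commonly /etc/docker/daemon.json), followed by a Docker daemon restart.

```python
import json


def runsc_runtime_entry(runsc_path: str, overlay2: str) -> dict:
    """Build a daemon.json fragment that registers runsc with an --overlay2 flag."""
    return {
        "runtimes": {
            "runsc": {
                "path": runsc_path,  # assumed install location of runsc
                "runtimeArgs": [f"--overlay2={overlay2}"],
            }
        }
    }


# Print the fragment for --overlay2=none, ready to merge into daemon.json.
print(json.dumps(runsc_runtime_entry("/usr/local/bin/runsc", "none"), indent=4))
```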