Cache should convert to RAW when needed
Many/most distro images are distributed in QCOW2 format, which is not supported by VZ.
So every time a VZ instance is started, the image is first converted into RAW format. This takes just a few seconds, but also increases disk space usage:
Normally the file would be copied via clonefile and not take up extra space. Any further changes would be written just to the copy via the copy-on-write mechanism, but the shared base image would only exist once.
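For illustration, this is roughly what such a copy-on-write copy looks like on macOS/APFS, using the clonefile wrapper from golang.org/x/sys/unix (a minimal, darwin-only sketch; the paths are made up):

```go
package main

import (
	"fmt"
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	src := os.Args[1] // e.g. the cached base image
	dst := os.Args[2] // e.g. the per-instance copy

	// Clonefile creates a copy-on-write clone on APFS: the clone shares all
	// data blocks with the source until either file is modified, so the
	// second copy takes up (almost) no extra space.
	if err := unix.Clonefile(src, dst, 0); err != nil {
		log.Fatalf("clonefile %s -> %s: %v", src, dst, err)
	}
	fmt.Printf("cloned %s to %s\n", src, dst)
}
```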
The savings become even larger when the user runs multiple instances using the same image:
Normally each additional image would use another clonefile and not use extra space. But since we are converting to RAW separately for each instance, they are not even clones of each other, but new identical conversions of the original image.
Assume the QCOW2 image is 500MB and you create 2 RAW instances: total space needed is 500MB for the QCOW2 version in the cache, plus 500MB for each instance, for a total of 1.5GB.
If we had a 500MB RAW image in the cache, then total space needed would be 500MB for the cached copy, plus pocket change for the metadata of the 2 clones, saving 1GB of disk.
I suggest that the cache download request should include the desired format.
- The desired format is already in the cache: return a copy/clone of it
- The image is in a different format in the cache: convert to the desired format, but keep original
- Download image from source, return it if already in desired format
- Otherwise convert to desired format, then delete the download
Case 2 can happen if the user used the image with QEMU before and now wants to create a VZ instance. It makes sense to keep the original format in case they want to create another QEMU instance. It is also possible that the QEMU instance still exists, in which case deleting the cache copy would not clear up any space because it is just a clone.
Case 4 is the case when a user only uses the default VZ driver. They will only need to keep the desired (RAW) version in the cache. In case they later need a QCOW2 version, they will have to download it again. But from then on both versions will be in the cache.
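For illustration, a minimal sketch of how the four cases above could be dispatched; the `Store` interface and its methods are hypothetical stand-ins, not the actual Lima cache API:

```go
package cache

import "fmt"

// Format is the on-disk image format requested by a driver.
type Format string

const (
	FormatRAW   Format = "raw"
	FormatQCOW2 Format = "qcow2"
)

// Store is a hypothetical view of the download cache; the real Lima cache
// is indexed by URL and looks different, this only illustrates the flow.
type Store interface {
	Lookup(url string, f Format) (path string, ok bool)    // cached in this format?
	LookupAny(url string) (path string, f Format, ok bool)  // cached in any format?
	Download(url string) (path string, f Format, err error) // fetch from source
	Convert(path string, from, to Format) (string, error)   // convert between formats
	Clone(path string) (string, error)                      // clonefile/reflink copy for an instance
	Remove(path string) error
}

// Ensure returns a path to an image in the desired format, following the
// four cases described above.
func Ensure(store Store, url string, want Format) (string, error) {
	// Case 1: the desired format is already cached: hand out a clone of it.
	if path, ok := store.Lookup(url, want); ok {
		return store.Clone(path)
	}
	// Case 2: a different format is cached: convert to the desired format,
	// but keep the original (e.g. for a future QEMU instance).
	if path, format, ok := store.LookupAny(url); ok {
		return store.Convert(path, format, want)
	}
	// Case 3: download from source; if it is already in the desired format,
	// we are done.
	path, format, err := store.Download(url)
	if err != nil {
		return "", fmt.Errorf("download %s: %w", url, err)
	}
	if format == want {
		return store.Clone(path)
	}
	// Case 4: otherwise convert to the desired format, then delete the download.
	converted, err := store.Convert(path, format, want)
	if err != nil {
		return "", err
	}
	return converted, store.Remove(path)
}
```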
The cache needs to retain the checksum of the original image and not use the checksum of the converted copy for lookup.
The cache needs to retain the checksum of the original image
The cache is indexed by URL, so it needs some kind of mechanism to refer to a converted copy of the image...
There was a similar situation before, with compressed images. We end up with two interesting checksums:
- one is the original download, with the compression and timestamps and stuff of that "wrapper"
- one is the actual contents, which might be interesting when checking for cache integrity etc.
I also thought it would be interesting to provide the size(s) up front, so that you knew without having to check?
- https://github.com/lima-vm/lima/issues/1586
If we are going to "extend" the cache layout again, maybe both features* could be added?
* But the need to cache the uncompressed image is much less with .zst than it was with .xz
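If the cache layout does get extended again, the extra information could be recorded per entry roughly like this (a sketch only; the field names are made up and not part of the current cache format):

```go
package cache

// Entry sketches the metadata a cache entry could carry if the layout were
// extended: one digest/size pair for the download as published (compressed,
// qcow2, ...) and one for the usable uncompressed/converted contents.
type Entry struct {
	URL string `json:"url"`

	// The original download ("wrapper"): what the published checksum refers to.
	DownloadDigest string `json:"downloadDigest"` // e.g. "sha256:..."
	DownloadSize   int64  `json:"downloadSize"`

	// The actual contents after decompression/conversion: useful for
	// integrity checks of the cached copy and for knowing sizes up front.
	ContentFormat string `json:"contentFormat"` // "raw", "qcow2", ...
	ContentDigest string `json:"contentDigest"`
	ContentSize   int64  `json:"contentSize"`
}
```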
The container drivers are doing something similar, but it is considered internal to the drivers.
Preferably they should use the instance dir (like WSL2), but others do it "internally" (AC/DC)
So we keep the original tarball (that we downloaded) as the "basedisk", and it can be linked/cloned
But use some driver-internal image/snapshot when actually running the thing (similar to diffdisk).
Actual commands:
- `wsl.exe --import` (Import any Linux distribution to use with WSL): vhdx in the instance directory
- `container build` (using FROM in Dockerfile, this is a workaround): ext4 in the application directory, `~/Library/Application\ Support/com.apple.container`
- `docker import`: snapshots in the image store, `/var/lib/docker`
Assume the QCOW2 image is 500MB and you create 2 RAW instances: total space needed is 500MB for the QCOW2 version in the cache, plus 500MB for each instance, for a total of 1.5GB.
If we had a 500MB RAW image in the cache, then total space needed would be 500MB for the cached copy, plus pocket change for the metadata of the 2 clones, saving 1GB of disk.
This is not only about disk space, but also about the memory used to cache the image, unless the VM disables the page cache (e.g. cache=none in qemu).
I suggest that the cache download request should include the desired format.
- The desired format is already in the cache: return a copy/clone of it
- The image is in a different format in the cache: convert to the desired format, but keep original
- Download image from source, return it if already in desired format
- Otherwise convert to desired format, then delete the download
This seems too complicated and I don't see any benefit.
The cache needs to retain the checksum of the original image and not use the checksum of the converted copy for lookup.
If you want to verify the integrity of the raw image, you also need to keep a checksum of the raw image. This can be computed while converting the image to raw format using qcow2reader.
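A sketch of that idea: whatever produces the raw byte stream during conversion (go-qcow2reader in Lima's case; its actual API is not shown here) can be teed into a hash while the output file is written, so no second pass over the image is needed. A real implementation would also want to keep the output sparse.

```go
package convert

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
	"os"
)

// writeRawWithDigest copies the raw image contents from src (e.g. a reader
// backed by the qcow2 converter) to dstPath and returns the sha256 of the
// raw bytes, computed on the fly while writing.
func writeRawWithDigest(src io.Reader, dstPath string) (string, error) {
	dst, err := os.Create(dstPath)
	if err != nil {
		return "", err
	}
	h := sha256.New()
	// Every byte written to the raw file is also fed into the hash.
	if _, err := io.Copy(io.MultiWriter(dst, h), src); err != nil {
		dst.Close()
		return "", err
	}
	if err := dst.Close(); err != nil {
		return "", err
	}
	return "sha256:" + hex.EncodeToString(h.Sum(nil)), nil
}
```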
I suggest this behavior:
- When downloading an image, if the image is in uncompressed raw format, convert the image to raw format and delete the downloaded image.
- When creating a raw disk, clone the cache image to the instance directory
- When creating a qcow2 disk create an overlay using the raw image as backing file
This keeps instances isolated - nothing is shared, and deleting cached images is always safe. Creating a new disk takes no time or space. Memory used for caching images is minimized since instances share some of the blocks in the image.
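For the qcow2 case, the overlay could be created with qemu-img, something like this sketch (paths are illustrative; error handling kept minimal):

```go
package disk

import (
	"fmt"
	"os/exec"
)

// createOverlay creates an empty qcow2 disk that uses the cached raw image
// as its read-only backing file, so the instance only stores the blocks it
// changes.
func createOverlay(backingRaw, overlayQcow2 string) error {
	cmd := exec.Command("qemu-img", "create",
		"-f", "qcow2", // format of the new overlay disk
		"-F", "raw", // format of the backing file (required by newer qemu)
		"-b", backingRaw,
		overlayQcow2)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("qemu-img create: %v: %s", err, out)
	}
	return nil
}
```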
This is not only about disk space, but also about the memory used to cache the image, unless the VM disables the page cache (e.g. cache=none in qemu).
I don't understand what that means.
I suggest this behavior:
- When downloading an image, if the image is in uncompressed raw format, convert the image to raw format and delete the downloaded image.
I don't understand what this means either. Why do you need to convert a raw image to a raw image? Do you mean to compress it? How would that be useful? And why would you want to delete the uncompressed raw image?
- When creating a raw disk, clone the cache image to the instance directory
That's what we already do when you have a raw image in the cache.
- When creating a qcow2 disk create an overlay using the raw image as backing file
If I understand this correctly, you mean we should never keep a qcow2 image, but always convert to raw? Why would we then still want to create a qcow2 image, what would be the purpose?
I don't really understand your suggestion, so maybe I totally got it wrong, but is the idea:
- The cache only has uncompressed raw images.
- If the source is a qcow2 image, it will be converted to a raw image after download.
And that's basically it; we never ever deal with qcow2 except for the one-time conversion?
This seems fine to me, as long as the raw image can always be a sparse image. Otherwise it will balloon in size. Are there situations where the raw image cannot be sparse?
Note that this still doesn't deal with the fact that you cannot load a raw image into an Apple Container, or into WSL2, so we still will have the complexity of handling multiple formats.
This is not only about disk space, but also about the memory used to cache the image, unless the VM disables the page cache (e.g. cache=none in qemu).
I don't understand what that means.
When using images, the data is cached in the page cache. If you copy the images multiple times and use them at the same time, the same data will be cached in the page cache multiple times.
I suggest this behavior:
- When downloading an image, if the image is in uncompressed raw format, convert the image to raw format and delete the downloaded image.
I don't understand what this means either. Why do you need to convert a raw image to a raw image? Do you mean to compress it? How would that be useful? And why would you want to delete the uncompressed raw image?
This was a typo - if the image is not in raw uncompressed format (e.g. qcow2, raw.xz), convert it to raw uncompressed format.
For qcow2 there is no such thing as compressed or uncompressed qcow2. The image can contain compressed clusters, and this is the typical way images are delivered. When we convert qcow2 images we automatically decompress all clusters.
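For reference, the conversion step being discussed, done via qemu-img here as a stand-in (Lima could equally use go-qcow2reader), expands any compressed clusters into an uncompressed raw file:

```go
package disk

import (
	"fmt"
	"os/exec"
)

// convertToRaw converts a qcow2 image (which may contain compressed
// clusters) into an uncompressed raw image; qemu-img skips zero/unallocated
// clusters by default, so the output stays sparse on disk.
func convertToRaw(qcow2Path, rawPath string) error {
	cmd := exec.Command("qemu-img", "convert", "-O", "raw", qcow2Path, rawPath)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("qemu-img convert: %v: %s", err, out)
	}
	return nil
}
```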
- When creating a raw disk, clone the cache image to the instance directory
That's what we already do when you have a raw image in the cache.
Yes, no change is needed.
- When creating a qcow2 disk create an overlay using the raw image as backing file
If I understand this correctly, you mean we should never keep a qcow2 image, but always convert to raw?
Exactly.
Why would we then still want to create a qcow2 image, what would be the purpose?
If the user wants to create a qcow2 disk, we can support it. One advantage of qcow2 disks is internal snapshots. We can also simplify and always use raw format for OS images, and use qcow2 format only for additional disks.
I don't really understand your suggestion, so maybe I totally got it wrong, but is the idea:
- The cache only has uncompressed raw images.
- If the source is a qcow2 image, it will be converted to a raw image after download.
Yes
And that's basically it; we never ever deal with qcow2 except for the one-time conversion?
Yes
This seems fine to me, as long as the raw image can always be a sparse image. Otherwise it will balloon in size. Are there situations where the raw image cannot be sparse?
On any modern file system raw images are always sparse.
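A quick way to verify that on a given filesystem is to compare the apparent file size with the blocks actually allocated (a Unix-only sketch; on Windows fi.Sys() has a different type):

```go
package main

import (
	"fmt"
	"log"
	"os"
	"syscall"
)

// Prints the apparent size vs. the bytes actually allocated on disk; for a
// sparse raw image the allocated size should be far smaller.
func main() {
	fi, err := os.Stat(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	st := fi.Sys().(*syscall.Stat_t)
	fmt.Printf("apparent: %d bytes, allocated: %d bytes\n",
		fi.Size(), st.Blocks*512) // st_blocks is counted in 512-byte units
}
```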
Note that this still doesn't deal with the fact that you cannot load a raw image into an Apple Container, or into WSL2, so we still will have the complexity of handling multiple formats.
For Apple Container you need a container image, but this is not really a VM disk image. If we want to manage these images we can keep different image formats.
AI tells me that
WSL2 utilizes a Virtual Hard Disk (VHD) format for storing the Linux file system. Specifically, it uses a VHDX file with an ext4 file system for each installed Linux distribution.
If we want to support these images we need to keep yet another image format that will be used only by WSL2. These images are not useful on other platforms, so this can be a Windows-only implementation detail.
When using images, the data is cached in the page cache. If you copy the images multiple times and use them at the same time, the same data will be cached in the page cache multiple times.
I think you will only be able to share page cache entries if you use the same file. Even for a clonefile or reflink copy there will be separate cache instances (because they have a different inode, and the cache is inode-based). So for this purpose it would be better to use hardlinks for a read-only backing file, and then layer qcow2 on top. But that won't work with vz.
I feel this would be over-optimizing things. We don't even know how many pages are kept resident, or are being swapped out / discarded on cache pressure.
If the user wants to create a qcow2 disk, we can support it. One advantage of qcow2 disks is internal snapshots. We can also simplify and always use raw format for OS images, and use qcow2 format only for additional disks.
Is this a common enough use case to justify the complexity? Why not say we only support raw disks (on macOS and Linux), and users can make snapshots as reflink copies at the filesystem level?
AI tells me that
WSL2 utilizes a Virtual Hard Disk (VHD) format for storing the Linux file system. Specifically, it uses a VHDX file with an ext4 file system for each installed Linux distribution.
Yes, but you create these disks by importing a rootfs from a tarball, which can be created from a container image (or vice versa).