Sparse file support for virtual machine backups
Virtual machine images (Docker Desktop, Oracle's VirtualBox, AWS VM images, etc.) are typically sparse files: this saves disk space while still allowing the virtual machine's internal disk image to grow as needed.
So a VM image with a nominal maximum size of 60GB will typically be sparse and only use, say, 100MB of actual disk space to start with. The compression or expansion factor (depending on how you look at it) in this not-atypical example is ~600x.
Suppose I have a 4TB disk that is 2TB full of virtual machine images like the above.
I take a backup of my 4TB disk.
Later, when I need my backup, I try to restore it to a new 4TB disk. It fails.
What? I'm only restoring 2TB worth of data.
Q: Why can't I restore?
A: No sparse file support.
Without sparse file support on write, those 2TB of VM images balloon into ~1200TB of images on restore from backup; they won't fit on the size of disk they came from, even though, from the user's point of view, there should be plenty of room.
Also, the backup itself takes up ~1200TB instead of ~2TB, so I'm paying to store 600x more data than I really want to pay for.
Ouch.
Yes, we need sparse file support on both the backup and the restore side. Assigned myself to it because I have some PoC code for sparse file support lying around.
An interesting note: Apple on the APFS filesystem, and libarchive from the BSD distributions in general, appear to use zero-scanning on write to produce sparse files -- even if the original did not come from a sparse-supporting filesystem, as long as the target of the write supports sparseness(!)
This is an especially nice approach/feature in two regards (a sketch of the zero-scanning idea follows this list):
- If you restore to a target that supports sparse files, you get a sparse file. The original filesystem need not have supported sparse files for you to still get this benefit.
- It simplifies the metadata storage. You don't need to mark or recall where the sparse holes are, only where the runs of logical zeros are in the data. However, see below: you might want to distinguish sparse extents from unwritten, pre-allocated extents.
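Roughly, the zero-scanning-on-write idea might look like the following Go sketch. This is my own illustration, not how libarchive or APFS actually implement it; writeSparse, the package name, and the caller-supplied chunk size are all hypothetical.

```go
// Hypothetical illustration only: copy src into dst, turning runs of zero
// bytes into holes by seeking past them instead of writing them.
package sparsecopy

import (
	"bytes"
	"io"
	"os"
)

// writeSparse copies src to dst, skipping chunks that are entirely zero so the
// target filesystem can represent them as holes. chunk should be a multiple of
// the target's hole granularity (e.g. 4096).
func writeSparse(dst *os.File, src io.Reader, chunk int) error {
	buf := make([]byte, chunk)
	zeros := make([]byte, chunk)
	var off int64
	for {
		n, rerr := io.ReadFull(src, buf)
		if n > 0 {
			if n == chunk && bytes.Equal(buf, zeros) {
				// Full chunk of zeros: leave a hole by just advancing the offset.
				off += int64(n)
			} else {
				if _, err := dst.WriteAt(buf[:n], off); err != nil {
					return err
				}
				off += int64(n)
			}
		}
		switch rerr {
		case nil:
			continue
		case io.EOF, io.ErrUnexpectedEOF:
			// A trailing hole needs an explicit Truncate to set the logical
			// size, since skipped writes alone do not extend the file.
			return dst.Truncate(off)
		default:
			return rerr
		}
	}
}
```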
A second, related refinement is to think about how to handle pre-allocated but as-yet-unwritten storage, which is also logically zero. Systems like Facebook's Haystack and its Go counterpart SeaweedFS use pre-allocation to optimize append-heavy workloads and to avoid out-of-disk-space problems caused by fragmentation.
The idea with pre-allocation via the fallocate(fd, FALLOC_FL_KEEP_SIZE, start, len) call (or the fcntl equivalent on darwin) is that the filesystem will try to give you a single extent containing all the space you will need for the file.
This is sort of the opposite of sparse, but the unwritten (though pre-allocated) space is still logical zeros, and those zeros should be run-length compressed just like sparse zero runs. On restore, the backup system should probably also try to pre-allocate the space -- so unwritten pre-allocated space needs its own metadata -- but fall back to writing a sparse hole if a contiguous extent of the required size is not available on the target filesystem (this is what Mozilla does, e.g. https://stackoverflow.com/questions/11497567/fallocate-command-equivalent-in-os-x). That way the user gets a restore that does not fail, just slower append operations due to the missing pre-allocation (so issuing a warning would be appropriate).
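For the restore side, a hypothetical "try to pre-allocate, fall back to sparse with a warning" sketch could look like this. It is Linux-only via golang.org/x/sys/unix; restoreWithPrealloc is an illustrative name, and darwin would instead need the fcntl F_PREALLOCATE route mentioned above.

```go
// Hypothetical illustration only (Linux): restore-side pre-allocation with a
// sparse-file fallback, along the lines described above.
package sparsecopy

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// restoreWithPrealloc opens path for writing and asks the filesystem to
// reserve length bytes without changing the visible size (FALLOC_FL_KEEP_SIZE).
// If the allocation fails, it falls back to a plain sparse file and warns, so
// the restore still succeeds, just with slower appends later.
func restoreWithPrealloc(path string, length int64) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		return nil, err
	}
	if err := unix.Fallocate(int(f.Fd()), unix.FALLOC_FL_KEEP_SIZE, 0, length); err != nil {
		log.Printf("warning: could not pre-allocate %d bytes for %s (%v); restoring as a sparse file", length, path, err)
		if terr := f.Truncate(length); terr != nil {
			f.Close()
			return nil, terr
		}
	}
	return f, nil
}
```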
edit: The other reason it is nice to store runs of logical zeros rather than a "sparseness map" is that filesystems can vary, even between configurations of the same filesystem, in the minimum hole size they support. On the filesystems I have tried (APFS, ext4, XFS), the minimum size is 4096; on other filesystems it might be different, or different when configured to optimize a particular workload, so it should be queried at runtime when writing.
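Go's standard library does not expose fpathconf(_PC_MIN_HOLE_SIZE) (see the man page excerpt in the references below), so a runtime query might settle for the filesystem's reported block size as a proxy for the zero-run alignment. A hypothetical sketch, with holeGranularity as my own name and the block-size proxy as an assumption:

```go
// Hypothetical illustration only: pick the zero-run granularity at runtime
// instead of hard-coding 4096.
package sparsecopy

import (
	"os"

	"golang.org/x/sys/unix"
)

// holeGranularity returns the alignment, in bytes, to which zero runs should
// be rounded before being treated as hole candidates on f's filesystem. It
// falls back to 4096, the minimum observed on APFS, ext4, and XFS.
func holeGranularity(f *os.File) int64 {
	var st unix.Stat_t
	if err := unix.Fstat(int(f.Fd()), &st); err != nil || st.Blksize <= 0 {
		return 4096
	}
	return int64(st.Blksize)
}
```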
Another big hint when working with Apple's APFS filesystem: the usual means of producing a sparse file (truncating to a size larger than the file) only works above a certain size threshold, so when creating sparse test files on APFS you will typically need to use 32MB if not 64MB in the extending Truncate() call. At only 16MB, for example, the file may not end up sparse from a Truncate alone. While you can punch holes in it afterwards, that is a pain compared to just truncating to a large enough size (and risks running out of space). Apple sadly does not document the exact threshold, and it may depend on how the filesystem in question is configured, so tests may need to verify the behavior on the test filesystem before proceeding, to avoid flakiness.
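To guard against that flakiness, a test could create its candidate sparse file and then check the allocated blocks before relying on sparseness. A hypothetical helper (makeSparseTestFile), assuming a Unix-like stat that reports 512-byte block counts:

```go
// Hypothetical test helper: create a would-be sparse file with a large
// Truncate, then verify the filesystem actually made it sparse before the
// tests that depend on sparseness run.
package sparsecopy_test

import (
	"os"
	"syscall"
	"testing"
)

func makeSparseTestFile(t *testing.T, path string) {
	t.Helper()
	const size = 64 << 20 // 64MB, comfortably above the APFS thresholds observed above

	f, err := os.Create(path)
	if err != nil {
		t.Fatal(err)
	}
	if err := f.Truncate(size); err != nil {
		f.Close()
		t.Fatal(err)
	}
	if err := f.Close(); err != nil {
		t.Fatal(err)
	}

	fi, err := os.Stat(path)
	if err != nil {
		t.Fatal(err)
	}
	st, ok := fi.Sys().(*syscall.Stat_t)
	if !ok {
		t.Skip("cannot inspect allocated blocks on this platform")
	}
	// Stat_t.Blocks is in 512-byte units; a genuinely sparse 64MB file should
	// allocate far less than its logical size.
	if st.Blocks*512 >= size {
		t.Skipf("filesystem allocated %d bytes for a %d-byte truncate; not sparse here", st.Blocks*512, int64(size))
	}
}
```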
references
(run strace/dtrace on commands like cp, cpio, tar to see how they do this...)
from the fpathconf man page (fpathconf is declared in unistd.h):
_PC_MIN_HOLE_SIZE
// "If a file system supports the reporting of holes (see lseek(2)),
// pathconf() and fpathconf() return a positive number that
// represents the minimum hole size returned in bytes. The offsets
// of holes returned will be aligned to this same value. A special
// value of 1 is returned if the file system does not specify the
// minimum hole size but still reports holes."
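On the read/backup side, that same lseek(2) hole reporting can be used to enumerate data extents directly rather than zero-scanning everything. A hypothetical Go sketch (dataExtents and extent are my own names; it assumes golang.org/x/sys/unix defines SEEK_DATA/SEEK_HOLE for the platform):

```go
// Hypothetical illustration only: enumerate a file's data extents with
// lseek(2) SEEK_DATA/SEEK_HOLE, for use on the backup/read side.
package sparsecopy

import (
	"os"

	"golang.org/x/sys/unix"
)

// extent is a half-open [Start, End) byte range that the filesystem reports
// as containing data.
type extent struct {
	Start, End int64
}

// dataExtents walks f with SEEK_DATA/SEEK_HOLE. Filesystems that do not
// report holes behave as if the whole file were one data extent.
func dataExtents(f *os.File) ([]extent, error) {
	fd := int(f.Fd())
	var exts []extent
	var off int64
	for {
		start, err := unix.Seek(fd, off, unix.SEEK_DATA)
		if err == unix.ENXIO {
			return exts, nil // no data at or after off: the rest is a hole (or EOF)
		}
		if err != nil {
			return nil, err
		}
		end, err := unix.Seek(fd, start, unix.SEEK_HOLE)
		if err != nil {
			return nil, err
		}
		exts = append(exts, extent{Start: start, End: end})
		off = end
	}
}
```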