Importing a containerdisk onto a block volume loses sparseness
What happened: Importing a containerdisk onto a block volume loses sparseness. When I imported the centos-stream:9 containerdisk, which contains only about 2 GB of non-zero data, onto an empty 10 GB block volume, all 10 GB were written by CDI. Preallocation was not enabled.
What you expected to happen: Only the non-zero data should be written to the block volume. This saves space on the underlying storage.
How to reproduce it (as minimally and precisely as possible):
Create a DataVolume from the YAML below and observe the amount of storage allocated. I used KubeSAN as the CSI driver, so the LVM lvs command can be used to see the thin-provisioned storage usage. If you don't have thin-provisioned storage, you could use I/O stats or tracing to determine how much data is being written.
Additional context: I discussed this with @aglitke and we looked at the qemu-img command that is invoked:
Running qemu-img with args: [convert -t writeback -p -O raw /scratch/disk/disk.img /dev/cdi-block-volume]
Adding the --target-is-zero option should avoid writing every block in the target block volume.
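For illustration, the adjusted invocation might look something like this (qemu-img only accepts --target-is-zero together with -n, which skips target creation; the paths are the ones CDI already uses above):
qemu-img convert -t writeback -p -n --target-is-zero -O raw /scratch/disk/disk.img /dev/cdi-block-volume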
If there are concerns that some new block volumes come uninitialized (blocks not zeroed), then it should be possible to run blkdiscard --zeroout /path/to/block/device before invoking qemu-img with --target-is-zero. I have not tested this, but blkdiscard should zero the device efficiently and fall back to writing zero buffers on old hardware. On modern devices this would still be faster and preserve sparseness compared to writing all zeroes. On old devices it would be slower, depending on how much non-zero data the input disk image has.
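A rough sketch of that pre-zeroing step, untested as noted above (device path taken from the CDI command line):
# zero the whole device; blkdiscard --zeroout falls back to writing zero buffers
# on devices that cannot zero ranges efficiently
blkdiscard --zeroout /dev/cdi-block-volume
# ...then run the qemu-img convert with --target-is-zero as sketched above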
Environment:
- CDI version (use kubectl get deployments cdi-deployment -o yaml): 4.17.3
- Kubernetes version (use kubectl version): v1.30.5
- DV specification:
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  annotations:
    cdi.kubevirt.io/storage.bind.immediate.requested: "true"
    cdi.kubevirt.io/storage.import.lastUseTime: "2025-01-22T20:19:34.821435785Z"
    cdi.kubevirt.io/storage.usePopulator: "true"
  creationTimestamp: "2025-01-22T20:19:34Z"
  generation: 1
  labels:
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: cdi-controller
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 4.17.3
    cdi.kubevirt.io/dataImportCron: centos-stream-9-test-import-cron-3vs0wc
    instancetype.kubevirt.io/default-instancetype: u1.medium
    instancetype.kubevirt.io/default-preference: centos.stream9
  name: centos-stream-9-test-9515270a7f3d
  namespace: default
  resourceVersion: "217802868"
  uid: d8d56487-5686-47d0-93d0-79de02a8c2c3
spec:
  source:
    registry:
      url: docker://quay.io/containerdisks/centos-stream@sha256:9515270a7f3d3fd053732c15232071cb544d847e56aa2005f27002014b5becaa
  storage:
    resources:
      requests:
        storage: 10Gi
    storageClassName: kubesan
- Cloud provider or hardware configuration: N/A
- OS (e.g. from /etc/os-release): N/A
- Kernel (e.g. uname -a): N/A
- Install tools: N/A
- Others: N/A
@stefanha Is there a particular version of qemu-img that has this support, or has it been there a long time? We should make sure the version of qemu-img used in CDI supports this flag.
--target-is-zero was introduced in QEMU 5.0.0 in April 2020. You can get an idea of which Linux distro releases include that QEMU version here:
https://repology.org/project/qemu/versions
It is available starting from RHEL 8, Ubuntu 22.04, OpenSUSE Leap 15.4, Debian 10 (backports) or 11.
Super interesting, thanks for opening the issue! I am wondering how a certain test we have isn't catching this: https://github.com/kubevirt/containerized-data-importer/blob/41b96ed638b4b3690635855fcb7e18f57455979c/tests/import_test.go#L1643
I know our sparseness verification util is broken ATM, but even on a custom branch it passes: https://github.com/kubevirt/containerized-data-importer/pull/3213
I am wondering how a certain test we have isn't catching this
This du(1) command-line probably isn't working as expected on a block device:
https://github.com/kubevirt/containerized-data-importer/blob/main/tests/framework/pvc.go#L552
I am wondering how a certain test we have isn't catching this
This du(1) command-line probably isn't working as expected on a block device:
https://github.com/kubevirt/containerized-data-importer/blob/main/tests/framework/pvc.go#L552
This PR gets rid of the du check and relies on qemu-img info actual size
This explains it:
INFO: VerifySparse comparison: OriginalVirtual: 18874368 vs SizeOnDisk: 0
SizeOnDisk being qemuImgInfo.ActualSize
(Output from that custom PR branch)
There is no generic way in Linux to query a block device to find out how many blocks are allocated, so SizeOnDisk will not have a useful value.
Maybe this trick will work: create a sparse file using truncate(1) and then create a corresponding loopback block device using losetup(8). The test would be able to look at the blocks allocated in the underlying sparse file to get an approximation of the number of blocks touched on the loopback block device, modulo file system effects like its block size.
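A minimal sketch of that loopback trick (file and device names are illustrative, not taken from the CDI test framework):
truncate -s 10G backing.img                     # sparse backing file
LOOP=$(sudo losetup --find --show backing.img)  # e.g. /dev/loop0
# ...run the import with $LOOP as the block target...
du --block-size=1 backing.img                   # bytes actually allocated in the backing file
sudo losetup -d "$LOOP"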
Maybe this trick will work: create a sparse file using truncate(1) and then create a corresponding loopback block device using losetup(8). The test would be able to look at the blocks allocated in the underlying sparse file to get an approximation of the number of blocks touched on the loopback block device, modulo file system effects like its block size.
Can't we just use dd and copy the content into a scratch filesystem volume?
Can't we just use dd and copy the content into a scratch filesystem volume?
If I understand correctly, the test is attempting to verify that the block device was written sparsely (zero blocks were skipped). Simply using dd to copy the block device to a file won't show whether the block device was written sparsely, so I don't think that approach works.
You could first populate the block device with a pattern, import the containerdisk, and then check to see whether the pattern is still visible in blocks where the containerdisk is zero. That last step could be a single checksum comparison of the contents of the whole disk.
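As a rough, spot-check variant of that idea (assuming a 10 GiB target at /dev/cdi-block-volume, a fixed 0x55 byte pattern, and that the tail of the containerdisk is zero; all of these are illustrative assumptions):
# pre-fill the target with a non-zero pattern before the import
tr '\0' '\125' < /dev/zero | dd of=/dev/cdi-block-volume bs=1M count=10240 oflag=direct
# ...run the import...
# a region that is zero in the containerdisk should still show the pattern,
# e.g. compare the last MiB of the device against the same pattern
cmp <(dd if=/dev/cdi-block-volume bs=1M skip=10239 count=1 status=none) \
    <(tr '\0' '\125' < /dev/zero | head -c 1048576) && echo "pattern preserved"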
Looks like this issue is somehow hidden with Ceph RBD (using the same 10Gi image):
# rook-ceph-toolbox pod
$ ceph df -f json | jq .pools[0].stats.bytes_used
2036678656
Ceph might be doing zero detection or deduplication? Even if this is the case, you should be able to see the issue by running the same qemu-img command-line as CDI under strace(1) and looking at the pattern of write syscalls.
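For example, something along these lines should make the write pattern visible (paths illustrative; -f matters because qemu-img may issue I/O from worker threads):
strace -f -e trace=pwrite64,pwritev,fallocate,ioctl \
    qemu-img convert -t writeback -p -O raw /scratch/disk/disk.img /dev/cdi-block-volume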
/assign @noamasu
Discussed this in the sig-storage meeting; we should be okay with just unit testing the command-line args to qemu-img and trusting that the rest is verified at the virt layer. I just think a functest for this is going to get too involved/coupled to a specific storage type.
A quick update
After performing various tests about this, here is what was observed:
Looking at /proc/diskstats (making sure all values are zeroed for the destination device before running the test):
without --target-is-zero:
251 16 rbd1 91 0 9952 53 3144 496557 20971521 92545 0 7750 92598 0 0 0 0 0 0
with --target-is-zero:
251 16 rbd1 91 1 9952 62 677 496557 3977872 89978 0 7185 90041 0 0 0 0 0 0
Writes completed: 3,144 (without --target-is-zero) vs 677 (with); sectors written: 20,971,521 (without) vs 3,977,872 (with). As expected, using --target-is-zero resulted in significantly fewer write operations and sectors written.
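For reference, those figures map to these /proc/diskstats columns (a small awk sketch; field positions as documented in the kernel's iostats documentation):
awk '$3 == "rbd1" { print "writes completed:", $8, "sectors written:", $10 }' /proc/diskstats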
Looking at the block device usage (an RBD volume):
without --target-is-zero:
sh-5.1# rbd du replicapool/csi-vol-fa8b69d1-11ff-43d1-8325-6211b91d7110 --mon-host 10.104.50.186:6789 --keyring /etc/ceph/keyring-store/keyring
warning: fast-diff map is not enabled for csi-vol-fa8b69d1-11ff-43d1-8325-6211b91d7110. operation may be slow.
NAME PROVISIONED USED
csi-vol-fa8b69d1-11ff-43d1-8325-6211b91d7110 10 GiB 1.9 GiB
with --target-is-zero:
sh-5.1# rbd du replicapool/csi-vol-c58e2ac8-a400-45ea-a0bd-204576c17a88 --mon-host 10.104.50.186:6789 --keyring /etc/ceph/keyring-store/keyring
warning: fast-diff map is not enabled for csi-vol-c58e2ac8-a400-45ea-a0bd-204576c17a88. operation may be slow.
NAME PROVISIONED USED
csi-vol-c58e2ac8-a400-45ea-a0bd-204576c17a88 10 GiB 1.9 GiB
Both configurations show a provisioned size of 10 GiB with 1.9 GiB in use.
Looking at the block device usage (local, creating a loop device with truncate -s 10G test.img && losetup -f test.img):
with --target-is-zero:
[root@node01 ~]# ls -ltrs
total 1988956
1988940 -rw-r--r--. 1 root root 10737418240 Mar 31 09:41 test.img
without --target-is-zero:
[root@node01 ~]# ls -ltrs
total 1988956
1988940 -rw-r--r--. 1 root root 10737418240 Mar 31 10:09 test.img
Both runs resulted in a physical size of 1988940 blocks for the test image (which is presented as a loop device).
Incorporating the --target-is-zero flag (along with -n, which qemu-img requires in order to use --target-is-zero) effectively reduces I/O operations and the number of sectors written, regardless of the destination device type or whether a sparseness mechanism is available at the block driver level.
For now, a PR was made just to add these flags, but further verification is needed to determine whether:
- Adding a command such as blkdiscard --zeroout /path/to/block/device is necessary to fully zero the device, understanding its execution-time impact and whether it's worth it.
- Running cmp --bytes=$(blockdev --getsize64 <device>) <device> /dev/zero is required before zeroing the disk (took around 4.5 seconds for a 10 GB device); see the sketch after this list.
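A rough sketch of how those two checks could be combined (device path illustrative):
# check whether the device already reads back as all zeroes (~4.5 s for 10 GB above)
if cmp --silent --bytes="$(blockdev --getsize64 /dev/cdi-block-volume)" \
       /dev/cdi-block-volume /dev/zero; then
    echo "target already reads as zero; safe to pass --target-is-zero"
else
    # otherwise zero it first; --zeroout falls back to writing zero buffers
    # on devices that cannot zero ranges efficiently
    blkdiscard --zeroout /dev/cdi-block-volume
fi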
The loop device results are surprising to me. I expected the number of allocated blocks to be less for --target-is-zero. Not sure why this happens.
Maybe it has to do with the filesystem test.img was created on? Anyway, this is most observable with LVM storage when invoking lvs.
Testing on a thinly provisioned LVM volume ("thinlv"):
//with --target-is-zero
[root@node01 mapper]# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
thinlv vg Vwi-a-tz-- 10.00g thinpool 19.04
thinpool vg twi-aotz-- 15.00g 12.70 13.67
//without --target-is-zero
[root@node01 ~]# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
thinlv vg Vwi-a-tz-- 10.00g thinpool 100.00
thinpool vg twi-aotz-- 15.00g 66.67 26.81
[root@node01 ~]# ls -ltrhs
11G -rw-r--r--. 1 root root 20G Apr 1 19:45 test.img
Indeed, we can easily see the difference here. And this time, the size of test.img (the file backing the loop0 device that the LVM sits on) is 11 GB after copying to the LVM.
Interesting, so it looks like writing to the raw /dev/loop0 device preserves the sparseness of the truncated file (even without --target-is-zero), but not when it has LVM on top of it.
Short summary from sig-storage meeting - we should be okay to assume we get a --target-is-zero fit device. Longer discussion https://youtu.be/4oIO7YCL4IM?t=1128
@akalenyu and I came across an issue specific to preserving sparseness on LVM thin LVs. Using --target-is-zero avoids the issue. I'm capturing the details here in case they are needed in the future.
The following test looks at the ability to preserve sparseness on LVM thin LVs:
dd if=/dev/urandom of=src.img bs=1M count=1 && truncate -s 1G src.img && \
rm -f target.img && truncate -s 2G target.img && sudo losetup -f target.img && \
sudo vgcreate testvg /dev/loop0 && \
echo 'VG created' && stat target.img && \
sudo lvcreate --type thin-pool -l '100%FREE' testvg && \
echo 'thin-pool created' && stat target.img && \
sudo lvcreate --thinpool lvol1 -V 1G --name testlv testvg && \
echo 'LV created' && stat target.img && \
sudo chmod 666 /dev/testvg/testlv && \
strace -f -o /tmp/strace.log ~/qemu/build/qemu-img convert -t writeback -p -O raw src.img /dev/testvg/testlv && \
echo 'qemu-img convert done' && stat target.img && \
sudo lvchange -an testvg/testlv && \
sudo lvchange -an testvg/lvol1 && \
sudo vgchange -an testvg && \
sudo losetup -d /dev/loop0 && \
echo 'final stat' && stat target.img
Relevant output:
VG created
Size: 2147483648 Blocks: 256 IO Block: 4096 regular file
thin-pool created
Size: 2147483648 Blocks: 16640 IO Block: 4096 regular file
LV created
Size: 2147483648 Blocks: 16640 IO Block: 4096 regular file
qemu-img convert done
Size: 2147483648 Blocks: 2113792 IO Block: 4096 regular file
final stat
Size: 2147483648 Blocks: 2113792 IO Block: 4096 regular file
And strace.log contains:
fallocate(8, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 1058013184, 15728640) = -1 EOPNOTSUPP (Operation not supported)
The bulk of the data was allocated due to qemu-img convert. The reason was that fallocate(2) failed to write zeroes with unmap and qemu-img fell back to writing zero buffers to the device.
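One way to check whether a given target advertises an efficient write-zeroes operation, which that fallocate path depends on (the dm device name and LV path are illustrative):
# 0 here typically means the device cannot zero ranges without writing them out,
# consistent with the EOPNOTSUPP seen above
cat /sys/block/dm-3/queue/write_zeroes_max_bytes
lsblk --discard /dev/testvg/testlv   # DISC-GRAN/DISC-MAX show discard support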
@akalenyu Could you provide an update on where we stand with this?
we've summarized things on the PR, this is /held for now https://github.com/kubevirt/containerized-data-importer/pull/3695#issuecomment-2835076254
first commit in here is relevant - https://github.com/container-storage-interface/spec/pull/571
@akalenyu Nice!
If that's the case, I think we can add the --target-is-zero flag and assume that if the target wasn't zeroed out, it's the vendor CSI driver's responsibility?
@akalenyu Nice! If that's the case, I think we can add the
--target-is-zero flag and assume that if the target wasn't zeroed out, it's the vendor CSI driver's responsibility?
Since the PR is not merged and is stale for a year, unfortunately, it's not a promise we can build upon :(
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale