ZFS Pool on top of ZVOL on Proxmox VE - EXTREME Overhead and Used Space reported by Proxmox VE Host
System information
Host (Proxmox VE):
| Type | Version/Name |
|---|---|
| Distribution Name | Proxmox VE / Debian GNU/Linux |
| Distribution Version | Bookworm (12) with Proxmox VE Packages |
| Kernel Version | Linux pve16 6.5.13-3-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.13-3 (2024-03-20T10:45Z) x86_64 GNU/Linux |
| Architecture | x86_64 / amd64 |
| OpenZFS Version | zfs-2.2.3-pve1 / zfs-kmod-2.2.3-pve1 |
Guest VM (Debian GNU/Linux KVM):
| Type | Version/Name |
|---|---|
| Distribution Name | Debian GNU/Linux |
| Distribution Version | Bookworm (12) with Bookworm-Backports for ZFS/Kernel/Podman |
| Kernel Version | Linux GUEST 6.6.13+bpo-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.6.13-1~bpo12+1 (2024-02-15) x86_64 GNU/Linux |
| Architecture | x86_64 / amd64 |
| OpenZFS Version | zfs-2.2.3-1~bpo12+1 / zfs-kmod-2.2.3-1~bpo12+1 |
Describe the problem you're observing
I am facing some EXTREME overhead when creating a ZFS zpool inside a Proxmox VE Guest VM on top of a ZVOL on the Host.
The ZFS pool on the Host is itself sitting on top of a LUKS / cryptsetup full-disk-encryption layer, although I do NOT think this is relevant to the issue described here (since the Host ZFS pool sits on top of the dm-crypt / LUKS device).
zfs list on the host:
NAME USED AVAIL REFER MOUNTPOINT
...
rpool/data/vm-103-disk-0 8.27G 276G 8.23G -
rpool/data/vm-103-disk-1 67.6G 276G 67.6G -
...
Host pool properties (zpool get all rpool):
root@pve16:/tools_nfs# zpool get all rpool
NAME PROPERTY VALUE SOURCE
rpool size 920G -
rpool capacity 66% -
rpool altroot - default
rpool health ONLINE -
rpool guid 4485595745105166796 -
rpool version - default
rpool bootfs - default
rpool delegation on default
rpool autoreplace off default
rpool cachefile - default
rpool failmode wait default
rpool listsnapshots off default
rpool autoexpand off default
rpool dedupratio 1.00x -
rpool free 307G -
rpool allocated 613G -
rpool readonly off -
rpool ashift 12 local
rpool comment - default
rpool expandsize - -
rpool freeing 0 -
rpool fragmentation 68% -
rpool leaked 0 -
rpool multihost off default
rpool checkpoint - -
rpool load_guid 3306564752570665611 -
rpool autotrim off default
rpool compatibility off default
rpool bcloneused 0 -
rpool bclonesaved 0 -
rpool bcloneratio 1.00x -
rpool feature@async_destroy enabled local
rpool feature@empty_bpobj active local
rpool feature@lz4_compress active local
rpool feature@multi_vdev_crash_dump enabled local
rpool feature@spacemap_histogram active local
rpool feature@enabled_txg active local
rpool feature@hole_birth active local
rpool feature@extensible_dataset active local
rpool feature@embedded_data active local
rpool feature@bookmarks active local
rpool feature@filesystem_limits enabled local
rpool feature@large_blocks enabled local
rpool feature@large_dnode enabled local
rpool feature@sha512 enabled local
rpool feature@skein enabled local
rpool feature@edonr enabled local
rpool feature@userobj_accounting active local
rpool feature@encryption enabled local
rpool feature@project_quota active local
rpool feature@device_removal enabled local
rpool feature@obsolete_counts enabled local
rpool feature@zpool_checkpoint enabled local
rpool feature@spacemap_v2 active local
rpool feature@allocation_classes enabled local
rpool feature@resilver_defer enabled local
rpool feature@bookmark_v2 active local
rpool feature@redaction_bookmarks enabled local
rpool feature@redacted_datasets enabled local
rpool feature@bookmark_written active local
rpool feature@log_spacemap active local
rpool feature@livelist enabled local
rpool feature@device_rebuild enabled local
rpool feature@zstd_compress enabled local
rpool feature@draid enabled local
rpool feature@zilsaxattr disabled local
rpool feature@head_errlog disabled local
rpool feature@blake3 disabled local
rpool feature@block_cloning disabled local
rpool feature@vdev_zaps_v2 disabled local
Host ZVOL properties (zfs get all rpool/data/vm-103-disk-1):
NAME PROPERTY VALUE SOURCE
rpool/data/vm-103-disk-1 type volume -
rpool/data/vm-103-disk-1 creation Tue Apr 9 22:26 2024 -
rpool/data/vm-103-disk-1 used 67.6G -
rpool/data/vm-103-disk-1 available 276G -
rpool/data/vm-103-disk-1 referenced 67.6G -
rpool/data/vm-103-disk-1 compressratio 1.84x -
rpool/data/vm-103-disk-1 reservation none default
rpool/data/vm-103-disk-1 volsize 512G local
rpool/data/vm-103-disk-1 volblocksize 16K default
rpool/data/vm-103-disk-1 checksum on default
rpool/data/vm-103-disk-1 compression lz4 inherited from rpool
rpool/data/vm-103-disk-1 readonly off default
rpool/data/vm-103-disk-1 createtxg 2480671 -
rpool/data/vm-103-disk-1 copies 1 default
rpool/data/vm-103-disk-1 refreservation none default
rpool/data/vm-103-disk-1 guid 967947631676329448 -
rpool/data/vm-103-disk-1 primarycache all default
rpool/data/vm-103-disk-1 secondarycache all default
rpool/data/vm-103-disk-1 usedbysnapshots 15.9M -
rpool/data/vm-103-disk-1 usedbydataset 67.6G -
rpool/data/vm-103-disk-1 usedbychildren 0B -
rpool/data/vm-103-disk-1 usedbyrefreservation 0B -
rpool/data/vm-103-disk-1 logbias latency default
rpool/data/vm-103-disk-1 objsetid 159445 -
rpool/data/vm-103-disk-1 dedup off default
rpool/data/vm-103-disk-1 mlslabel none default
rpool/data/vm-103-disk-1 sync standard default
rpool/data/vm-103-disk-1 refcompressratio 1.84x -
rpool/data/vm-103-disk-1 written 2.11M -
rpool/data/vm-103-disk-1 logicalused 124G -
rpool/data/vm-103-disk-1 logicalreferenced 124G -
rpool/data/vm-103-disk-1 volmode default default
rpool/data/vm-103-disk-1 snapshot_limit none default
rpool/data/vm-103-disk-1 snapshot_count none default
rpool/data/vm-103-disk-1 snapdev hidden default
rpool/data/vm-103-disk-1 context none default
rpool/data/vm-103-disk-1 fscontext none default
rpool/data/vm-103-disk-1 defcontext none default
rpool/data/vm-103-disk-1 rootcontext none default
rpool/data/vm-103-disk-1 redundant_metadata all default
rpool/data/vm-103-disk-1 encryption off default
rpool/data/vm-103-disk-1 keylocation none default
rpool/data/vm-103-disk-1 keyformat none default
rpool/data/vm-103-disk-1 pbkdf2iters 0 default
rpool/data/vm-103-disk-1 snapshots_changed Tue Apr 9 23:30:02 2024 -
- The KVM Guest / (root filesystem) is stored on vm-103-disk-0: ext4 on top of a ZVOL, included for comparison purposes.
- The KVM Guest container data (Podman) is stored on vm-103-disk-1: the ZFS pool on top of a ZVOL, which has the issue.
df -h for the Guest VM root filesystem (backed by vm-103-disk-0, for comparison purposes):
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 30G 5.7G 23G 21% /
So 5.7G is used on the Guest and 8.27G on the Host. The overhead is roughly 45%: (8.27G/5.7G - 1) * 100%.
zfs list on the Guest VM for the container data storage (Podman, backed by vm-103-disk-1):
NAME USED AVAIL REFER MOUNTPOINT
zdata 793M 491G 96K /zdata
zdata/PODMAN 726M 491G 136K /zdata/PODMAN
The overhead is roughly 8350%: (67.6G/0.8G - 1) * 100% !!!
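Both overhead figures can be reproduced with a quick awk one-liner; the sizes are taken from the df and zfs list outputs quoted above:

```shell
# Overhead = (space used on the Host / space used in the Guest - 1) * 100%
# ext4 on ZVOL (vm-103-disk-0): 8.27G on the Host vs 5.7G in the Guest
awk 'BEGIN { printf "ext4 overhead: %.0f%%\n", (8.27 / 5.7 - 1) * 100 }'
# ZFS on ZVOL (vm-103-disk-1): 67.6G on the Host vs ~0.8G in the Guest
awk 'BEGIN { printf "zfs  overhead: %.0f%%\n", (67.6 / 0.8 - 1) * 100 }'
```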
Guest VM pool properties (zpool get all zdata):
NAME PROPERTY VALUE SOURCE
zdata size 508G -
zdata capacity 0% -
zdata altroot - default
zdata health ONLINE -
zdata guid 11398056056706436130 -
zdata version - default
zdata bootfs - default
zdata delegation on default
zdata autoreplace off default
zdata cachefile - default
zdata failmode wait default
zdata listsnapshots off default
zdata autoexpand off default
zdata dedupratio 1.00x -
zdata free 507G -
zdata allocated 795M -
zdata readonly off -
zdata ashift 12 local
zdata comment - default
zdata expandsize - -
zdata freeing 0 -
zdata fragmentation 0% -
zdata leaked 0 -
zdata multihost off default
zdata checkpoint - -
zdata load_guid 6683170848207782254 -
zdata autotrim off default
zdata compatibility off default
zdata bcloneused 0 -
zdata bclonesaved 0 -
zdata bcloneratio 1.00x -
zdata feature@async_destroy enabled local
zdata feature@empty_bpobj active local
zdata feature@lz4_compress active local
zdata feature@multi_vdev_crash_dump enabled local
zdata feature@spacemap_histogram active local
zdata feature@enabled_txg active local
zdata feature@hole_birth active local
zdata feature@extensible_dataset active local
zdata feature@embedded_data active local
zdata feature@bookmarks enabled local
zdata feature@filesystem_limits enabled local
zdata feature@large_blocks enabled local
zdata feature@large_dnode enabled local
zdata feature@sha512 enabled local
zdata feature@skein enabled local
zdata feature@edonr enabled local
zdata feature@userobj_accounting active local
zdata feature@encryption enabled local
zdata feature@project_quota active local
zdata feature@device_removal enabled local
zdata feature@obsolete_counts enabled local
zdata feature@zpool_checkpoint enabled local
zdata feature@spacemap_v2 active local
zdata feature@allocation_classes enabled local
zdata feature@resilver_defer enabled local
zdata feature@bookmark_v2 enabled local
zdata feature@redaction_bookmarks enabled local
zdata feature@redacted_datasets enabled local
zdata feature@bookmark_written enabled local
zdata feature@log_spacemap active local
zdata feature@livelist enabled local
zdata feature@device_rebuild enabled local
zdata feature@zstd_compress enabled local
zdata feature@draid enabled local
zdata feature@zilsaxattr disabled local
zdata feature@head_errlog disabled local
zdata feature@blake3 disabled local
zdata feature@block_cloning disabled local
zdata feature@vdev_zaps_v2 disabled local
Guest ZFS properties (zfs get all zdata):
NAME PROPERTY VALUE SOURCE
zdata type filesystem -
zdata creation Sat Dec 30 21:26 2023 -
zdata used 794M -
zdata available 491G -
zdata referenced 96K -
zdata compressratio 1.42x -
zdata mounted no -
zdata quota none default
zdata reservation none default
zdata recordsize 128K default
zdata mountpoint /zdata local
zdata sharenfs off default
zdata checksum on default
zdata compression off local
zdata atime off local
zdata devices on default
zdata exec on default
zdata setuid on default
zdata readonly off default
zdata zoned off default
zdata snapdir hidden default
zdata aclmode discard default
zdata aclinherit restricted default
zdata createtxg 1 -
zdata canmount off local
zdata xattr on default
zdata copies 1 default
zdata version 5 -
zdata utf8only off -
zdata normalization none -
zdata casesensitivity sensitive -
zdata vscan off default
zdata nbmand off default
zdata sharesmb off default
zdata refquota none default
zdata refreservation none default
zdata guid 1402683579569969850 -
zdata primarycache all default
zdata secondarycache all default
zdata usedbysnapshots 0B -
zdata usedbydataset 96K -
zdata usedbychildren 794M -
zdata usedbyrefreservation 0B -
zdata logbias latency default
zdata objsetid 54 -
zdata dedup off default
zdata mlslabel none default
zdata sync standard default
zdata dnodesize legacy default
zdata refcompressratio 1.00x -
zdata written 0 -
zdata logicalused 1.00G -
zdata logicalreferenced 42K -
zdata volmode default default
zdata filesystem_limit none default
zdata snapshot_limit none default
zdata filesystem_count none default
zdata snapshot_count none default
zdata snapdev hidden default
zdata acltype off default
zdata context none default
zdata fscontext none default
zdata defcontext none default
zdata rootcontext none default
zdata relatime on default
zdata redundant_metadata all default
zdata overlay on default
zdata encryption off default
zdata keylocation none default
zdata keyformat none default
zdata pbkdf2iters 0 default
zdata special_small_blocks 0 default
zdata snapshots_changed Tue Apr 9 23:30:01 2024 -
Note: it is possible that this issue is caused by volblocksize / recordsize or a similar parameter, since Podman / Docker containers can generate lots of small files.
root@GUEST:/# find /home/podman/ -type f | wc -l
14475
That does not sound like much, though... On the Guest, recordsize is set to 128K; that's probably a bit high, isn't it?
Regardless, even if every file occupied a full 128K record, that would yield only: 14475 x 128K = 1852800K = 1852.8M ≈ 1.85G.
So it probably causes some overhead inside the guest, but nowhere near the level of the overhead between guest and host...
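That worst-case figure (one full 128K record wasted per file) can be sanity-checked with shell arithmetic:

```shell
# Worst case: every one of the 14475 files wastes a full 128K record
files=14475
recordsize_k=128
echo "$(( files * recordsize_k )) KiB"          # 1852800 KiB total
echo "$(( files * recordsize_k / 1024 )) MiB"   # ~1.8 GiB, nowhere near 67.6G
```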
Describe how to reproduce the problem
I don't think a VM is necessary to replicate this.
Creating a ZFS pool on top of a ZVOL on a single system (the Host) should be sufficient.
I disabled compression at the Guest level since it would only cause additional CPU load for no apparent benefit. Therefore compression should NOT be the cause of this huge overhead.
Notes
The idea of having a ZFS pool on top of the ZVOL is to have better control over ZFS snapshots.
In this case, once everything else has been configured correctly, the snapshot & backup plan for rpool/data/vm-103-disk-1 could be driven by the Guest, as opposed to by the Host as for many other VMs.
This avoids backing up non-useful data (such as container images or container storage) and backs up only useful / critical data (container configuration, secrets, data, certificates, volumes, ...), thus saving a lot of disk space on the backup server.
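As a rough sketch, a guest-driven plan along those lines might look like the following; the snapshot names, backup host, and target dataset are hypothetical, not taken from the setup above:

```shell
# Inside the Guest: snapshot only the datasets holding critical data,
# skipping e.g. the container image storage.
zfs snapshot -r zdata/PODMAN@daily-$(date +%F)

# Incrementally send the new snapshot to a backup target
# (backup-server and backup/GUEST/PODMAN are hypothetical names).
zfs send -i zdata/PODMAN@daily-2024-04-08 zdata/PODMAN@daily-2024-04-09 \
    | ssh backup-server zfs receive -u backup/GUEST/PODMAN

# Prune old snapshots on the Guest once they are safely received.
zfs destroy zdata/PODMAN@daily-2024-04-01
```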
Include any warning/errors/backtraces from the system logs
"TRIM"
It should be disabled on the Host because it's ZFS on top of LUKS; that's the default behavior, from what I understood at least.
However systemctl status fstrim.timer reports on the Host:
● fstrim.timer - Discard unused blocks once a week
Loaded: loaded (/lib/systemd/system/fstrim.timer; enabled; preset: enabled)
Active: active (waiting) since Fri 2024-04-05 10:29:41 CEST; 4 days ago
Trigger: Mon 2024-04-15 01:37:09 CEST; 5 days left
Triggers: ● fstrim.service
Docs: man:fstrim
Apr 05 10:29:41 pve16 systemd[1]: Started fstrim.timer - Discard unused blocks once a week.
On the Guest you might be right, though.
I always have SSD Emulation + Discard + IO thread enabled on all of my VMs.
But zpool get autotrim returns off both for the Host and the Guest VM.
systemctl status fstrim.timer reports on the Guest VM:
○ fstrim.timer - Discard unused blocks once a week
Loaded: loaded (/lib/systemd/system/fstrim.timer; disabled; preset: enabled)
Active: inactive (dead)
Trigger: n/a
Triggers: ● fstrim.service
Docs: man:fstrim
Any other command to check?
You misunderstood.
fstrim doesn't do anything with ZFS, and absent autotrim, it's not going to issue such requests without an explicit zpool trim in the guest, leaving the space that was freed in the guest still marked in use on the host.
And is that ZFS-specific? I mean, the ext4 partition for / on top of the ZVOL (like many other containers I have) doesn't really have this problem.
Somewhere I think I read that zpool trim is kind of dangerous concerning data loss. Isn't it?
Yes, the command zpool trim is ZFS specific.
There was an uncommon race with data mangling using any kind of TRIM that was fixed in 2.2 and 2.1.14.
I wouldn't suggest using any FS that you don't want to use TRIM with inside a VM if you're worried about the space usage when things are freed and not deleted on the host.
What do you mean exactly by your latest statement? That I should run zpool trim, or that I should not be running ZFS on top of a ZVOL?
- You should be using TRIM inside the VM on whatever filesystems you're using if you're worried about the delta between space reported used in the VM and space actually used on the zvol
- You should, conversely, probably not run a filesystem you're not willing to use TRIM on in that VM if you're worried about that.
- So if you think the aforementioned bug is a sign you shouldn't trust TRIM on ZFS, you should probably not run ZFS inside the VM.
- (I don't think that's the case, personally, but you may hold a different opinion than me.)
Thanks.
Hopefully there won't be any regression of that bug :D
However, nothing seems to be happening.
Issued zpool trim zdata on the Guest and zpool status -t reports on the Guest:
pool: zdata
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0B in 00:00:07 with 0 errors on Sun Mar 10 00:24:09 2024
config:
NAME STATE READ WRITE CKSUM
zdata ONLINE 0 0 0
PODMAN ONLINE 0 0 0 (100% trimmed, completed at Wed 10 Apr 2024 12:52:57 AM CEST)
errors: No known data errors
zfs list | grep "vm-103-disk-1" on the Host:
NAME USED AVAIL REFER MOUNTPOINT
rpool/data/vm-103-disk-1 67.8G 274G 989M -
Granted, it could be running in the background. But for now there is absolutely no change.
You may note that there are 3 columns there, and "referenced" changed pretty substantially.
True. So I just need to destroy the old snapshots of that dataset on the Host.
Yep.
Now zfs list | grep "vm-103-disk-1" yields:
NAME USED AVAIL REFER MOUNTPOINT
rpool/data/vm-103-disk-1 989M 312G 989M -
Maybe the root filesystem of the Guest VM (ext4) has autotrim enabled by default then ?
That could explain the behavior ...
For reference, you could have seen whether that was going to happen before doing it, either with zfs destroy -nv [list of snapshots] or by looking at the 4 different "usedby" properties, which sum to USED.
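For example, a dry run over a range of the ZVOL's snapshots would have printed what would be freed without touching anything; the snapshot names below are hypothetical:

```shell
# -n = dry run, -v = verbose: prints the snapshots that would be destroyed
# and an estimate of the space that would be reclaimed (the % syntax
# selects a range of snapshots).
zfs destroy -nv rpool/data/vm-103-disk-1@autosnap-2024-04-01%autosnap-2024-04-09

# The same information is visible in the space-accounting properties,
# which sum to USED:
zfs get usedbysnapshots,usedbydataset,usedbychildren,usedbyrefreservation \
    rpool/data/vm-103-disk-1
```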
zpool trim is for ZFS. The root FS is ext4, so fstrim will do as you expect.
Or, if it's safe enough, enable ZFS autotrim on the Guest VM?
zpool set autotrim=on zdata
The bug wasn't specific to automatic or manual trim, afair, so autotrim should be no less safe than manual trim.
There was an uncommon race with data mangling using any kind of TRIM that was fixed in 2.2 and 2.1.14.
There seems to be a new issue in 2.2.3: #16056.
Just so you're aware.
Kind of.
Note that #16056, from my quick reading, seems to be the result of hardware that lied and gave an invalid value for how big a TRIM can be, combined with a failure in the error-case handling for that, since it should not really ever happen. #16070 fixes the latter, but if I'm right about the former, it's not entirely clear what the right thing to do about it is.