Performance improvement for metadata object reads
Describe the feature you would like to see added to OpenZFS
ZFS should try to write metadata objects in adjacent sectors of the physical disks, so that vdev prefetch and the physical disk cache can kick in.
This could be done by reserving, say, 5-10% of the space on the disks for metadata only, so that every metadata object is written next to its parent, where possible.
Another possibility is to add a background process, similar to scrub, which performs metadata defragmentation.
How will this feature improve OpenZFS?
Currently, ZFS on big, slow magnetic disks performs quite poorly when enumerating filesystem content with the ls command. It generates roughly one IO operation per file or directory, so when there are thousands of files and folders, a simple ls can take hours to complete.
Other filesystems, like ext4, do no better in this regard.
I've noticed that once the metadata is in the ARC, ls is instantaneous, and other commands, like rm, are also very fast on slow disks.
Additional context
I'm using zfs-2.1.4 on Ubuntu 22, and the performance degradation is easily observable on a slow magnetic disk with a lot of files and directories by running ls -R /<your pool>/<fs name>.
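A minimal sketch of how to reproduce the measurement (tank and fs are placeholder names; exporting and re-importing the pool is one way to drop that pool's cached metadata from the ARC):

```sh
# Drop cached metadata for the pool by exporting and re-importing it
# (a reboot works too; "tank" and "fs" are placeholder names)
zpool export tank && zpool import tank

# Time a recursive listing; redirect the output so terminal rendering does not skew the result
time ls -R /tank/fs > /dev/null
```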
What do you think about such a feature? Is it doable?
Best regards, AR
Hi there,
I'm trying to understand the implications of such a feature. Does somebody know whether Sun's "ZFS On-Disk Specification" draft is still relevant?
Reading the zdb -dddd output is a bit overwhelming.
Best regards, AR
I believe this will help with rsync --xattr too.
Reference: https://www.reddit.com/r/zfs/comments/sgg9iu/quirk_on_using_zfs_dataset_as_rsync_target_xattr/
- ZFS won't have (IMHO) any `defrag` mechanism in the foreseeable future (`defrag` by its nature is a strange feature; it may be good for read-only workloads, but it's useless for the average any-percent-write workload). You may search for `block pointer rewrite + ZFS` or any topic on retroactive data changes in ZFS; the main point is that it's really expensive to implement.
- Sequential metadata for the same data is nearly useless, because ANY change to its metadata (a new file in a dir, a file append, etc.) will add to the metadata, so the proposed mechanism would bring a performance penalty too, instead of gains.
- Other FSes are doing the same because of that.
- Any `background` process which affects pool work is a poisonous practice and hard to maintain; you can't get predictable performance in this case.
- You won't get magic metadata performance on storage disks with IOPS of 100+ (HDD), because metadata access is nearly random 99% of the time.
 
tl;dr: I don't see real gains here; just use appropriate devices for IOPS (SSD/NVMe/big ARC).
PS: I wanted to point specifically at `defrag` + sequential metadata on disk, because there will always be room for improvement in the metadata logic itself.
Thank you very much CyberCr33p, the suggestion to use a small, fast SSD as L2ARC for metadata only looks promising; I'll investigate it.
But in general, it seems weird to me that I can unzip an archive with 10 thousand folders and 100 thousand files onto a ZFS filesystem residing on a slow HDD with 150 IOPS in under one minute (the archive is 600M; the sum of the unzipped file sizes is 4GB).
Yet listing that unzipped folder takes one hour, with the HDD reading small pieces of data all over the disk. Looking at the iosnoop output I see about one IO per file and directory, reading 512 bytes to 3 kbytes of data at "random" locations.
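A rough sketch of one way to capture this (iosnoop here is the perf-tools version; its output columns may differ between versions, and /tank/fs is a placeholder path):

```sh
# Trace block IO while the listing runs (perf-tools iosnoop assumed to be in PATH)
iosnoop > /tmp/ls-io.log &
TRACER=$!
ls -R /tank/fs > /dev/null
kill "$TRACER"

# Rough histogram of IO sizes: keep only data lines (numeric latency in the last column)
# and print the BYTES column; adjust the column index if your iosnoop output differs
awk '$NF ~ /^[0-9.]+$/ { print $(NF-1) }' /tmp/ls-io.log | sort -n | uniq -c | sort -rn | head
```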
I understand that any modification of metadata will rewrite the original object to another location. For this reason, we have atime and relatime switched off.
I think it is worth investigating. For long-term storage that only appends new data and is read occasionally, the ability to read metadata efficiently could be a plus.
Thanks, AR
Another option which will guarantee fast metadata access would be to use a special device:
A device dedicated solely for allocating various kinds of internal metadata, and optionally small file blocks. The redundancy of this device should match the redundancy of the other normal devices in the pool. If more than one special device is specified, then allocations are load-balanced between those devices.
A small SSD or SSD mirror makes metadata access very fast, and with the recent ZFS 2.1.6 changes, L2ARC and a special device can complement each other instead of duplicating data.
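As a rough sketch, adding such a special vdev and optionally routing small file blocks to it could look like this (the device names and the 32K threshold are examples only):

```sh
# Add a mirrored special vdev for metadata (example device names)
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1

# Optionally also place small file blocks (here <= 32K) on the special vdev
zfs set special_small_blocks=32K tank/fs
```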
Yes, a fast special vdev solves the problem. But it must be redundant, which adds complexity.
L2ARC with a cache vdev only solves it partially, because it has to be populated first, and the metadata will eventually be pushed out of the cache.
For data used only occasionally on a big disk, it will not help. My use case is a 20TB spinning disk containing millions of files in separate folders. Each such folder gets zipped after a year of service.
So my problem is that zip spends more time enumerating the files in a folder than zipping the data in them.
I'm doing research on how to optimize it. To support my research I've created a disk activity visualization script.

It could be useful also in other scenarios.
Take a look.
Regards, AR
I've experimented with setting primarycache and/or secondarycache to metadata only, which prevents record data from forcing metadata out of ARC/L2ARC as quickly, and that has helped with general performance in my case. But it also means some record data that would be nice to cache is lost, and you're at the mercy of when L2ARC is purged (a once-a-week task may be too infrequent to benefit).
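For reference, the settings I mean look roughly like this (the dataset name is just an example):

```sh
# Keep only metadata in the ARC and L2ARC for this dataset (example dataset name)
zfs set primarycache=metadata tank/archive
zfs set secondarycache=metadata tank/archive

# Check what is currently set
zfs get primarycache,secondarycache tank/archive
```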
I posted #15118 with the intention of allowing special devices to be used to accelerate reading without the need to worry about the redundancy, since it would give much more predictable performance than messing with L2ARC.
Otherwise the only way to accelerate your rsync or zipping is to try to "preload" data into ARC/L2ARC, which would mean running a script to stat the files and load extended attributes so that as many as possible are already cached before the operation. This doesn't really solve your problem, but if you know which file metadata you need, it at least lets you preload it in advance.
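A minimal sketch of such a preload pass, assuming the dataset is mounted at /tank/archive (getfattr comes from the attr package):

```sh
# Warm file and directory metadata into ARC/L2ARC by stat-ing everything once
find /tank/archive -xdev \( -type f -o -type d \) -exec stat {} + > /dev/null

# Also touch extended attributes so they are cached before the rsync/zip run
getfattr -R -d -m - /tank/archive > /dev/null 2>&1
```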
If the point is about speeding up metadata demand reads, then adding a fast special device will do the trick (at least for newly written data). In case it should be more flexible, like being able to affect existing data and removable if no longer needed, then something like the last paragraph of https://github.com/openzfs/zfs/issues/13460#issuecomment-1147375142 could be the most convenient way to scratch this particular itch.
@arg7, fantastic tool!
"Yes, fast special vdev solves the problem. But it must be redundant, which adds complexity." Same opinion here. I don't want to split my data across different types of disks, especially since a ZFS-capable SSD should not be a cheap consumer-grade SSD.
Besides optimizing object reads, it would be helpful if metadata eviction weren't as inconvenient as it is now.
Actually, I know of no way to make metadata stick better in memory, or at least to have it served from L2ARC. I have spent hours and hours analyzing and fiddling with ZFS parameters, and it's absolutely frustrating to see that you cannot get predictable metadata performance from the ARC.
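For what it's worth, on 2.1.x the ARC metadata knobs I mean are module parameters like these (values are only examples, and these tunables were reworked in later releases):

```sh
# Inspect the current ARC metadata limits (OpenZFS 2.1.x module parameters)
cat /sys/module/zfs/parameters/zfs_arc_meta_limit_percent
cat /sys/module/zfs/parameters/zfs_arc_meta_min

# Example: set the share of the ARC that metadata may use (75 is already the 2.1 default)
echo 75 > /sys/module/zfs/parameters/zfs_arc_meta_limit_percent
```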
https://github.com/openzfs/zfs/issues/12028