400% space waste because -o ashift=9 does not work in ZoL 2.x.x
System information

Type | Version/Name
---|---
Distribution Name | CentOS
Distribution Version | 7.9
Kernel Version | 3.10.0-1160.49.1.el7_lustre.x86_64
Architecture | AMD64
OpenZFS Version | 2.0.7
Describe the problem you're observing
Because there are extended attributes in the Lustre filesystem, a 4 KiB block wastes too much space (400%).
Here is the result after I replicated from ashift=12 (test_0) to ashift=9 (test_1).
test_0 had 3.98T allocated; after replication, test_1 uses only 971G.
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
test_0 6.94T 3.98T 2.96T - 80% 57% 1.00x ONLINE -
test_1 6.94T 971G 5.99T - 5% 13% 1.00x ONLINE -
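The ~400% figure in the title follows directly from that zpool list output; a quick sanity check of the ratio, using the ALLOC values copied from above:

```shell
# Allocated space for the same data: ashift=12 pool (3.98T) vs ashift=9 pool (971G).
awk 'BEGIN { printf "%.1fx\n", (3.98 * 1024) / 971 }'   # -> 4.2x, i.e. roughly 400%
```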
When I test ZoL 2.0.7, ashift=9 does not work:
dmesg | grep "ZFS pool version"
[ 4634.117936] ZFS: Loaded module v2.0.7-1, ZFS pool version 5000, ZFS filesystem version 5
zpool create -o ashift=9 tank raidz3 /dev/sd{a..p}
zdb -l /dev/sda1 | grep shi
metaslab_shift: 34
ashift: 12
This is the same as openzfs/zfs/issues/13557.
I hope ashift=9 can work in a future version.
Thank you.
Describe how to reproduce the problem
Include any warning/errors/backtraces from the system logs
You can improve the space efficiency of the xattrs stored by Lustre by setting the properties dnodesize=1k and xattr=sa. This should provide enough space for the xattrs to be co-located with the dnodes on disk, which is also good for performance. Unfortunately, due to a bug this isn't currently the default Lustre behavior as it should be: https://jira.whamcloud.com/browse/LU-16017.
Regarding setting the ashift, you'll want to verify that all your disks support a logical sector size of 512. If even one is a native 4k drive then ZFS won't be able to use a smaller ashift. You can check this by reading /sys/block/*/queue/logical_block_size.
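As a command sketch of that suggestion (the dataset name tank/ost0 is hypothetical; note that dnodesize only affects newly created dnodes, so set it before populating the dataset):

```shell
# Sketch: apply the suggested properties to a hypothetical Lustre OST dataset.
zfs set dnodesize=1k tank/ost0   # leave room for xattrs inside the dnode
zfs set xattr=sa tank/ost0       # store xattrs as system attributes, not separate blocks
zfs get dnodesize,xattr tank/ost0
```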
Hi Behlendorf, thank you. All HDDs are 512e:
# cat /sys/block/sd{a..q}/queue/physical_block_size
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
# cat /sys/block/sd{a..q}/queue/logical_block_size
512
512
512
512
512
512
512
512
512
512
512
512
512
512
512
512
512
# arc_summary | grep ashif
vdev_file_logical_ashift 9
vdev_file_physical_ashift 9
zfs_vdev_max_auto_ashift 16
zfs_vdev_min_auto_ashift 9
# for i in {a..q}
> do
> zdb -l /dev/sd${i}1 | grep ashift
> done
ashift: 12
ashift: 12
ashift: 12
ashift: 12
ashift: 12
ashift: 12
ashift: 12
ashift: 12
ashift: 12
ashift: 12
ashift: 12
ashift: 12
ashift: 12
ashift: 12
ashift: 12
ashift: 12
# zpool status
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz3-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
sdg ONLINE 0 0 0
sdh ONLINE 0 0 0
sdi ONLINE 0 0 0
sdj ONLINE 0 0 0
sdk ONLINE 0 0 0
sdl ONLINE 0 0 0
sdm ONLINE 0 0 0
sdn ONLINE 0 0 0
sdo ONLINE 0 0 0
sdp ONLINE 0 0 0
errors: No known data errors
Isn't 512e basically 4k sectors with firmware absorbing the RMW write cycles (poorly)? I think zfs is doing the right thing here, you really want 512n drives if you have a lot of small writes like this. Anything else is just a really bad fiction.
Now, zpool not listening to ashift=9 even when the device clearly supports it, well maybe that's a bug.
Indeed they are, according to the /sys/block output above. The physical_block_size=4k and logical_block_size=512, so I think ZFS is doing the right thing by defaulting to 4k (ashift=12) in order to avoid a lot of nasty performance-killing RMW on the drive. Still, it does look like a bug that you can't explicitly request it when the drive does support it.
Hi Jesus, Behlendorf,
Thanks in advance.
The bug is that if I add "-o ashift=9", it does not take effect.
E.g. when I register a GitHub account, I can choose to accept or decline the User Agreement; the choice is left to me.
I mean, I understand there is a lot of nasty performance-killing RMW on the drive.
In the performance-sensitive case, I will switch to the 4 KiB block.
I have three reasons why I need ashift=9 in my production environment:
- The many-tiny-files case
  - 5.5/11.5 = 47.8%; if you have tons of tiny files, oh no......
- The application benchmark
- The more-usable-capacity-the-better case, which does not care about performance
  - 164TiB/182TiB = 90%, so about 10% of usable capacity is lost
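Both percentages can be re-derived from the dsize and df figures in this thread:

```shell
# Per-file allocation: ashift=9 dsize (5.5K) vs ashift=12 dsize (11.5K).
awk 'BEGIN { printf "%.1f%%\n", 5.5 / 11.5 * 100 }'   # -> 47.8%
# Usable pool capacity: ashift=12 (164 TiB) vs ashift=9 (182 TiB).
awk 'BEGIN { printf "%.1f%%\n", 164 / 182 * 100 }'    # -> 90.1%
```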
Here are the test results on 0.7.13, because I can't switch to ashift=9 in 2.x.x.
Here is the 16x 16TB raidz3 test case:
# zpool create -f tank -o ashift=9 raidz3 /dev/sd{a..p}
# df -h /tank
Filesystem Size Used Avail Use% Mounted on
tank 182T 0 182T 0% /tank
# cd /tank
# openssl rand -out 4K.file 4096
# ls -lhs
total 5.5K
5.5K -rw-r--r-- 1 root root 4.0K Jul 27 09:26 4K.file
|
------------dsize
# zdb -v -O tank 4K.file
Object lvl iblk dblk dsize dnsize lsize %full type
2 1 128K 4K 5.00K 512 4K 100.00 ZFS plain file
168 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 0
------------------------------to test ashift=12----------------------------------
# zpool destroy tank
# zpool create -f tank -o ashift=12 raidz3 /dev/sd{a..p}
# df -h /tank
Filesystem Size Used Avail Use% Mounted on
tank 164T 256K 164T 1% /tank
# cd /tank
# openssl rand -out 4K.file 4096
# ls -lhs
total 12K
12K -rw-r--r-- 1 root root 4.0K Jul 27 09:23 4K.file
|
----------dsize
# zdb -v -O tank test_4K
Object lvl iblk dblk dsize dnsize lsize %full type
4 1 128K 512 11.5K 512 512 100.00 ZFS plain file
168 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 0
Arguably your point 3 would maybe be better addressed with compression on a dataset, letting zfs burn some CPU cycles decompressing the metadata rather than letting drive firmware grab the emulated block. You still effectively get an inflated read but the extra bytes are then in ARC rather than sitting on some more transient buffer on the drive's controller. The same is even more true in the other direction for writes, though you still might encounter a RMW scenario in the event of synchronous transactions.
But, your point still stands, if the device is capable of a 512b write and you tell it to do so against your best interest, it probably should let you, with some loud warnings at least.
Hi Jesus, yes, I hope the option could work; with some loud warnings is OK.
The easy way to test ashift=9 in 2.0.7 is a one-word patch at module/os/linux/zfs/vdev_disk.c:334:
- *physical_ashift = highbit64(MAX(physical_block_size,
+ *physical_ashift = highbit64(MIN(physical_block_size,
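For context on what that change does: ashift is just log2 of the sector size the vdev reports, so taking the MAX of the physical and logical block sizes yields 12 on these 512e drives, while MIN would yield 9. A small shell sketch of that relationship:

```shell
# ashift = log2(sector size), computed by repeated right-shift.
ashift() {
  b=$1 a=0
  while [ "$b" -gt 1 ]; do b=$((b >> 1)); a=$((a + 1)); done
  echo "$a"
}
ashift 4096   # physical_block_size on 512e drives -> 12 (what MAX selects)
ashift 512    # logical_block_size                 -> 9  (what MIN would select)
```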
In my ashift=9 tests, performance decreased severely with one HDD vendor.
Another vendor's 512e drives were OK.
If you want to upgrade to 20+TB HDDs with ashift=9, ZFS 0.7.x is not appropriate; it must be 0.8 or higher.
And if you add a special allocation class for offloading, both vendors work well under our workload.
If you want the highest performance, ashift=12 is the only choice.
Here is a directory test under 2.1.11. It appears that ashift=12 also wastes a lot of space on directories.
If I move data from an ashift=9 zpool to an ashift=12 zpool of the same capacity, the ashift=12 zpool may not be able to hold all of the data.
[ 31.457200] ZFS: Loaded module v2.1.11-1, ZFS pool version 5000, ZFS filesystem version 5
There are 2 x raidz3 (16+3) pools; test_ost_1-4K was created with ashift=12, test_ost_0 with ashift=9.
zdb -v -O test_ost_1-4K xxx/xxx/xxx/dir_0 | head
Object lvl iblk dblk dsize dnsize lsize %full type
1290 2 128K 16K 12.9M 1K 4.02M 100.00 ZFS directory
176 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 256
uid 0
gid 0
atime Wed Jun 21 14:38:17 2023
mtime Wed Jun 21 12:52:42 2023
zdb -v -O test_ost_0 xxx/xxx/xxx/dir_0 | head
Object lvl iblk dblk dsize dnsize lsize %full type
2182 2 128K 16K 3.37M 1K 4.02M 100.00 ZFS directory
176 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 256
uid 0
gid 0
atime Wed Jun 21 14:39:12 2023
mtime Wed Jun 21 12:52:45 2023
3453 drwxr-xr-x 2 root root 20003 Jun 21 12:52 0 <---ashift=9
13166 drwxr-xr-x 2 root root 20003 Jun 21 12:52 0 <---ashift=12
zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
test_ost_0 207T 556G 207T - - 0% 0% 1.00x ONLINE -
test_ost_1-4K 207T 698G 207T - - 0% 0% 1.00x ONLINE -
df -i /test_ost_1-4K /test_ost_0
Filesystem Inodes IUsed IFree IUse% Mounted on
test_ost_1-4K 354295106749 14401498 354280705251 1% /test_ost_1-4K
test_ost_0 373616575248 14401498 373602173750 1% /test_ost_0
df -B 1 /test_ost_1-4K /test_ost_0
Filesystem 1B-blocks Used Available Use% Mounted on
test_ost_1-4K 181990511607808 598790635520 181391720972288 1% /test_ost_1-4K
test_ost_0 191787165548544 502852747264 191284312801280 1% /test_ost_0
- 1B-blocks
  - ashift 9 = 181990511607808 bytes
  - ashift 12 = 191787165548544 bytes
  - 181990511607808/191787165548544 = 94.89%
- Used
  - the test script writes 20001 x 32K files (each write appending about 16K of data) to a single directory
  - ashift 9 = 502852747264 bytes
  - ashift 12 = 598790635520 bytes
  - 502852747264/598790635520 = 83.98%
- dir dsize
  - ashift 9 = 3.37M
  - ashift 12 = 12.9M
  - 3.37/12.9 = 26.12%
- file dsize
  - as in my previous reply
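Those ratios can be re-checked directly from the df output above:

```shell
# Percentage helper: a/b expressed as a percentage with two decimals.
pct() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f%%\n", a / b * 100 }'; }
pct 181990511607808 191787165548544   # 1B-blocks, ashift=9 vs ashift=12 -> 94.89%
pct 502852747264 598790635520         # Used, ashift=9 vs ashift=12     -> 83.98%
pct 3.37 12.9                         # dir dsize                        -> 26.12%
```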
The zpool create commands:
zpool create test_ost_0 -O canmount=on -O xattr=sa -O acltype=posixacl -O recordsize=256k -o ashift=9 -o multihost=on raidz3 /dev/disk/by-id/scsi-xxxxxxx
zpool create test_ost_1-4K -O canmount=on -O xattr=sa -O acltype=posixacl -O recordsize=256k -o ashift=12 -o multihost=on raidz3 /dev/disk/by-id/scsi-xxxxxxx