Add option none to zfs redundant_metadata property
Currently, additional copies of metadata are created on top of the redundancy already provided by the pool layout (mirror/raidz/draid). Because of this, roughly twice as much space is consumed per inode, which reduces the total number of inodes that can be created in the filesystem. Setting redundant_metadata to none disables these additional metadata copies, reducing the space consumed by metadata and increasing the total number of inodes that can be created in the filesystem.
Reviewed-by: Dipak Ghosh [email protected]
Signed-off-by: Akash B [email protected]
Motivation and Context
By default, ZFS stores extra copies of metadata, which increases space usage and reduces the total number of inodes that can be created in a filesystem.
Description
This patch adds a new option, "none", to the zfs redundant_metadata property, which can be used to disable the additional metadata copies (e.g. `zfs set redundant_metadata=none <dataset>`).
How Has This Been Tested?
We have tested the above patch on VMs and on servers (HDD), along with Lustre.
Tested this patch by:
- Creating 25 million directories with redundant_metadata set to "all" and "none":
  - redundant_metadata=all: space USED 591.3G at 1368.64 ops/sec
  - redundant_metadata=none: space USED 296.4G at 2966.02 ops/sec
- Creating 25 million 0k files with redundant_metadata set to "all" and "none":
  - redundant_metadata=all: space USED 20.1G at 10344.04 ops/sec
  - redundant_metadata=none: space USED 10.3G at 22050.56 ops/sec
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [x] Performance enhancement (non-breaking change which improves efficiency)
- [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
- [x] Documentation (a change to man pages or other documentation)
Checklist:
- [x] My code follows the OpenZFS code style requirements.
- [ ] I have updated the documentation accordingly.
- [x] I have read the contributing document.
- [ ] I have added tests to cover my changes.
- [x] I have run the ZFS Test Suite with this change applied.
- [x] All commit messages are properly formatted and contain Signed-off-by.
The documentation of this property in man/man7/zfsprops.7 needs to be updated as well.
Question: Does this "duplicate metadata on top of redundancy" also apply to metadata on special vdevs? If so, this would considerably increase the viability of using metadata special vdevs!
Yes, this would apply to special vdevs as well.
Wow, thanks for that bit of info, as that means triple-mirror special metadata vdevs are especially inefficient at the moment.
A few of the tests are failing in CI due to
```
/usr/share/zfs/zfs-tests/tests/functional/cli_root/zpool_import/zpool_import_features_001_pos.ksh[60]: log_must[70]: log_pos: line 265: 781205: Abort(coredump)
module/zfs/dmu.c:2049:7: runtime error: index 195 out of bounds for type 'dmu_object_type_info_t [54]'
    #0 0x7f157c809b98 in dmu_write_policy module/zfs/dmu.c:2049
    #1 0x7f157c7b6fc2 in dbuf_write module/zfs/dbuf.c:5028
    #2 0x7f157c7e23b2 in dbuf_sync_leaf module/zfs/dbuf.c:4498
    #3 0x7f157c7e4964 in dbuf_sync_list module/zfs/dbuf.c:4537
    #4 0x7f157c8e3085 in dnode_sync module/zfs/dnode_sync.c:850
    #5 0x7f157c81a1c4 in dmu_objset_sync_dnodes module/zfs/dmu_objset.c:1564
    #6 0x7f157c81a5a6 in sync_dnodes_task module/zfs/dmu_objset.c:1634
    #7 0x7f157c62f1d6 in taskq_thread lib/libzpool/taskq.c:240
    #8 0x7f157c628cb3 in zk_thread_wrapper lib/libzpool/kernel.c:89
    #9 0x7f157ae1d608 in start_thread /build/glibc-SzIz7B/glibc-2.31/nptl/pthread_create.c:477
    #10 0x7f157ad40132 in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x11f132)
ERROR: zhack -d /var/tmp/dev_import-test feature enable testpool1 com.test:xxx_unsup0 exited 262
NOTE: Performing test-fail callback (/usr/share/zfs/zfs-tests/callbacks/zfs_dbgmsg.ksh)
```
However, I'm not able to reproduce the above issue in my environment setup. Can anyone help me?
There are both old and new object types which we need to be careful to check for in this macro. You'd want to do something like DMU_OT_IS_METADATA does and check DMU_OT_NEWTYPE first; that missing check is what's causing the test failure.
```c
#define DMU_OT_IS_METADATA(ot) (((ot) & DMU_OT_NEWTYPE) ? \
    (((ot) & DMU_OT_METADATA) != 0) : \
    DMU_OT_IS_METADATA_IMPL(ot))
```
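For context, the failing index 195 is 0xC3, which has the DMU_OT_NEWTYPE bit (0x80) set, so it must never be used to index the legacy dmu_ot[] table that the UBSAN report points at. Below is a minimal standalone sketch (not ZFS source; the tiny table and the is_metadata() helper are toy stand-ins for dmu_ot[] and DMU_OT_IS_METADATA()) of why the macro has to branch on DMU_OT_NEWTYPE before doing a table lookup. The flag values are taken from include/sys/dmu.h.

```c
/*
 * Standalone illustration (not ZFS source): why a new-style object
 * type such as 195 (0xC3) must not index the legacy dmu_ot[] table.
 */
#include <stdio.h>
#include <stdbool.h>

#define	DMU_OT_NEWTYPE	0x80	/* type encodes its own flags */
#define	DMU_OT_METADATA	0x40	/* "is metadata" flag for new-style types */

/* Toy stand-in for the 54-entry dmu_ot[] table of legacy types. */
static const bool legacy_is_metadata[] = {
	false,	/* DMU_OT_NONE */
	true,	/* DMU_OT_OBJECT_DIRECTORY */
	/* ... remaining legacy entries elided ... */
};

static bool
is_metadata(unsigned ot)
{
	if (ot & DMU_OT_NEWTYPE)	/* new-style: read the flag bit */
		return ((ot & DMU_OT_METADATA) != 0);
	/* legacy type: safe to use as a table index */
	return (legacy_is_metadata[ot]);
}

int
main(void)
{
	unsigned ot = 0xC3;	/* the type from the UBSAN report (195) */

	/* Without the DMU_OT_NEWTYPE check this would index table[195]. */
	printf("type 0x%x is %smetadata\n", ot, is_metadata(ot) ? "" : "not ");
	return (0);
}
```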
That said, to make sure we're targeting only the highest-impact types, I dumped the usage by object type, with zdb -bbb, from a couple of production Lustre MDTs. As expected, the biggest consumers were the DMU_OT_DNODE, DMU_OT_DIRECTORY_CONTENTS, and DMU_OT_SA objects. Everything else tagged as metadata was functionally insignificant, so making redundant copies of those costs us almost nothing. Given that, we can simplify the "some" policy to all metadata except these three types, i.e.:
```c
#define DMU_OT_IS_CRITICAL(ot) (((ot) & DMU_OT_NEWTYPE) ? \
    (((ot) & DMU_OT_METADATA) != 0) : \
    (DMU_OT_IS_METADATA_IMPL(ot) && \
    (ot) != DMU_OT_DNODE && \
    (ot) != DMU_OT_DIRECTORY_CONTENTS && \
    (ot) != DMU_OT_SA))
```
With Data-on-MDS enabled, only ~13% of the total capacity is consumed by ZFS-tagged metadata on an MDT. So just to reiterate, the big win from the "some" option isn't capacity but create/unlink performance.
```
 Blocks   LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
  58.0M   3.38T   3.38T   3.38T   59.6K    1.00    86.88  ZFS plain file
  48.6M   70.5G   70.5G    388G      8K    1.00     9.76  System attributes
  10.8M    561G   45.6G   91.3G   8.46K   12.30     2.29  ZFS directory
  3.85M   62.2G   19.5G   38.9G   10.1K    3.19     0.98  DMU dnode
  18.0K   2.11G   1.24G   3.73G    212K    1.70     0.09  SPA space map
  <0.01   <everything else>
```
@behlendorf your macro can be implemented in terms of DMU_OT_IS_METADATA:
```c
#define DMU_OT_IS_CRITICAL(ot) \
    (DMU_OT_IS_METADATA(ot) && \
    (ot) != DMU_OT_DNODE && \
    (ot) != DMU_OT_DIRECTORY_CONTENTS && \
    (ot) != DMU_OT_SA)
```
Although as I mentioned in my other comment, maybe we don't really care about keeping an extra copy of these "critical" objects, since there are still several unduplicated blocks that could take out access to the entire dataset (e.g. the first block of dnodes, the root directory contents, etc.).
@ahrens @behlendorf If any corruption has a chance of losing the whole dataset regardless of the "critical" object types, then distinguishing "critical" object types might not be necessary. Since the MOS always uses ZFS_REDUNDANT_METADATA_ALL, pool-wide metadata is still stored redundantly in any case. So I have updated the "none" policy to create no extra metadata copies, and the "some" policy to create extra copies for all metadata except these three types, i.e.:
```c
(DMU_OT_IS_METADATA(ot) && \
    (ot) != DMU_OT_DNODE && \
    (ot) != DMU_OT_DIRECTORY_CONTENTS && \
    (ot) != DMU_OT_SA)
```
Also, I have modified the man page accordingly.
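To make the behavior discussed above easier to follow, here is a minimal, self-contained sketch. It is not the actual OpenZFS write-policy code: the enum omits the pre-existing "all"/"most" values' real names, and extra_copies() and its parameters are hypothetical. It only illustrates how the "none" and "some" policies map an object-type classification to the number of extra metadata copies.

```c
/*
 * Illustrative sketch only -- not the actual OpenZFS write-policy code.
 * The real ZFS_REDUNDANT_METADATA_* enum and the "most" behavior are
 * elided; extra_copies() and its parameters are hypothetical.
 */
#include <stdio.h>
#include <stdbool.h>

typedef enum {
	REDUNDANT_METADATA_ALL,		/* existing behavior: extra copy of all metadata */
	REDUNDANT_METADATA_SOME,	/* extra copy of "critical" metadata only */
	REDUNDANT_METADATA_NONE		/* no extra metadata copies at all */
} redundant_metadata_t;

/*
 * Number of extra copies to add on top of the dataset's "copies" value
 * for a block, given whether its object type counts as metadata and
 * whether it is "critical" (metadata other than dnode, directory
 * contents, and SA, per DMU_OT_IS_CRITICAL above).
 */
static int
extra_copies(redundant_metadata_t policy, bool is_metadata, bool is_critical)
{
	if (!is_metadata)
		return (0);

	switch (policy) {
	case REDUNDANT_METADATA_ALL:
		return (1);
	case REDUNDANT_METADATA_SOME:
		return (is_critical ? 1 : 0);
	case REDUNDANT_METADATA_NONE:
	default:
		return (0);
	}
}

int
main(void)
{
	/* e.g. an SA block (metadata, but not "critical"): */
	printf("all:  %d\n", extra_copies(REDUNDANT_METADATA_ALL, true, false));
	printf("some: %d\n", extra_copies(REDUNDANT_METADATA_SOME, true, false));
	printf("none: %d\n", extra_copies(REDUNDANT_METADATA_NONE, true, false));
	return (0);
}
```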
The zfs-tests-functional run seems to be failing for unrelated reasons; like many other CI jobs, it is timing out (Error: The action has timed out.).
@akashb-22 do you have any benchmarks with redundant_metadata=most?