
Add option none to zfs redundant_metadata property

Open · akashb-22 opened this issue 3 years ago

Currently, additional copies of metadata are created on top of the redundancy already provided by the pool layout (mirror/raidz/draid). Because of this, roughly twice as much space is used per inode, which reduces the total number of inodes that can be created in the filesystem. By setting redundant_metadata to none, no additional copies of metadata are created, which reduces the space consumed by the extra metadata copies and increases the total number of inodes that can be created in the filesystem.

Reviewed-by: Dipak Ghosh [email protected]
Signed-off-by: Akash B [email protected]

Motivation and Context

By default, ZFS stores extra copies of metadata, which leads to additional space usage and reduces the total number of inodes available in a filesystem.

Description

This patch adds a new option, "none", to the zfs redundant_metadata property, which can be used to disable the additional metadata copies.
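For example (pool and dataset names below are placeholders), the property can be set and verified with:

    zfs set redundant_metadata=none tank/fs
    zfs get redundant_metadata tank/fs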

How Has This Been Tested?

We have tested the above patch on VMs and on servers (HDD), along with Lustre. We tested by creating 25 million directories with redundant_metadata set to "all" and to "none":

  • redundant_metadata=all: space USED is 591.3G at 1368.64 ops/sec
  • redundant_metadata=none: space USED is 296.4G at 2966.02 ops/sec

We also created 25 million 0k files with redundant_metadata set to "all" and to "none":

  • redundant_metadata=all: space USED is 20.1G at 10344.04 ops/sec
  • redundant_metadata=none: space USED is 10.3G at 22050.56 ops/sec
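As a rough illustration of the methodology only (dataset name and file count below are placeholders, not the exact workload we ran):

    zfs create -o redundant_metadata=none tank/t1
    for i in $(seq 1 1000000); do touch /tank/t1/f$i; done
    sync
    zfs get -Hp -o value used tank/t1    # space USED, in bytes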

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [x] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • [x] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the OpenZFS code style requirements.
  • [ ] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [ ] I have added tests to cover my changes.
  • [x] I have run the ZFS Test Suite with this change applied.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

akashb-22 avatar Jul 20 '22 11:07 akashb-22

The documentation for this property in man/man7/zfsprops.7 needs to be updated as well.

ghost avatar Jul 20 '22 16:07 ghost

Question: Does this "duplicate metadata on top of redundancy" also apply to metadata on special vdevs? If so, this would considerably increase the viability of using metadata special vdevs!

PrivatePuffin avatar Aug 24 '22 14:08 PrivatePuffin

Yes, this would apply to special vdevs as well.

akashb-22 avatar Sep 02 '22 10:09 akashb-22

> Yes, this would apply to special vdevs as well.

Wow, thanks for that bit of info, as that means triple-mirror special metadata vdevs are especially inefficient at the moment.

PrivatePuffin avatar Sep 02 '22 11:09 PrivatePuffin

A few of the tests are failing in CI due to:

/usr/share/zfs/zfs-tests/tests/functional/cli_root/zpool_import/zpool_import_features_001_pos.ksh[60]: log_must[70]: log_pos: line 265: 781205: Abort(coredump)
module/zfs/dmu.c:2049:7: runtime error: index 195 out of bounds for type 'dmu_object_type_info_t [54]'
    #0 0x7f157c809b98 in dmu_write_policy module/zfs/dmu.c:2049
    #1 0x7f157c7b6fc2 in dbuf_write module/zfs/dbuf.c:5028
    #2 0x7f157c7e23b2 in dbuf_sync_leaf module/zfs/dbuf.c:4498
    #3 0x7f157c7e4964 in dbuf_sync_list module/zfs/dbuf.c:4537
    #4 0x7f157c8e3085 in dnode_sync module/zfs/dnode_sync.c:850
    #5 0x7f157c81a1c4 in dmu_objset_sync_dnodes module/zfs/dmu_objset.c:1564
    #6 0x7f157c81a5a6 in sync_dnodes_task module/zfs/dmu_objset.c:1634
    #7 0x7f157c62f1d6 in taskq_thread lib/libzpool/taskq.c:240
    #8 0x7f157c628cb3 in zk_thread_wrapper lib/libzpool/kernel.c:89
    #9 0x7f157ae1d608 in start_thread /build/glibc-SzIz7B/glibc-2.31/nptl/pthread_create.c:477
    #10 0x7f157ad40132 in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x11f132)

ERROR: zhack -d /var/tmp/dev_import-test feature enable testpool1 com.test:xxx_unsup0 exited 262
NOTE: Performing test-fail callback (/usr/share/zfs/zfs-tests/callbacks/zfs_dbgmsg.ksh)

However, I'm not able to reproduce the above issue in my environment. Can anyone help me?
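For reference, I'm trying to reproduce with something like the following; this is only a sketch, assuming a UBSAN-enabled build (the CI failure is a UBSAN runtime error), and the exact script options may differ:

    ./configure --enable-ubsan
    make -j$(nproc)
    ./scripts/zfs-tests.sh -t functional/cli_root/zpool_import/zpool_import_features_001_pos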

akashb-22 avatar Oct 07 '22 14:10 akashb-22

There are both old- and new-style object types which we need to be careful to check for in this macro. You'd want to do something like DMU_OT_IS_METADATA does and check for DMU_OT_NEWTYPE; this is what's causing the test failure.

#define DMU_OT_IS_METADATA(ot) (((ot) & DMU_OT_NEWTYPE) ? \
        (((ot) & DMU_OT_METADATA) != 0) : \
        DMU_OT_IS_METADATA_IMPL(ot))
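For context, a new-style type value is a set of flag bits; roughly, from include/sys/dmu.h:

#define DMU_OT_NEWTYPE       0x80    /* marks a new-style, flag-encoded type */
#define DMU_OT_METADATA      0x40    /* the object holds metadata */
#define DMU_OT_ENCRYPTED     0x20
#define DMU_OT_BYTESWAP_MASK 0x1f

/*
 * The legacy check indexes the fixed dmu_ot[] table, so it is only
 * safe for legacy (small-integer) type values:
 */
#define DMU_OT_IS_METADATA_IMPL(ot) (dmu_ot[ot].ot_metadata)

A new-style value such as 195 (0xC3) has DMU_OT_NEWTYPE set, so indexing the 54-entry dmu_ot[] table with it is exactly the out-of-bounds abort shown above.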

That said, to make sure we're targeting only the highest-impact types, I dumped the usage by object type, with zdb -bbb, from a couple of production Lustre MDTs. As expected, the biggest consumers were the DMU_OT_DNODE, DMU_OT_DIRECTORY_CONTENTS, and DMU_OT_SA objects. Everything else tagged as metadata was functionally insignificant; making redundant copies of those costs us almost nothing. Given that, we can simplify the "some" policy to all metadata except these three types, i.e.:

#define DMU_OT_IS_CRITICAL(ot) (((ot) & DMU_OT_NEWTYPE) ? \
        (((ot) & DMU_OT_METADATA) != 0) : \
        (DMU_OT_IS_METADATA_IMPL(ot) && \
        (ot != DMU_OT_DNODE) && \
        (ot != DMU_OT_DIRECTORY_CONTENTS) && \
        (ot != DMU_OT_SA)))

With Data-on-MDS enabled, only ~13% of the total capacity is consumed by ZFS-tagged metadata on an MDT. So just to reiterate, the big win from the "some" option isn't capacity but create/unlink performance.

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
 58.0M  3.38T   3.38T   3.38T   59.6K    1.00    86.88  ZFS plain file
 48.6M  70.5G   70.5G    388G      8K    1.00     9.76  System attributes
 10.8M   561G   45.6G   91.3G   8.46K   12.30     2.29  ZFS directory
 3.85M  62.2G   19.5G   38.9G   10.1K    3.19     0.98  DMU dnode
 18.0K  2.11G   1.24G   3.73G    212K    1.70     0.09  SPA space map
                                                 <0.01  <everything else>  

behlendorf avatar Oct 07 '22 20:10 behlendorf

@behlendorf your macro can be implemented in terms of DMU_OT_IS_METADATA:

#define DMU_OT_IS_CRITICAL(ot) \
        (DMU_OT_IS_METADATA(ot) && \
        (ot) != DMU_OT_DNODE && \
        (ot) != DMU_OT_DIRECTORY_CONTENTS && \
        (ot) != DMU_OT_SA)

Although, as I mentioned in my other comment, maybe we don't really care about keeping an extra copy of these "critical" objects, since there are still several unduplicated blocks that could take out access to the entire dataset (e.g. the first block of dnodes, the root directory contents, etc.).

ahrens avatar Oct 07 '22 21:10 ahrens

@ahrens @behlendorf If any corruption in the dataset has a chance of losing the whole dataset regardless of the "critical" object types, then distinguishing "critical" object types might not be necessary. Since the MOS always uses ZFS_REDUNDANT_METADATA_ALL, at least the pool-wide metadata is still stored redundantly. So I have updated the "none" policy to create no extra metadata copies, and the "some" policy to keep extra copies for all metadata except these three types, i.e.:

#define DMU_OT_IS_CRITICAL(ot) \
        (DMU_OT_IS_METADATA(ot) && \
        (ot) != DMU_OT_DNODE && \
        (ot) != DMU_OT_DIRECTORY_CONTENTS && \
        (ot) != DMU_OT_SA)

Also, I have modified the man page accordingly.
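Abridged, the copies selection for metadata in dmu_write_policy() now looks roughly like this (a sketch only; see the patch for the exact context and surrounding conditions):

    switch (os->os_redundant_metadata) {
    case ZFS_REDUNDANT_METADATA_ALL:
            copies++;
            break;
    case ZFS_REDUNDANT_METADATA_MOST:
            if (level >= zfs_redundant_metadata_most_ditto_level ||
                DMU_OT_IS_METADATA(type) || (wp & WP_SPILL))
                    copies++;
            break;
    case ZFS_REDUNDANT_METADATA_SOME:
            if (DMU_OT_IS_CRITICAL(type))
                    copies++;
            break;
    case ZFS_REDUNDANT_METADATA_NONE:
            break;
    }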

akashb-22 avatar Oct 08 '22 15:10 akashb-22

The zfs-tests-functional run seems to be failing due to unrelated issues; like many other CI jobs, it is failing with a timeout (Error: The action has timed out.).

akashb-22 avatar Oct 13 '22 16:10 akashb-22

@akashb-22 do you have any benchmarks with redundant_metadata=most ?

CyberCr33p avatar Jan 22 '23 17:01 CyberCr33p