zfs icon indicating copy to clipboard operation
zfs copied to clipboard

ztest fails assertion in zio_write_gang_member_ready()

Open ahrens opened this issue 3 years ago • 0 comments

The following assertion failure is observed when running ztest (also mentioned in #12214):

BP_GET_NDVAS(zio->io_bp) <= BP_GET_NDVAS(pio->io_bp) (0x3 <= 0x2)
ASSERT at ../../module/zfs/zio.c:2728:zio_write_gang_member_ready()/sbin/ztest(+0x95f3)[0x55953d6495f3]

This is a member of a gang block, and it has 3 DVA’s, but its parent (the GBH, gang block header) has 2 DVA’s, which is unexpected (we can call this an “NDVAs inversion”).

The parent (GBH) BP has the X (encrypt) bit set (blk_prop high nibble = 0x8 (byteorder) + 0x2 (encrypt) ):

1(gdb) p *pio->io_bp
2$14 = {
3  blk_dva = {{
4      dva_word = {0x1, 0x8000000000090f4a}
5    }, {
6      dva_word = {0x1, 0x80000000000a0930}
7    }, {
8      dva_word = {0x0, 0x0}
9    }},
10  blk_prop = 0xa00b070200070007,
11  blk_birth = 0x33f,
12 }

The child BP does not have the X bit set (blk_prop high nibble = 0x8 (byteorder) ):

1(gdb) p *zio->io_bp
2$7 = {
3  blk_dva = {{
4      dva_word = {0x3, 0x54225}
5    }, {
6      dva_word = {0x3, 0x40f65}
7    }, {
8      dva_word = {0x3, 0x815fc}
9    }},
10  blk_prop = 0x8000070200020002,
11  blk_birth = 0x33f,
12}

Both the zio and its gang leader (GBH) requested 3 DVA’s:

1(gdb) p zio->io_gang_leader->io_prop
2$19 = {
3  zp_type = DMU_OT_OBJSET,
4  zp_level = 0x0,
5  zp_copies = 0x3,
6  zp_dedup = B_FALSE,
7  zp_dedup_verify = B_FALSE,
8  zp_nopwrite = B_FALSE,
9  zp_encrypt = B_TRUE,
10  zp_byteorder = B_TRUE,
11}
12(gdb) p zio->io_prop
13$20 = {
14  zp_type = DMU_OT_NONE,
15  zp_level = 0x0,
16  zp_copies = 0x3,
17  zp_dedup = B_FALSE,
18  zp_dedup_verify = B_FALSE,
19  zp_nopwrite = B_FALSE,
20  zp_encrypt = B_TRUE,
21  zp_byteorder = B_TRUE,
22}

Again this makes sense, because DMU_OT_OBJSET is not encrypted, so dmu_write_policy() allows it to request 3 DVA’s. However, the gang code assumes that if encryption was requested, the GBH will be encrypted, so zio_write_gang_block() has:

1	/*
2	 * encrypted blocks need DVA[2] free so encrypted gang headers can't
3	 * have a third copy.
4	 */
5	gbh_copies = MIN(copies + 1, spa_max_replication(spa));
6	if (gio->io_prop.zp_encrypt && gbh_copies >= SPA_DVAS_PER_BP)
7		gbh_copies = SPA_DVAS_PER_BP - 1;

I think this code is overly restrictive. We only need to limit to 2 DVA’s if the data is actually encrypted (BP_IS_ENCRYPTED()). 3 DVA’s can be used if encryption is on for the dataset (zp_encrypt), but this particular block is merely authenticated or checksum-of-MACs. Per the comment in spa.h:

1* If this is a level 0 block with an unencrypted
2 * object type [e.g. DMU_OT_OBJSET], this block is authenticated with an HMAC (see
3 * BP_IS_AUTHENTICATED()). Otherwise (if level > 0), this bp will use the MAC
4 * words to store a checksum-of-MACs from the level below (see
5 * BP_HAS_INDIRECT_MAC_CKSUM()). 
6 ...
7 * The salt and IV are not used for authenticated bps [e.g. DMU_OT_OBJSET] or bps 
8 * with an indirect MAC checksum [e.g. indirect blocks], so these blocks can 
9 * utilize all 3 DVAs and the full 64 bits for the fill count.

The zio_write_gang_block() code will only lead to a “NDVA’s inversion” when we have encryption, the block itself can have 3 DVAs (e.g. it’s an indirect block or an unencrypted block, which I think is just DMU_OT_OBJSET, MASTER_NODE, or ZVOL_PROP), and 3 copies were requested (because the copies property is at least 2).

This “NDVAs inversion” is unnecessary; we could use all 3 DVAs of the GBH even with encryption, for these block types. However, I would guess that if we ignore the assertion, everything else will work fine.

Suggested (untested) fix:

diff --git a/module/zfs/zio.c b/module/zfs/zio.c
index 6c6362963f0..90d6f8b4234 100644
--- a/module/zfs/zio.c
+++ b/module/zfs/zio.c
@@ -2859,7 +2859,7 @@ zio_write_gang_block(zio_t *pio, metaslab_class_t *mc)
         * have a third copy.
         */
        gbh_copies = MIN(copies + 1, spa_max_replication(spa));
-       if (gio->io_prop.zp_encrypt && gbh_copies >= SPA_DVAS_PER_BP)
+       if (BP_IS_ENCRYPTED(bp) && gbh_copies >= SPA_DVAS_PER_BP)
                gbh_copies = SPA_DVAS_PER_BP - 1;

        int flags = METASLAB_HINTBP_FAVOR | METASLAB_GANG_HEADER;

ahrens avatar Dec 02 '22 17:12 ahrens