ztest fails assertion in zio_write_gang_member_ready()
The following assertion failure is observed when running ztest (also mentioned in #12214):
BP_GET_NDVAS(zio->io_bp) <= BP_GET_NDVAS(pio->io_bp) (0x3 <= 0x2)
ASSERT at ../../module/zfs/zio.c:2728:zio_write_gang_member_ready()/sbin/ztest(+0x95f3)[0x55953d6495f3]
This is a member of a gang block, and it has 3 DVA’s, but its parent (the GBH, gang block header) has 2 DVA’s, which is unexpected (we can call this an “NDVAs inversion”).
The parent (GBH) BP has the X (encrypt) bit set (blk_prop high nibble = 0x8 (byteorder) + 0x2 (encrypt) ):
1(gdb) p *pio->io_bp
2$14 = {
3 blk_dva = {{
4 dva_word = {0x1, 0x8000000000090f4a}
5 }, {
6 dva_word = {0x1, 0x80000000000a0930}
7 }, {
8 dva_word = {0x0, 0x0}
9 }},
10 blk_prop = 0xa00b070200070007,
11 blk_birth = 0x33f,
12 }
The child BP does not have the X bit set (blk_prop high nibble = 0x8 (byteorder) ):
1(gdb) p *zio->io_bp
2$7 = {
3 blk_dva = {{
4 dva_word = {0x3, 0x54225}
5 }, {
6 dva_word = {0x3, 0x40f65}
7 }, {
8 dva_word = {0x3, 0x815fc}
9 }},
10 blk_prop = 0x8000070200020002,
11 blk_birth = 0x33f,
12}
Both the zio and its gang leader (GBH) requested 3 DVA’s:
1(gdb) p zio->io_gang_leader->io_prop
2$19 = {
3 zp_type = DMU_OT_OBJSET,
4 zp_level = 0x0,
5 zp_copies = 0x3,
6 zp_dedup = B_FALSE,
7 zp_dedup_verify = B_FALSE,
8 zp_nopwrite = B_FALSE,
9 zp_encrypt = B_TRUE,
10 zp_byteorder = B_TRUE,
11}
12(gdb) p zio->io_prop
13$20 = {
14 zp_type = DMU_OT_NONE,
15 zp_level = 0x0,
16 zp_copies = 0x3,
17 zp_dedup = B_FALSE,
18 zp_dedup_verify = B_FALSE,
19 zp_nopwrite = B_FALSE,
20 zp_encrypt = B_TRUE,
21 zp_byteorder = B_TRUE,
22}
Again this makes sense, because DMU_OT_OBJSET is not encrypted, so dmu_write_policy() allows it to request 3 DVA’s. However, the gang code assumes that if encryption was requested, the GBH will be encrypted, so zio_write_gang_block() has:
1 /*
2 * encrypted blocks need DVA[2] free so encrypted gang headers can't
3 * have a third copy.
4 */
5 gbh_copies = MIN(copies + 1, spa_max_replication(spa));
6 if (gio->io_prop.zp_encrypt && gbh_copies >= SPA_DVAS_PER_BP)
7 gbh_copies = SPA_DVAS_PER_BP - 1;
I think this code is overly restrictive. We only need to limit to 2 DVA’s if the data is actually encrypted (BP_IS_ENCRYPTED()). 3 DVA’s can be used if encryption is on for the dataset (zp_encrypt), but this particular block is merely authenticated or checksum-of-MACs. Per the comment in spa.h:
1* If this is a level 0 block with an unencrypted
2 * object type [e.g. DMU_OT_OBJSET], this block is authenticated with an HMAC (see
3 * BP_IS_AUTHENTICATED()). Otherwise (if level > 0), this bp will use the MAC
4 * words to store a checksum-of-MACs from the level below (see
5 * BP_HAS_INDIRECT_MAC_CKSUM()).
6 ...
7 * The salt and IV are not used for authenticated bps [e.g. DMU_OT_OBJSET] or bps
8 * with an indirect MAC checksum [e.g. indirect blocks], so these blocks can
9 * utilize all 3 DVAs and the full 64 bits for the fill count.
The zio_write_gang_block() code will only lead to a “NDVA’s inversion” when we have encryption, the block itself can have 3 DVAs (e.g. it’s an indirect block or an unencrypted block, which I think is just DMU_OT_OBJSET, MASTER_NODE, or ZVOL_PROP), and 3 copies were requested (because the copies property is at least 2).
This “NDVAs inversion” is unnecessary; we could use all 3 DVAs of the GBH even with encryption, for these block types. However, I would guess that if we ignore the assertion, everything else will work fine.
Suggested (untested) fix:
diff --git a/module/zfs/zio.c b/module/zfs/zio.c
index 6c6362963f0..90d6f8b4234 100644
--- a/module/zfs/zio.c
+++ b/module/zfs/zio.c
@@ -2859,7 +2859,7 @@ zio_write_gang_block(zio_t *pio, metaslab_class_t *mc)
* have a third copy.
*/
gbh_copies = MIN(copies + 1, spa_max_replication(spa));
- if (gio->io_prop.zp_encrypt && gbh_copies >= SPA_DVAS_PER_BP)
+ if (BP_IS_ENCRYPTED(bp) && gbh_copies >= SPA_DVAS_PER_BP)
gbh_copies = SPA_DVAS_PER_BP - 1;
int flags = METASLAB_HINTBP_FAVOR | METASLAB_GANG_HEADER;