
Terrible performance of zfs recv due to multi-slot dnodes

Open IvanVolosyuk opened this issue 5 years ago • 27 comments

Distribution Name: Gentoo
Distribution Version: Latest
Linux Kernel: 5.4.81-gentoo-x86_64
Architecture: x86_64
ZFS Version: 2.0.0

Describe the problem you're observing

I am trying to replicate a Gentoo system dataset which I fully rebuilt from sources.

I have zfs_txg_timeout set to 120 seconds with sync=disabled on all pools. Yet zfs recv generates ~12 txgs/s, writes only ~7MB/s according to 'zpool iostat', and consumes ~100k/s of the send stream according to 'pv -B100M'. 115MB of the send stream is consumed in 12 minutes.

This is the first time I have seen replication this slow. Usually it is much faster, but usually I don't have this many changed files to replicate. The disk-wait histogram from 'zpool iostat -w archive' shows the majority of 'disk_wait/write' in the 524us bucket, while 'disk_wait/read' has a wide distribution from 131us to 134ms. 'zpool status' reports no errors. I have already upgraded my pools to ZFS 2.0, so I cannot try with 0.8.6.
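For anyone trying to reproduce this, a minimal sketch of how these rates can be watched while the receive runs; the pool name archive is taken from this report and the kstat path assumes ZFS on Linux:

# per-txg statistics (dirty bytes, open/quiesce/sync times) for the receiving pool
tail -n 20 /proc/spl/kstat/zfs/archive/txgs

# I/O latency histograms for the receiving pool, refreshed every 5 seconds
zpool iostat -w archive 5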

Describe how to reproduce the problem

Recompile the entire system in Gentoo ('emerge -e world'), snapshot, and send the diff:

zfs snapshot ssd/root@2020-12-16
zfs send -L -i ssd/root@2020-12-15 ssd/root@2020-12-16  | zfs recv archive/replica/root@2020-12-16

ssd is a Samsung SSD which delivers a 500MiB/s send stream. archive is a raidz of 4 Seagate 4T disks plus an SSD L2ARC.

The stream is ~7GB and contains 12476 changed files, as reported by

zfs diff ssd/root@2020-12-15 ssd/root@2020-12-16 | wc -l

CPU: 16-thread i9 9900KS; the destination is encrypted and gzip-8 compressed.

Include any warning/errors/backtraces from the system logs

No errors

IvanVolosyuk avatar Dec 16 '20 13:12 IvanVolosyuk

Sounds like something is causing it to do txg_wait_synced() (push out txg's) very frequently. Given that this is a simple send stream (only one snapshot, no encryption), I'm not sure what would cause that. How much free space is on the target pool/filesystem? You could check the stack trace where txg_wait_synced() is being called (e.g. with bpftrace).

ahrens avatar Dec 16 '20 18:12 ahrens

NAME      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
archive  8.25T  3.64T  4.61T        -         -     3%    44%  1.00x    ONLINE  -

I have plenty of free space in the destination. I have encryption enabled on the destination, but the send stream is not a raw/encrypted one. I'll have a look at bpftrace a bit later; I have to recompile my kernel with extra tracing enabled for that.
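For reference, a rough sketch of the pre-rebuild check; these are the usual upstream option names bpftrace's kprobe support depends on (not taken from this report), and the path assumes a Gentoo-style /usr/src/linux:

# all of these should be enabled for kprobe-based bpftrace scripts to attach
grep -E '^CONFIG_(BPF|BPF_SYSCALL|BPF_JIT|BPF_EVENTS|KPROBES|KPROBE_EVENTS|FTRACE)=' /usr/src/linux/.config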

IvanVolosyuk avatar Dec 17 '20 03:12 IvanVolosyuk

Here is the bpftrace output requested by @ahrens. Hopefully it helps figure out why so many TXGs are generated.

A single-snapshot send. The send stream piped to /dev/null gives 480Mb/s, so the receive is not bound by the send side.

zfs send -L -i @2020-12-17 ssd/root@send | pv -B100M | zfs recv archive/replica/root

60 seconds of bpftrace; the send stream was consumed at 41kb/s with 7Mb/s of writes according to zpool iostat. ZFS 2.0.0 compiled for kernel 5.4.80, destination encryption=on (aes-256-gcm), gzip-8.

$ timeout 60 bpftrace -e 'kprobe:txg_wait_synced { @[kstack] = count(); }'
Attaching 1 probe...
@[
    txg_wait_synced+1
    receive_object+424
    receive_writer_thread+473
    thread_generic_wrapper+111
    kthread+279
    ret_from_fork+31
]: 1
@[
    txg_wait_synced+1
    metaslab_enable+189
    vdev_autotrim_thread+908
    thread_generic_wrapper+111
    kthread+279
    ret_from_fork+31
]: 5
@[
    txg_wait_synced+1
    receive_object+2036
    receive_writer_thread+473
    thread_generic_wrapper+111
    kthread+279
    ret_from_fork+31
]: 114
@[
    txg_wait_synced+1
    receive_object+608
    receive_writer_thread+473
    thread_generic_wrapper+111
    kthread+279
    ret_from_fork+31
]: 164

IvanVolosyuk avatar Dec 17 '20 12:12 IvanVolosyuk

I decoded the references from offsets to line numbers; counts are for a 60-second window:

receive_object+424 -> dmu_recv.c:1658 # 1 time

		/*
		 * The object requested is currently an interior slot of a
		 * multi-slot dnode. This will be resolved when the next txg
		 * is synced out, since the send stream will have told us
		 * to free this slot when we freed the associated dnode
		 * earlier in the stream.
		 */
		txg_wait_synced(dmu_objset_pool(rwa->os), 0);

receive_object+2036 -> dmu_recv.c:1548 # 114 times

	/*
	 * The dmu does not currently support decreasing nlevels
	 * or changing the number of dnode slots on an object. For
	 * non-raw sends, this does not matter and the new object
	 * can just use the previous one's nlevels. For raw sends,
	 * however, the structure of the received dnode (including
	 * nlevels and dnode slots) must match that of the send
	 * side. Therefore, instead of using dmu_object_reclaim(),
	 * we must free the object completely and call
	 * dmu_object_claim_dnsize() instead.
	 */
	if ((rwa->raw && drro->drr_nlevels < doi->doi_indirection) ||
	    dn_slots != doi->doi_dnodesize >> DNODE_SHIFT) {
		err = dmu_free_long_object(rwa->os, drro->drr_object);
		if (err != 0)
			return (SET_ERROR(EINVAL));

		txg_wait_synced(dmu_objset_pool(rwa->os), 0);
		*object_to_hold = DMU_NEW_OBJECT;
	}

receive_object+608 -> dmu_recv.c:1698 # 164 times

	/*
	 * If this is a multi-slot dnode there is a chance that this
	 * object will expand into a slot that is already used by
	 * another object from the previous snapshot. We must free
	 * these objects before we attempt to allocate the new dnode.
	 */
	if (dn_slots > 1) {
		boolean_t need_sync = B_FALSE;

		for (uint64_t slot = drro->drr_object + 1;
		    slot < drro->drr_object + dn_slots;
		    slot++) {
			dmu_object_info_t slot_doi;

			err = dmu_object_info(rwa->os, slot, &slot_doi);
			if (err == ENOENT || err == EEXIST)
				continue;
			else if (err != 0)
				return (err);

			err = dmu_free_long_object(rwa->os, slot);
			if (err != 0)
				return (err);

			need_sync = B_TRUE;
		}

		if (need_sync)
			txg_wait_synced(dmu_objset_pool(rwa->os), 0);
	}

IvanVolosyuk avatar Dec 17 '20 13:12 IvanVolosyuk

Interesting, so it looks like you are receiving multi-slot dnodes. Have you changed the dnodesize property from its default of legacy?

ahrens avatar Dec 17 '20 19:12 ahrens

Damn it, I changed it to auto. My system uses neither xattrs nor ACLs. I think this change is widely recommended as a safe one. I have changed dnodesize back to legacy. Is there any way to get my send/recv performance back?

IvanVolosyuk avatar Dec 17 '20 23:12 IvanVolosyuk

The comments seem to indicate that this (txg_wait_synced) should happen if a multi-slot dnode overwrites a single-slot dnode. So I'd think that once you change dnodesize from auto (1K) back to legacy (512B), future incrementals should have OK performance. But I'm not an expert in this area, maybe @nedbass has some insight.
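For readers hitting the same thing, a hedged sketch of checking and reverting the property; the dataset names are the ones from this thread, and note that the change only affects dnodes allocated afterwards, so existing files keep their multi-slot dnodes until they are rewritten:

# see which datasets carry a non-legacy dnodesize and where it was set
zfs get -r -t filesystem dnodesize ssd archive

# revert to the default; only newly created dnodes are affected
zfs set dnodesize=legacy ssd/root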

ahrens avatar Dec 18 '20 05:12 ahrens

Thanks for helping me figure this out! I managed to get a 60x incremental send speed increase by making a clone and rsync'ing the data into it. I would imagine that the slow path in the code has a good reason and is hard to speed up. I wonder if the peculiar performance implications of changing dnodesize should be mentioned somewhere in the documentation. I changed it to auto a few months ago and didn't notice the implications until I made these massive changes. Someone else could shoot themselves in the foot the same way I did.

IvanVolosyuk avatar Dec 18 '20 11:12 IvanVolosyuk

The comments seem to indicate that this (txg_wait_synced) should happen if a multi-slot dnode overwrites a single-slot dnode. So I'd think that once you change dnodesize from auto (1K) back to legacy (512B), future incrementals should have OK performance. But I'm not an expert in this area, maybe @nedbass has some insight.

Sorry, I have one question: to restore the original performance, should dnodesize be changed on the source, the destination, or both? Thanks.

shodanshok avatar Jan 08 '21 07:01 shodanshok

Changing it on the source didn't make much difference for me, and changing it on the destination is, I think, irrelevant, as it will not change already-written dnode metadata. I had to rsync the data into a copy of the dataset with dnodesize=legacy and start replication afresh to mitigate this issue.

IvanVolosyuk avatar Jan 08 '21 07:01 IvanVolosyuk

Just to be sure I understand it right: you rsynced to a cloned dataset on the source machine? Thanks.

shodanshok avatar Jan 08 '21 10:01 shodanshok

Strictly speaking, I created an empty dataset on the destination, rsynced all the data into it locally from the bad dataset, and snapshotted it. After that I sent it back to the source. Once the new dataset was in sync on both sides, I set up incremental replication from source to destination and destroyed the old bad source dataset.
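In outline, the workaround described above looks roughly like the following; dataset names, mountpoints, and snapshot names are placeholders, not the exact commands used:

# on the destination pool: build a fresh copy with legacy dnodes
zfs create -o dnodesize=legacy archive/replica/root-new
rsync -aHAX /archive/replica/root/ /archive/replica/root-new/
zfs snapshot archive/replica/root-new@seed

# seed the source pool from the clean copy
zfs send archive/replica/root-new@seed | zfs recv ssd/root-new

# from then on, replicate incrementally from the reseeded source as usual
zfs snapshot ssd/root-new@next
zfs send -L -i @seed ssd/root-new@next | zfs recv archive/replica/root-new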

IvanVolosyuk avatar Jan 08 '21 14:01 IvanVolosyuk

But what to do if I want to keep dnodesize=auto?

woffs avatar Feb 19 '21 08:02 woffs

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Feb 19 '22 23:02 stale[bot]

I think this issue illustrates rather nicely why and how the stale bot is actually harmful to the community...

Here's an issue which, IMHO, makes dnodesize=auto undeployable for most use cases, yet the stale bot was done with it before any kind of conclusion was reached... it really adds further motivation to report issues next time... I mean no offense, but could you guys please reconsider the stale bot's engagement here in the OpenZFS project?

@behlendorf

thanks :)

snajpa avatar Oct 24 '22 16:10 snajpa

Reopening and tagging as not-stale.

behlendorf avatar Oct 27 '22 22:10 behlendorf

I wonder if zfs recv could set a temporary sync=disabled on the dataset being received, so it only has to sync once for each snapshot (which AFAIK sadly can't be avoided, as snapshots of a dataset need distinct createtxg)?

GregorKopka avatar Mar 15 '23 06:03 GregorKopka

All my datasets have sync=disabled, which doesn't help in this case.

IvanVolosyuk avatar Mar 15 '23 07:03 IvanVolosyuk

zfs-2.2.zip: I would like to report that this issue is still present on ZFS tag 2.2.0-rc4.

modinfo -F version zfs
2.2.0-rc4

To briefly explain my setup, I'm attempting to backup a Lustre MDT that uses ZFS as its backend OSD. The MDT is a ZFS mirror of 2xNVMes. I've seen one too many NVMe give up with little to no warning and I sleep better at night knowing a snapshot is being made to spinning rust to recover from.

The Lustre file system backs an academic HPC cluster, so user workflows (some of them badly behaved) can add/remove tens of thousands if not millions of files before center staff can steer the user in the correct direction. These types of metadata ops are, I believe, the perfect storm to trigger the problem reported earlier in this issue.

The Lustre docs recommend (or at least used to recommend) dnodesize=auto so that the (also recommended) xattr=sa setting reduces the chance of spill blocks being allocated, and so avoids extra IO contention.

To show the dnode size distribution across the Lustre MDT ZFS dataset, I ran zdb -dd against the MDT dataset on the backup zpool and counted the dnode sizes:

zdb -dd ARX/mds1.swan.hcc.unl.edu/mds1pool0/Swan-MDT0000 0:-1:f |
tr -s ' ' |
cut -f7 -d' ' |
sort |
uniq -c |
sort -nr
144344599 1K
  3756195 512

so ~2.53% of the files still have 512-byte (legacy) dnodes. I'm not sure why, maybe an internal Lustre MDT OSD llog or related file, but clearly the majority of the files use 1K dnodes, matching the current size for the dnodesize=auto setting in the source code. The setting has been in place since the zpool was created: zpool create ... -O dnodesize=auto -O xattr=sa ...

My issue presently is that I'm not able to get a daily snapshot received into the backup zpool, as it takes nearly a week to complete. The receive operation on the backup zpool is essentially blocked on IO from seemingly constant txg_wait_synced() calls.

For instance, here are the zstreamdump stats for a problematic daily snapshot from the production/origin zpool; it takes less than 11 minutes to send ~53GiB through zstreamdump:

time (/sbin/zfs send -i mds1pool0/Swan-MDT0000@daily_2023-09-15_00.00.00--7d mds1pool0/Swan-MDT0000@daily_2023-09-16_00.00.00--7d | zstreamdump)
BEGIN record
        hdrtype = 1
        features = 800004
        magic = 2f5bacbac
        creation_time = 65053651
        type = 2
        flags = 0xc
        toguid = b2421296ce9d7e3b
        fromguid = 24b5ab32c15232ed
        toname = mds1pool0/Swan-MDT0000@daily_2023-09-16_00.00.00--7d
        payloadlen = 0
END checksum = 5a170a968aede897/50d1f20389888d4a/6b2731cdf6424c8d/ff0d4088bf985300
SUMMARY:
        Total DRR_BEGIN records = 1 (0 bytes)
        Total DRR_END records = 1 (0 bytes)
        Total DRR_OBJECT records = 9491662 (3801688592 bytes)
        Total DRR_FREEOBJECTS records = 1900118 (0 bytes)
        Total DRR_WRITE records = 4111333 (40234523136 bytes)
        Total DRR_WRITE_BYREF records = 0 (0 bytes)
        Total DRR_WRITE_EMBEDDED records = 0 (0 bytes)
        Total DRR_FREE records = 16164165 (0 bytes)
        Total DRR_SPILL records = 3277086 (1678427648 bytes)
        Total records = 34944366
        Total payload size = 45714639376 (0xaa4ce0a10)
        Total header overhead = 10902642192 (0x289d91e10)
        Total stream length = 56617281568 (0xd2ea72820)

real    10m38.535s
user    0m19.931s
sys     9m18.427s

I set up a script on the receiving zpool to grab some /proc/spl/kstat/zfs/ files, call out to pidstat along with some bpftrace probes to see where the time was being spent for the above type of incremental stream, and finally collect the same /proc/spl/kstat/zfs/ files afterwards. One interesting finding from the dmu_tx proc file:

# diff -U0 *.dmu_tx
--- 1695250624.dmu_tx   2023-09-20 17:57:04.517341822 -0500
+++ 1695251824.dmu_tx   2023-09-20 18:17:04.814859970 -0500
@@ -1 +1 @@
-8 1 0x01 13 3536 7864007453667778 7884303621045915
+8 1 0x01 13 3536 7864007453667778 7885503921710390
@@ -3 +3 @@
-dmu_tx_assigned                 4    325638007
+dmu_tx_assigned                 4    329249935
echo 329249935-325638007 | bc -l
3611928

My bpftrace call from the script:

timeout 1200 bpftrace -e 'kprobe:txg_wait_synced,kprobe:dmu_tx_assign { @[kstack] = count(); }' > bpftrace

If I collect the counts for all the dmu_tx_assign call stacks from the bpftrace output:

# sed -n '/dmu_tx_assign/,/^\]:/{//,/$/p}' bpftrace |
grep '^]:' |
awk 'BEGIN {T=0} {T+=$2} END {print "T is " T}'
T is 3611878

So 3611928 vs 3611878 is close enough; the small difference makes sense since the dmu_tx proc file dumps happened just before and after the bpftrace run. In any case, it seems like receive_writer -> receive_object / dmu_free_long_object / dmu_free_long_range / flush_write_batch is where the sync load is coming from, i.e. the DRR_FREE/DRR_FREEOBJECTS records?
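To see how much wall-clock time those calls actually cost (rather than just how often they fire), a follow-up probe along these lines might help; this is a sketch, not part of the collection script described above:

# histogram of time spent blocked in txg_wait_synced(), in milliseconds
timeout 600 bpftrace -e '
kprobe:txg_wait_synced { @start[tid] = nsecs; }
kretprobe:txg_wait_synced /@start[tid]/ {
    @wait_ms = hist((nsecs - @start[tid]) / 1000000);
    delete(@start[tid]);
}'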

Since the MDT zpool is an NVMe mirrored setup, should I be concerned about the potential extra IO if I set dnodesize=legacy and just deal with spill blocks, so that snapshots can be received without all the sync load from multi-slot dnodes on the receiving zpool?
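As a data point for that trade-off, whether individual objects actually end up with spill blocks can be checked with zdb. A rough sketch: the dataset name is the one above, <object-number> is a placeholder for an object id taken from the zdb -dd listing, and SPILL_BLKPTR in the dnode flags indicates a spill block is in use:

# list object ids, then dump one dnode in detail and look for SPILL_BLKPTR
zdb -dd ARX/mds1.swan.hcc.unl.edu/mds1pool0/Swan-MDT0000 | head -n 40
zdb -dddd ARX/mds1.swan.hcc.unl.edu/mds1pool0/Swan-MDT0000 <object-number>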

Also, since I'm in a bind and getting further behind on getting an incremental snapshot through, what harm would there be if I rebuilt ZFS on the receiving node with all the txg_wait_synced() calls commented out in module/zfs/dmu_recv.c:receive_handle_existing_object()? All the dnodes should be 1K, so I'm not sure why all the per-txg syncs are needed, but I would appreciate an opinion. My only other option is to write full replication streams to a file to keep for emergencies, which also seems like a lot of data to constantly push over the network from the production Lustre MDT.

ghjs avatar Sep 21 '23 20:09 ghjs

A "feature" which requires txg_wait_synced this much should IMO be rolled back (as much as it can be, due to ondisk changes), as such a "feature" is deffective by design. Yes, there might be users of it, who don't require send/recv, but from the other side, as send/recv is a hallmark feature of ZFS for most of its users, multi-slot dnodes at the very least shouldn't be the default, if they're this much broken (still).

snajpa avatar Apr 30 '25 16:04 snajpa

And this feature is suggested by various guides.

IvanVolosyuk avatar May 01 '25 07:05 IvanVolosyuk

I believe I am affected by this. I have set dnodesize=auto on absolutely everything. Why? Because:

https://openzfs.github.io/openzfs-docs/Getting%20Started/Debian/Debian%20Bookworm%20Root%20on%20ZFS.html

darrenfreeman avatar May 24 '25 10:05 darrenfreeman

I can confirm that the receive performance is no better on version 2.3.2-1 from debian-testing.

One badly affected dataset is producing 9 txg/s and maximum IOPS on spinning rust, with only a few KB/s received, regardless of whether it's the above version, or 2.1.11-1+deb12u1 from debian-stable.

darrenfreeman avatar May 25 '25 14:05 darrenfreeman

I decoded the references from offsets to line numbers; counts are for a 60-second window: txg_wait_synced(dmu_objset_pool(rwa->os), 0);

@amotin Looking at #17434, do you think a similar optimization could be applied to some of these cases?

scineram avatar Jun 12 '25 15:06 scineram

I decoded the references from offsets to line numbers; counts are for a 60-second window: txg_wait_synced(dmu_objset_pool(rwa->os), 0);

Looking at #17434, do you think a similar optimization could be applied to some of these cases?

Yes, quite likely. I did a quick search while working on it. Some of the txg_wait_synced(, 0) calls do expect to include deferred frees, to get maximum stability and accurate space accounting, but I don't think receive is one of them. Though I haven't looked at it that closely.

amotin avatar Jun 12 '25 16:06 amotin

Yes, quite likely. I did a quick search while working on it. Some of the txg_wait_synced(, 0) calls do expect to include deferred frees, to get maximum stability and accurate space accounting, but I don't think receive is one of them. Though I haven't looked at it that closely.

@amotin Do you think it would be possible for receive to skip all/most of such checks? In the past I was hit by zfs_per_txg_dirty_frees_percent-related storms on a receive host, and I remember long receives for small deltas.

shodanshok avatar Jun 13 '25 09:06 shodanshok

That still implies multi-slot dnodes shouldn't be the default as the txg sync loop is still going to be there. Maybe not as bad, but still bad.

snajpa avatar Jun 13 '25 12:06 snajpa