
Posting checksum events fails with ENOMEM on FreeBSD

Open asomers opened this issue 2 years ago • 8 comments

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | FreeBSD |
| Distribution Version | 14.0-CURRENT |
| Kernel Version | bef81bc0aef (main as of 3-April-2023) |
| Architecture | amd64 |
| OpenZFS Version | zfs-2.1.99-FreeBSD_g57cfae4a2 zfs-kmod-2.1.99-FreeBSD_g431083f75 |

Describe the problem you're observing

On FreeBSD, devctl events must fit within a 1016-byte string buffer. If they are too large, they are silently dropped and never sent to userland. ZFS's ereport.fs.zfs.checksum events are far too large: 1862 bytes in my test case. So they never reach userland, which means that zfsd is unable to degrade a vdev that produces too many checksum errors.

Describe how to reproduce the problem

vim /etc/kyua/kyua.conf # Set test_suites.FreeBSD.disks
cd /usr/tests/sys/cddl/zfs/tests/zfsd
sudo kyua debug zfsd_test:zfsd_degrade_001_pos

Optionally, run a dtrace one-liner like the following; it will show various ZFS events being queued for userland, but no ereport.fs.zfs.checksum events.

sudo dtrace -n 'fbt:kernel:devctl_queue:entry {printf("%s", stringof(args[0]->dei_data));}'

Possible solutions

PR #14716 partially fixes the problem by making the report format less verbose. However, it isn't enough. We still need to trim another ~75 bytes from the report (the exact number depends on the size of the pool and vdev names). Here are some possibilities for how to fully fix it:

  • Combine the bad_cleared_histogram and bad_set_histogram into a single bad_bit_histogram. That would save about 150 bytes.
  • Widen the histogram buckets from one bit wide to two bits wide.
  • Simply remove the histogram fields. zpool-events(8) suggests that they are intended for use with IDE and parallel SCSI, which are obsolete.
  • Remove several other fields. But which ones? zio_err will probably always be zero. But I'm not sure which of the others can go.
  • Double the size of FreeBSD's buffer, at an uncertain performance cost.

Current Report format

Here's an example of a current report, including the changes from PR #14716.

 class=ereport.fs.zfs.checksum ena=96599319807790081 pool=testpool.1933 pool_guid=1236902063710799041 pool_state=0 pool_context=0 pool_failmode=wait vdev_guid=2774253874431514999 vdev_type=file vdev_path=/tmp/kyua.6Temlq/2/work/file1.1933 vdev_ashift=9 vdev_complete_ts=92124283803 vdev_delta_ts=46670 vdev_read_errors=0 vdev_write_errors=0 vdev_cksum_errors=20 vdev_delays=0 parent_guid=8090931855087882905 parent_type=raidz vdev_spare_guids= zio_err=0 zio_flags=1048752 zio_stage=4194304 zio_pipeline=65011712 zio_delay=0 zio_timestamp=0 zio_delta=0 zio_priority=4 zio_offset=1028608 zio_size=512 zio_objset=0 zio_object=0 zio_level=0 zio_blkid=4 bad_ranges=0000000000000200 bad_ranges_min_gap=8 bad_range_sets=0000061f bad_range_clears=000001f4 bad_set_histogram=1719161c1c1c101618171a151a1a19161e1c171d1816161c191f1a18192117191c131d171b1613151a171419161a1b1319101b14171b18151e191a1b141a1c17 bad_cleared_histogram=06090a0808070a0b020609060506090a01090a050a0a0509070609080d050d0607080d060507080c04070807070a0608020c080c080908040808090a05090a07 time=00000016806477050000000604157480 eid=62

asomers avatar Apr 04 '23 23:04 asomers

I personally have found the contents of the histogram buckets useful a couple of times, for telling that it was always a certain bit being mangled, or that all the bits were wrong (e.g. the result was noise). So I'd prefer not to lose them.

Could we change the interface moving forward so that we chunk up the events, rather than either bloating the per-message buffer with uncertain implications or dropping fields? It seems like it'd be more robust if we ever had occasion to add more data to those again, and failing to pass too-large messages through the "old" interface would be the status quo ante anyway...

rincebrain avatar Apr 05 '23 01:04 rincebrain

What about removing all of the histogram stuff from the event, but adding a custom dtrace probe to get it? If checksum events are rare, then the histogram probably won't be useful anyway. And if checksum events are frequent, then one can start dtracing after noticing checksum events.

asomers avatar Apr 05 '23 03:04 asomers

Removing useful information because FreeBSD has a tiny buffer seems like a poor tradeoff, to me. What are you concerned about, with increasing the buffer limit? Is it a very hot path for some use cases? Is it fixed-size copies every time, so everyone gets a larger copy penalty?

rincebrain avatar Apr 05 '23 08:04 rincebrain

Yes, it's a fixed size, so increasing it for one event increases it for all. And it's already pretty big, compared to what most events need (usually fewer than 200 bytes). How useful is the histogram, really? You said that you noticed a stuck bit sometimes. What kind of hardware was that on?

asomers avatar Apr 05 '23 12:04 asomers

If valid values go only up to 63, you could halve the length of the histogram messages by encoding each 0-63 value as a single Base64 character.

Precidata avatar Apr 05 '23 13:04 Precidata

The other "bad" idea that came to mind is, we have all these compression functions laying around...I wonder if any of them are sufficient to binpack this better, knowing a lot of them have fixed overheads...

rincebrain avatar Apr 05 '23 18:04 rincebrain

@asomers - could this issue be closed, since PR #14716 is in?

mcmilk avatar Apr 13 '23 09:04 mcmilk

No, because that PR was only a partial solution. The messages are still too long.

asomers avatar Apr 13 '23 14:04 asomers

@rincebrain this is still a problem. Would you be open to adding an off-by-default sysctl knob, something like vfs.zfs.checksum_event_verbose? When off, ZFS would omit the checksum histograms.

asomers avatar Jul 10 '23 22:07 asomers

Oh, interesting, I had assumed this went in back when 14716 did.

Having read the code more carefully, embarrassingly, I'm not actually sure I mind this any more, because what I was using was bad_set_bits/bad_cleared_bits, not the histogram.

I'm so sorry for derailing this over my incorrect perception.

rincebrain avatar Jul 10 '23 22:07 rincebrain

Great! I'll create a PR to remove the histogram fields. However, I now realize there's another problem too. For disks that aren't in RAIDZ arrays, ZFS adds a few other fields. They look like this: cksum_actual=4d458e2ca9781347e3b80ea566e095a42df54a8fcb91eb058f365662cfb23c73 cksum_algorithm=sha256 cksum_expected=e0c327ff9d388f51695151ff6f51d253ff865d718d6a514980ff67658f8d4be1. Those extra fields also cause the event to overflow the buffer. May I eliminate them, too? cksum_actual, in particular, should almost never be useful: for any algorithm other than fletcher4, a single-bit change in the input renders the checksum completely different.

asomers avatar Jul 11 '23 16:07 asomers

SGTM. Hopefully if someone else is going to miss them, they'll speak up.

rincebrain avatar Jul 11 '23 19:07 rincebrain