Posting checksum events fails with ENOMEM on FreeBSD
System information
| Type | Version/Name |
|---|---|
| Distribution Name | FreeBSD |
| Distribution Version | 14.0-CURRENT |
| Kernel Version | bef81bc0aef (main as of 3-April-2023) |
| Architecture | amd64 |
| OpenZFS Version | zfs-2.1.99-FreeBSD_g57cfae4a2 zfs-kmod-2.1.99-FreeBSD_g431083f75 |
Describe the problem you're observing
On FreeBSD, devctl events must fit within a 1016 byte string buffer. If they are too large, they will be silently dropped and never sent to userland. ZFS's ereport.fs.zfs.checksum events are far too large: 1862 bytes in my test case. So they never get sent to userland. This means that zfsd is unable to degrade a vdev that produces too many checksum errors.
Describe how to reproduce the problem
vim /etc/kyua/kyua.conf # Set test_suites.FreeBSD.disks
cd /usr/tests/sys/cddl/zfs/tests/zfsd
sudo kyua debug zfsd_test:zfsd_degrade_001_pos
Optionally, run a dtrace script like this, which will show various ZFS events get queued for userland, but not any ereport.fs.zfs.checksum events.
sudo dtrace -i 'fbt:kernel:devctl_queue:entry {printf("%s", stringof(args[0]->dei_data));}'
Possible solutions
PR #14716 partially fixes the problem by making the report format less verbose. However, it isn't enough. We still need to trim another ~75 bytes from the report (the exact number depends on the size of the pool and vdev names). Here are some possibilities for how to fully fix it:
- Combine the
bad_cleared_histogramandbad_set_histograminto a singlebad_bit_histogram. That would save about 150 bytes. - Widen the histogram buckets from one bit wide to two bits wide.
- Simply remove the histogram fields. zpool-events(8) suggests that they are intended for use with IDE and parallel SCSI, which are obsolete.
- Remove several other fields. But which ones?
zio_errwill probably always be zero. But I'm not sure which of the others can go. - Double the size of FreeBSD's buffer, at an uncertain performance cost
Current Report format
Here's an example of a current report, including the changes from PR #14716 .
class=ereport.fs.zfs.checksum ena=96599319807790081 pool=testpool.1933 pool_guid=1236902063710799041 pool_state=0 pool_context=0 pool_failmode=wait vdev_guid=2774253874431514999 vdev_type=file vdev_path=/tmp/kyua.6Temlq/2/work/file1.1933 vdev_ashift=9 vdev_complete_ts=92124283803 vdev_delta_ts=46670 vdev_read_errors=0 vdev_write_errors=0 vdev_cksum_errors=20 vdev_delays=0 parent_guid=8090931855087882905 parent_type=raidz vdev_spare_guids= zio_err=0 zio_flags=1048752 zio_stage=4194304 zio_pipeline=65011712 zio_delay=0 zio_timestamp=0 zio_delta=0 zio_priority=4 zio_offset=1028608 zio_size=512 zio_objset=0 zio_object=0 zio_level=0 zio_blkid=4 bad_ranges=0000000000000200 bad_ranges_min_gap=8 bad_range_sets=0000061f bad_range_clears=000001f4 bad_set_histogram=1719161c1c1c101618171a151a1a19161e1c171d1816161c191f1a18192117191c131d171b1613151a171419161a1b1319101b14171b18151e191a1b141a1c17 bad_cleared_histogram=06090a0808070a0b020609060506090a01090a050a0a0509070609080d050d0607080d060507080c04070807070a0608020c080c080908040808090a05090a07 time=00000016806477050000000604157480 eid=62
I personally have found the contents of the histogram buckets useful a couple of times, for telling it was always a certain bit mangled, or that all the bits were wrong (e.g. the result was noise). So I'd prefer to not lose them.
Could we change the interface moving forward so we chunk up the events, rather than either bloating the buffer per message with uncertain implications or dropping fields? It seems like it'd be more robust to if we ever had occasion to add more data to those again, and failing to pass too-large messages in the "old" interface would be status quo ante anyway...
What about removing all of the histogram stuff from the event, but adding a custom dtrace probe to get it? If checksum events are rare, then the histogram probably won't be useful anyway. And if checksum events are frequent, then one can start dtracing after noticing checksum events.
Removing useful information because FreeBSD has a tiny buffer seems like a poor tradeoff, to me. What are you concerned about, with increasing the buffer limit? Is it a very hot path for some use cases? Is it fixed-size copies every time, so everyone gets a larger copy penalty?
Yes, it's a fixed size, so increasing it for one event increases it for all. And it's already pretty big, compared to what most events need (usually fewer than 200 bytes). How useful is the histogram, really? You said that you noticed a stuck bit sometimes. What kind of hardware was that on?
If valid values goes only to 64, you could divide by two the length of the histogram messages by returning each 0-63 value as a Base64 character.
The other "bad" idea that came to mind is, we have all these compression functions laying around...I wonder if any of them are sufficient to binpack this better, knowing a lot of them have fixed overheads...
@asomers - could this issue be closed - cause PR #14716 is in?
No, because that PR was only a partial solution. The messages are still too long.
@rincebrain this is still a problem. Would you be open to adding an off-by-default sysctl knob, something like vfs.zfs.checksum_event_verbose? When off, ZFS would omit the checksum histograms.
Oh, interesting, I had assumed this went in back when 14716 did.
Having read the code more carefully, embarassingly, I'm not actually sure I mind this any more, because what I was using is bad_set_bits/bad_cleared_bits, not the histogram.
I'm so sorry for derailing this over my incorrect perception.
Great! I'll create a PR to remove the histogram fields. However, I now realize there's another problem too. For disks that aren't in RAIDZ arrays, ZFS adds a few other fields. They look like this cksum_actual=4d458e2ca9781347e3b80ea566e095a42df54a8fcb91eb058f365662cfb23c73 cksum_algorithm=sha256 cksum_expected=e0c327ff9d388f51695151ff6f51d253ff865d718d6a514980ff67658f8d4be1. And those extra fields also cause the event to overflow the buffer. May I eliminate those, too? cksum_actual, in particular, should almost never be useful. Apart from fletcher4, a single-bit change in the input will render the checksum completely different.
SGTM. Hopefully if someone else is going to miss them, they'll speak up.