
checksum/filetest_002_pos test can fail

Open ryao opened this issue 3 years ago • 7 comments

System information

Type Version/Name
Distribution Name CentOS Stream
Distribution Version 8
Kernel Version 4.18.0-408.el8.x86_64
Architecture x86_64
OpenZFS Version e2ebf01941b04565deaca50521d1820f04dc588c

Describe the problem you're observing

When looking at the buildbot, I noticed that checksum/filetest_002_pos had failed:

http://build.zfsonlinux.org/builders/CentOS%20Stream%208%20x86_64%20%28TEST%29/builds/5815

This implies there is a bug somewhere.

The failure message "internal error: errors: List of errors unavailable: Invalid exchange" comes from zpool_get_errlog(). The kernel passed a checksum error back to us in response to zpool status -P -v testpool; the ioctl used is ZFS_IOC_ERROR_LOG, and the checksum error reaches us via errno.

There are two places in zfs_ioc_error_log() that can return an error: spa_open() and spa_get_errlog(). A quick test on my system shows that ZFS_IOC_POOL_STATS is issued first when running that zpool command, and it indirectly calls spa_open_common(), so spa_open() can be assumed not to have failed on the later ZFS_IOC_ERROR_LOG ioctl. That leaves spa_get_errlog().

Interestingly, a new SPA feature, SPA_FEATURE_HEAD_ERRLOG, introduced in 0409d3327371cef8a8c5886cb7530ded6f5f1091, adds a new code path from which we presumably received the error.

I have not debugged this further.

ryao avatar Sep 14 '22 16:09 ryao

@TulsiJain @gamanakis You might be interested in this.

ryao avatar Sep 14 '22 16:09 ryao

@behlendorf Is it possible to grep historical logs to see if that test ever failed on the buildbot before that patch was merged?

ryao avatar Sep 14 '22 16:09 ryao

I've only recently started seeing this occasional test failure in the logs, so that's a reasonable guess. Normally, we only track failures observed in the last 30 days; those can be found here: http://build.zfsonlinux.org/known-issues.html

behlendorf avatar Sep 14 '22 18:09 behlendorf

If I am reading that correctly, we are tracking failures in master, 2.0, and 2.1 separately, and this issue only occurs in master. That patch should only be in master, which would be consistent with the guess that it is the cause of the regression.

ryao avatar Sep 14 '22 18:09 ryao

I've been bothered by this test failure for quite some time. It's very sporadic.

youzhongyang avatar Sep 14 '22 19:09 youzhongyang

Until this is resolved, we may want to add this test to our known list of flaky tests in tests/test-runner/bin/zts-report.py.in.

behlendorf avatar Sep 16 '22 22:09 behlendorf

I will take a look.

gamanakis avatar Sep 18 '22 08:09 gamanakis

@ryao @youzhongyang the head_errlog feature has seen some significant updates lately. Would you mind giving master (as of 82ac409acc77935ae366b800ee7cefb14939bbae) a try?

gamanakis avatar May 05 '23 08:05 gamanakis