checksum/filetest_002_pos test can fail
System information
| Type | Version/Name |
|---|---|
| Distribution Name | CentOS Stream |
| Distribution Version | 8 |
| Kernel Version | 4.18.0-408.el8.x86_64 |
| Architecture | x86_64 |
| OpenZFS Version | e2ebf01941b04565deaca50521d1820f04dc588c |
Describe the problem you're observing
When looking at the buildbot, I noticed that checksum/filetest_002_pos had failed:
http://build.zfsonlinux.org/builders/CentOS%20Stream%208%20x86_64%20%28TEST%29/builds/5815
This implies there is a bug somewhere.
The message "internal error: errors: List of errors unavailable: Invalid exchange" is from zpool_get_errlog(). "Invalid exchange" is the strerror() text for EBADE, which OpenZFS uses as ECKSUM on Linux, so the kernel passed a checksum error to us in response to zpool status -P -v testpool. The ioctl used is ZFS_IOC_ERROR_LOG, and we get the checksum error via errno.
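For anyone wanting to check the errno decoding, here is a minimal sketch. It assumes that OpenZFS on Linux reports ECKSUM using the EBADE errno value; the `describe` helper and the `ECKSUM` constant are made up for illustration and are not part of libzfs.

```python
import errno
import os

# Assumption: OpenZFS on Linux reports ECKSUM using the EBADE errno value.
# EBADE is Linux-specific, so fall back to its x86_64 Linux value (52)
# when the Python errno module does not define it on this platform.
ECKSUM = getattr(errno, "EBADE", 52)


def describe(err):
    """Return the symbolic name, number, and strerror() text for an errno."""
    name = errno.errorcode.get(err, "UNKNOWN")
    return f"{name} ({err}): {os.strerror(err)}"


if __name__ == "__main__":
    # On a Linux system the text should match the tail of the zpool message.
    print(describe(ECKSUM))
```

The text printed is whatever the local C library returns for that number, so it is only meaningful on Linux.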
There are two places in zfs_ioc_error_log() that can return an error: spa_open() and spa_get_errlog(). A quick test on my system shows that ZFS_IOC_POOL_STATS is issued first when running that zpool command, and it indirectly calls spa_open_common(), so spa_open() can be assumed not to have failed in the later ZFS_IOC_ERROR_LOG ioctl. That leaves spa_get_errlog().
Interestingly, there is a new SPA feature SPA_FEATURE_HEAD_ERRLOG that gives us a new code path from which we presumably received the error. That was introduced in 0409d3327371cef8a8c5886cb7530ded6f5f1091.
I have not debugged this further.
@TulsiJain @gamanakis You might be interested in this.
@behlendorf Is it possible to grep historical logs to see if that test ever failed on the buildbot before that patch was merged?
I've only recently started seeing this occasional test failure in the logs, so that's a reasonable guess. Normally, we only track failures observed in the last 30 days; they can be found here: http://build.zfsonlinux.org/known-issues.html
If I am reading that correctly, we are tracking failures in master, 2.0 and 2.1 separately and this issue only occurs in master. That patch should only be in master, which would be consistent with the guess that it is the cause of the regression.
I've been bothered by this test failure for quite some time. It's very sporadic.
Until this is resolved, we may want to add this test to our known list of flaky tests in tests/test-runner/bin/zts-report.py.in.
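As a rough illustration of what such an entry does, here is a self-contained sketch of a flaky-test table and the lookup that would treat a sporadic failure as expected. The names `MAYBE_FLAKY` and `classify` and the reason string are hypothetical; they are not copied from zts-report.py.in, whose actual layout should be checked before adding the test.

```python
# Hypothetical table: test name -> (result that may occur, reason).
MAYBE_FLAKY = {
    "checksum/filetest_002_pos": ("FAIL", "sporadic failure seen on the buildbot"),
}


def classify(test, result):
    """Classify one test result as 'ok', 'expected', or 'unexpected'."""
    if result == "PASS":
        return "ok"
    allowed = MAYBE_FLAKY.get(test)
    # A non-PASS result is tolerated only if the table lists that exact result.
    if allowed is not None and allowed[0] == result:
        return "expected"
    return "unexpected"


if __name__ == "__main__":
    for name, res in [
        ("checksum/filetest_002_pos", "FAIL"),
        ("checksum/filetest_002_pos", "PASS"),
        ("checksum/filetest_001_pos", "FAIL"),
    ]:
        print(name, res, classify(name, res))
```

Listing the test this way only keeps the sporadic failure from marking the run as failed; it does not address the underlying bug.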
I will take a look.
@ryao @youzhongyang the head_errlog feature has seen some significant updates lately. Would you mind giving master (as of 82ac409acc77935ae366b800ee7cefb14939bbae) a try?