
CHKSUM errors on a just-attached disk

Open rstrlcpy opened this issue 5 years ago • 18 comments

System information

Type Version/Name
Distribution Name Ubuntu Server
Distribution Version 18.04.2 LTS
Linux Kernel 4.15.0-48
Architecture x86_64
ZFS Version 0.7.5-1ubuntu16.4
SPL Version 0.7.5-1ubuntu1

Describe the problem you're observing

The issue was initially found on a Nexenta system, but it can be reproduced on Linux as well. We first thought it was a regression introduced by the integration of sequential resilvering, but it can still be reproduced with sequential resilvering disabled, only not as quickly. So it looks like an old, general issue.

There is code that handles unfinished resilvering, but it does not seem to cover all cases: a scrub can issue READ zios to a just-attached disk, which causes the CHKSUM errors.
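For context, ZFS records the txgs a vdev is missing in its DTL (dirty time log), and vdev_dtl_contains() answers whether a given txg range is missing on that vdev. The following is a minimal Python model of that query, illustrative only: the real code uses a range tree, and the txg numbers here are made up.

```python
class DTL:
    """Toy model of a vdev's DTL_MISSING: a list of [start, end) txg ranges."""

    def __init__(self):
        self.ranges = []

    def add(self, txg_start, txg_end):
        # Record that data born in [txg_start, txg_end) is missing on this vdev.
        self.ranges.append((txg_start, txg_end))

    def contains(self, txg, size=1):
        # Analogous to vdev_dtl_contains(): does [txg, txg + size) overlap
        # any missing range?
        return any(s < txg + size and txg < e for s, e in self.ranges)


# A freshly attached mirror child is missing everything written before the
# attach; the upper bound is illustrative.
dtl = DTL()
dtl.add(0, 2473)
assert dtl.contains(100)      # scrub reads of old data hit missing blocks
assert not dtl.contains(2473)  # data born after the attach is fine
```

Until the resilver completes and the missing range is cleared, any read of old data from this child can only return garbage, which `zpool status` accounts as a CKSUM error.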

Describe how to reproduce the problem

Create two RAM disks (200 MB each):

modprobe brd rd_nr=2 rd_size=204800

Create a mirrored pool on the disks:

zpool create -o cachefile=none rtest mirror /dev/ram0 /dev/ram1

Run the following three scripts in parallel:

while true; do
        dd if=/dev/urandom of=/rtest/f bs=1M count=90
done

while true; do
        zpool scrub rtest
done

while true; do
        zpool detach rtest ram1
        zpool attach rtest ram0 /dev/ram1
        sleep 10
done

Monitor the status of the pool:

zpool status rtest 1

Include any warning/errors/backtraces from the system logs

  pool: rtest
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub in progress since Mon Apr 29 20:39:50 2019
        87.5M scanned out of 88.0M at 87.5M/s, 0h0m to go
        0B repaired, 99.43% done
config:

        NAME        STATE     READ WRITE CKSUM
        rtest       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ram0    ONLINE       0     0     0
            ram1    ONLINE       0     0    10

errors: No known data errors

rstrlcpy avatar Apr 29 '19 21:04 rstrlcpy

Hi Roman, I might have another reproducer for a similar issue. Can you add the ereport?

richardelling avatar Apr 29 '19 21:04 richardelling

ereport? What is it?

rstrlcpy avatar Apr 30 '19 13:04 rstrlcpy

FMA ereport. On illumos, fmdump -eV; on ZoL, zpool events -v

richardelling avatar Apr 30 '19 18:04 richardelling

# zpool status rtest 1
  pool: rtest
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 0h0m with 0 errors on Tue Apr 30 19:29:46 2019
config:

        NAME        STATE     READ WRITE CKSUM
        rtest       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ram0    ONLINE       0     0     0
            ram1    ONLINE       0     0    47

errors: No known data errors

There are no error events in the report:


TIME                           CLASS
Apr 30 2019 19:29:42.878638170 sysevent.fs.zfs.scrub_start
        version = 0x0
        class = "sysevent.fs.zfs.scrub_start"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a226 0x345ef45a 
        eid = 0x205

Apr 30 2019 19:29:42.878638170 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "func=1 mintxg=0 maxtxg=2473"
        history_internal_name = "scan setup"
        history_txg = 0x9a9
        history_time = 0x5cc8a226
        time = 0x5cc8a226 0x345ef45a 
        eid = 0x206

Apr 30 2019 19:29:43.038657548 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x9aa
        history_time = 0x5cc8a227
        time = 0x5cc8a227 0x24dde0c 
        eid = 0x207

Apr 30 2019 19:29:43.038657548 sysevent.fs.zfs.scrub_finish
        version = 0x0
        class = "sysevent.fs.zfs.scrub_finish"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a227 0x24dde0c 
        eid = 0x208

Apr 30 2019 19:29:43.090663852 sysevent.fs.zfs.scrub_start
        version = 0x0
        class = "sysevent.fs.zfs.scrub_start"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a227 0x5676bac 
        eid = 0x209

Apr 30 2019 19:29:43.090663852 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "func=1 mintxg=0 maxtxg=2476"
        history_internal_name = "scan setup"
        history_txg = 0x9ac
        history_time = 0x5cc8a227
        time = 0x5cc8a227 0x5676bac 
        eid = 0x20a

Apr 30 2019 19:29:43.270685676 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x9ad
        history_time = 0x5cc8a227
        time = 0x5cc8a227 0x102255ec 
        eid = 0x20b

Apr 30 2019 19:29:43.270685676 sysevent.fs.zfs.scrub_finish
        version = 0x0
        class = "sysevent.fs.zfs.scrub_finish"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a227 0x102255ec 
        eid = 0x20c

Apr 30 2019 19:29:43.322691981 sysevent.fs.zfs.scrub_start
        version = 0x0
        class = "sysevent.fs.zfs.scrub_start"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a227 0x133be38d 
        eid = 0x20d

Apr 30 2019 19:29:43.322691981 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "func=1 mintxg=0 maxtxg=2479"
        history_internal_name = "scan setup"
        history_txg = 0x9af
        history_time = 0x5cc8a227
        time = 0x5cc8a227 0x133be38d 
        eid = 0x20e

Apr 30 2019 19:29:43.490712349 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x9b0
        history_time = 0x5cc8a227
        time = 0x5cc8a227 0x1d3fad1d 
        eid = 0x20f

Apr 30 2019 19:29:43.490712349 sysevent.fs.zfs.scrub_finish
        version = 0x0
        class = "sysevent.fs.zfs.scrub_finish"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a227 0x1d3fad1d 
        eid = 0x210

Apr 30 2019 19:29:43.554720109 sysevent.fs.zfs.scrub_start
        version = 0x0
        class = "sysevent.fs.zfs.scrub_start"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a227 0x21105b6d 
        eid = 0x211

Apr 30 2019 19:29:43.562721079 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "func=1 mintxg=0 maxtxg=2486"
        history_internal_name = "scan setup"
        history_txg = 0x9b6
        history_time = 0x5cc8a227
        time = 0x5cc8a227 0x218a7137 
        eid = 0x212

Apr 30 2019 19:29:43.742742902 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x9b7
        history_time = 0x5cc8a227
        time = 0x5cc8a227 0x2c455b76 
        eid = 0x213

Apr 30 2019 19:29:43.742742902 sysevent.fs.zfs.scrub_finish
        version = 0x0
        class = "sysevent.fs.zfs.scrub_finish"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a227 0x2c455b76 
        eid = 0x214

Apr 30 2019 19:29:43.766745812 sysevent.fs.zfs.scrub_start
        version = 0x0
        class = "sysevent.fs.zfs.scrub_start"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a227 0x2db39cd4 
        eid = 0x215

Apr 30 2019 19:29:43.766745812 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "func=1 mintxg=0 maxtxg=2489"
        history_internal_name = "scan setup"
        history_txg = 0x9b9
        history_time = 0x5cc8a227
        time = 0x5cc8a227 0x2db39cd4 
        eid = 0x216

Apr 30 2019 19:29:43.958769091 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x9ba
        history_time = 0x5cc8a227
        time = 0x5cc8a227 0x3925a7c3 
        eid = 0x217

Apr 30 2019 19:29:43.958769091 sysevent.fs.zfs.scrub_finish
        version = 0x0
        class = "sysevent.fs.zfs.scrub_finish"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a227 0x3925a7c3 
        eid = 0x218

Apr 30 2019 19:29:44.002774424 sysevent.fs.zfs.scrub_start
        version = 0x0
        class = "sysevent.fs.zfs.scrub_start"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a228 0x2a5598 
        eid = 0x219

Apr 30 2019 19:29:44.002774424 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "func=1 mintxg=0 maxtxg=2492"
        history_internal_name = "scan setup"
        history_txg = 0x9bc
        history_time = 0x5cc8a228
        time = 0x5cc8a228 0x2a5598 
        eid = 0x21a

Apr 30 2019 19:29:44.222801136 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x9bd
        history_time = 0x5cc8a228
        time = 0x5cc8a228 0xd47acf0 
        eid = 0x21b

Apr 30 2019 19:29:44.222801136 sysevent.fs.zfs.scrub_finish
        version = 0x0
        class = "sysevent.fs.zfs.scrub_finish"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a228 0xd47acf0 
        eid = 0x21c

Apr 30 2019 19:29:44.250804535 sysevent.fs.zfs.scrub_start
        version = 0x0
        class = "sysevent.fs.zfs.scrub_start"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a228 0xef2f937 
        eid = 0x21d

Apr 30 2019 19:29:44.250804535 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "func=1 mintxg=0 maxtxg=2495"
        history_internal_name = "scan setup"
        history_txg = 0x9bf
        history_time = 0x5cc8a228
        time = 0x5cc8a228 0xef2f937 
        eid = 0x21e

Apr 30 2019 19:29:44.438827360 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x9c0
        history_time = 0x5cc8a228
        time = 0x5cc8a228 0x1a27f960 
        eid = 0x21f

Apr 30 2019 19:29:44.438827360 sysevent.fs.zfs.scrub_finish
        version = 0x0
        class = "sysevent.fs.zfs.scrub_finish"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a228 0x1a27f960 
        eid = 0x220

Apr 30 2019 19:29:44.474831731 sysevent.fs.zfs.scrub_start
        version = 0x0
        class = "sysevent.fs.zfs.scrub_start"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a228 0x1c4d5b73 
        eid = 0x221

Apr 30 2019 19:29:44.474831731 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "func=1 mintxg=0 maxtxg=2498"
        history_internal_name = "scan setup"
        history_txg = 0x9c2
        history_time = 0x5cc8a228
        time = 0x5cc8a228 0x1c4d5b73 
        eid = 0x222

Apr 30 2019 19:29:44.662854556 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x9c3
        history_time = 0x5cc8a228
        time = 0x5cc8a228 0x27825b9c 
        eid = 0x223

Apr 30 2019 19:29:44.662854556 sysevent.fs.zfs.scrub_finish
        version = 0x0
        class = "sysevent.fs.zfs.scrub_finish"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a228 0x27825b9c 
        eid = 0x224

Apr 30 2019 19:29:44.690857955 sysevent.fs.zfs.scrub_start
        version = 0x0
        class = "sysevent.fs.zfs.scrub_start"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a228 0x292da7e3 
        eid = 0x225

Apr 30 2019 19:29:44.690857955 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "func=1 mintxg=0 maxtxg=2501"
        history_internal_name = "scan setup"
        history_txg = 0x9c5
        history_time = 0x5cc8a228
        time = 0x5cc8a228 0x292da7e3 
        eid = 0x226

Apr 30 2019 19:29:44.874880295 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x9c6
        history_time = 0x5cc8a228
        time = 0x5cc8a228 0x34259d27 
        eid = 0x227

Apr 30 2019 19:29:44.874880295 sysevent.fs.zfs.scrub_finish
        version = 0x0
        class = "sysevent.fs.zfs.scrub_finish"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a228 0x34259d27 
        eid = 0x228

Apr 30 2019 19:29:44.930887093 sysevent.fs.zfs.scrub_start
        version = 0x0
        class = "sysevent.fs.zfs.scrub_start"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a228 0x377c35b5 
        eid = 0x229

Apr 30 2019 19:29:44.930887093 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "func=1 mintxg=0 maxtxg=2506"
        history_internal_name = "scan setup"
        history_txg = 0x9ca
        history_time = 0x5cc8a228
        time = 0x5cc8a228 0x377c35b5 
        eid = 0x22a

Apr 30 2019 19:29:45.122910444 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x9cb
        history_time = 0x5cc8a229
        time = 0x5cc8a229 0x75376ec 
        eid = 0x22b

Apr 30 2019 19:29:45.122910444 sysevent.fs.zfs.scrub_finish
        version = 0x0
        class = "sysevent.fs.zfs.scrub_finish"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a229 0x75376ec 
        eid = 0x22c

Apr 30 2019 19:29:45.170916288 sysevent.fs.zfs.scrub_start
        version = 0x0
        class = "sysevent.fs.zfs.scrub_start"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a229 0xa2ff9c0 
        eid = 0x22d

Apr 30 2019 19:29:45.170916288 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "func=1 mintxg=0 maxtxg=2509"
        history_internal_name = "scan setup"
        history_txg = 0x9cd
        history_time = 0x5cc8a229
        time = 0x5cc8a229 0xa2ff9c0 
        eid = 0x22e

Apr 30 2019 19:29:45.390943072 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x9ce
        history_time = 0x5cc8a229
        time = 0x5cc8a229 0x174d5160 
        eid = 0x22f

Apr 30 2019 19:29:45.390943072 sysevent.fs.zfs.scrub_finish
        version = 0x0
        class = "sysevent.fs.zfs.scrub_finish"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a229 0x174d5160 
        eid = 0x230

Apr 30 2019 19:29:45.470952811 sysevent.fs.zfs.scrub_start
        version = 0x0
        class = "sysevent.fs.zfs.scrub_start"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a229 0x1c122b6b 
        eid = 0x231

Apr 30 2019 19:29:45.470952811 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "func=1 mintxg=0 maxtxg=2516"
        history_internal_name = "scan setup"
        history_txg = 0x9d4
        history_time = 0x5cc8a229
        time = 0x5cc8a229 0x1c122b6b 
        eid = 0x232

Apr 30 2019 19:29:45.670977160 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x9d5
        history_time = 0x5cc8a229
        time = 0x5cc8a229 0x27fe4c88 
        eid = 0x233

Apr 30 2019 19:29:45.670977160 sysevent.fs.zfs.scrub_finish
        version = 0x0
        class = "sysevent.fs.zfs.scrub_finish"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a229 0x27fe4c88 
        eid = 0x234

Apr 30 2019 19:29:45.694980082 sysevent.fs.zfs.scrub_start
        version = 0x0
        class = "sysevent.fs.zfs.scrub_start"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a229 0x296c8df2 
        eid = 0x235

Apr 30 2019 19:29:45.694980082 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "func=1 mintxg=0 maxtxg=2519"
        history_internal_name = "scan setup"
        history_txg = 0x9d7
        history_time = 0x5cc8a229
        time = 0x5cc8a229 0x296c8df2 
        eid = 0x236

Apr 30 2019 19:29:45.883002970 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x9d8
        history_time = 0x5cc8a229
        time = 0x5cc8a229 0x34a18e5a 
        eid = 0x237

Apr 30 2019 19:29:45.883002970 sysevent.fs.zfs.scrub_finish
        version = 0x0
        class = "sysevent.fs.zfs.scrub_finish"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a229 0x34a18e5a 
        eid = 0x238

Apr 30 2019 19:29:45.939009788 sysevent.fs.zfs.scrub_start
        version = 0x0
        class = "sysevent.fs.zfs.scrub_start"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a229 0x37f826fc 
        eid = 0x239

Apr 30 2019 19:29:45.939009788 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "func=1 mintxg=0 maxtxg=2525"
        history_internal_name = "scan setup"
        history_txg = 0x9dd
        history_time = 0x5cc8a229
        time = 0x5cc8a229 0x37f826fc 
        eid = 0x23a

Apr 30 2019 19:29:46.123032209 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x9de
        history_time = 0x5cc8a22a
        time = 0x5cc8a22a 0x7555291 
        eid = 0x23b

Apr 30 2019 19:29:46.123032209 sysevent.fs.zfs.scrub_finish
        version = 0x0
        class = "sysevent.fs.zfs.scrub_finish"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a22a 0x7555291 
        eid = 0x23c

Apr 30 2019 19:29:46.151035622 sysevent.fs.zfs.scrub_start
        version = 0x0
        class = "sysevent.fs.zfs.scrub_start"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a22a 0x9009ee6 
        eid = 0x23d

Apr 30 2019 19:29:46.151035622 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "func=1 mintxg=0 maxtxg=2528"
        history_internal_name = "scan setup"
        history_txg = 0x9e0
        history_time = 0x5cc8a22a
        time = 0x5cc8a22a 0x9009ee6 
        eid = 0x23e

Apr 30 2019 19:29:46.335058054 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x9e1
        history_time = 0x5cc8a22a
        time = 0x5cc8a22a 0x13f89486 
        eid = 0x23f

Apr 30 2019 19:29:46.335058054 sysevent.fs.zfs.scrub_finish
        version = 0x0
        class = "sysevent.fs.zfs.scrub_finish"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a22a 0x13f89486 
        eid = 0x240

Apr 30 2019 19:29:46.371062443 sysevent.fs.zfs.scrub_start
        version = 0x0
        class = "sysevent.fs.zfs.scrub_start"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a22a 0x161df6ab 
        eid = 0x241

Apr 30 2019 19:29:46.371062443 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "func=1 mintxg=0 maxtxg=2531"
        history_internal_name = "scan setup"
        history_txg = 0x9e3
        history_time = 0x5cc8a22a
        time = 0x5cc8a22a 0x161df6ab 
        eid = 0x242

Apr 30 2019 19:29:46.563085850 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "zol-devel"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x9e4
        history_time = 0x5cc8a22a
        time = 0x5cc8a22a 0x2190021a 
        eid = 0x243

Apr 30 2019 19:29:46.563085850 sysevent.fs.zfs.scrub_finish
        version = 0x0
        class = "sysevent.fs.zfs.scrub_finish"
        pool = "rtest"
        pool_guid = 0x4888d3300934ecba
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5cc8a22a 0x2190021a 
        eid = 0x244

rstrlcpy avatar Apr 30 '19 19:04 rstrlcpy

!?! Every checksum error should also generate an ereport. Does your build include 2bbec1c91?

richardelling avatar Apr 30 '19 21:04 richardelling

It seems the change is part of the 0.8.x release, but the test system is running 0.7.5. I will build a 0.8.x release and repeat the test.

rstrlcpy avatar May 01 '19 14:05 rstrlcpy

I've built 0.8.0-rc4. The issue is the same, and there are no related events.

rstrlcpy avatar May 06 '19 17:05 rstrlcpy

I can reproduce @Ramzec's results with 0.8.0-rc4 on an Ubuntu 18 VM. I also didn't see any checksum events:

hutter@ubuntu18:~/zfs$ ./cmd/zpool/zpool status
  pool: rtest
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub in progress since Thu Feb 21 20:18:09 2019
	87.3M scanned at 87.3M/s, 22.2M issued at 22.2M/s, 88.5M total
	0B repaired, 25.02% done, 0 days 00:00:02 to go
config:

	NAME        STATE     READ WRITE CKSUM
	rtest       ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    ram0    ONLINE       0     0   162
	    ram1    ONLINE       0     0    54

errors: No known data errors
hutter@ubuntu18:~/zfs$ ./cmd/zpool/zpool events
TIME                           CLASS
cannot get event: permission denied
hutter@ubuntu18:~/zfs$ sudo ./cmd/zpool/zpool events
[sudo] password for hutter: 
TIME                           CLASS
dropped 285 events
Feb 21 2019 20:18:22.900000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:22.900000000 sysevent.fs.zfs.scrub_finish
Feb 21 2019 20:18:22.924000000 sysevent.fs.zfs.scrub_start
Feb 21 2019 20:18:22.924000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.036000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.036000000 sysevent.fs.zfs.scrub_finish
Feb 21 2019 20:18:23.056000000 sysevent.fs.zfs.scrub_start
Feb 21 2019 20:18:23.056000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.116000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.116000000 sysevent.fs.zfs.scrub_finish
Feb 21 2019 20:18:23.140000000 sysevent.fs.zfs.scrub_start
Feb 21 2019 20:18:23.140000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.244000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.244000000 sysevent.fs.zfs.scrub_finish
Feb 21 2019 20:18:23.288000000 sysevent.fs.zfs.scrub_start
Feb 21 2019 20:18:23.292000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.428000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.428000000 sysevent.fs.zfs.scrub_finish
Feb 21 2019 20:18:23.468000000 sysevent.fs.zfs.scrub_start
Feb 21 2019 20:18:23.468000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.572000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.572000000 sysevent.fs.zfs.scrub_finish
Feb 21 2019 20:18:23.600000000 sysevent.fs.zfs.scrub_start
Feb 21 2019 20:18:23.600000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.720000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.720000000 sysevent.fs.zfs.scrub_finish
Feb 21 2019 20:18:23.764000000 sysevent.fs.zfs.scrub_start
Feb 21 2019 20:18:23.764000000 sysevent.fs.zfs.history_event
Feb 21 2019 20:18:23.908000000 sysevent.fs.zfs.history_event
...

tonyhutter avatar May 14 '19 00:05 tonyhutter

Here is what is known so far: a scrub can issue a ZIO to a disk that has only just been attached, which causes the described CHKSUM errors. To address this I implemented the following fix, which prevents ZFS from starting a scrub ZIO against a vdev that has missing data:

diff --git a/usr/src/uts/common/fs/zfs/vdev_mirror.c b/usr/src/uts/common/fs/zfs/vdev_mirror.c
index 4a869c9ea1..4843d2f7c8 100644
--- a/usr/src/uts/common/fs/zfs/vdev_mirror.c
+++ b/usr/src/uts/common/fs/zfs/vdev_mirror.c
@@ -445,6 +445,14 @@ vdev_mirror_io_start(zio_t *zio)
                                        continue;
                                }

+                               /*
+                                * The vdev may have been attached recently
+                                * and its resilver may not have finished,
+                                * so skip it to avoid CHKSUM errors.
+                                */
+                               if (vdev_dtl_contains(mc->mc_vd, DTL_MISSING, zio->io_txg, 1))
+                                       continue;
+
                                if (spec_case && c != target)
                                        continue;

I also found two code blocks where ZFS checks whether any vdev requires resilvering. The first switches the code path from SCRUB to RESILVER when a vdev needs resilvering; the second does not do that switch. The following change fixes the issue:

diff --git a/usr/src/uts/common/fs/zfs/dsl_scan.c b/usr/src/uts/common/fs/zfs/dsl_scan.c
index f39afbb4cb..6950822846 100644
--- a/usr/src/uts/common/fs/zfs/dsl_scan.c
+++ b/usr/src/uts/common/fs/zfs/dsl_scan.c
@@ -470,6 +474,7 @@ dsl_scan_setup_sync(void *arg, dmu_tx_t *tx)
                    &scn->scn_phys.scn_min_txg, &scn->scn_phys.scn_max_txg)) {
                        spa_event_notify(spa, NULL, NULL,
                            ESC_ZFS_RESILVER_START);
+                       scn->scn_phys.scn_func = POOL_SCAN_RESILVER;
                } else {
                        spa_event_notify(spa, NULL, NULL, ESC_ZFS_SCRUB_START);
                }
@@ -506,7 +511,8 @@ dsl_scan_setup_sync(void *arg, dmu_tx_t *tx)

        spa_history_log_internal(spa, "scan setup", tx,
            "func=%u mintxg=%llu maxtxg=%llu",
-           *funcp, scn->scn_phys.scn_min_txg, scn->scn_phys.scn_max_txg);
+           scn->scn_phys.scn_func, scn->scn_phys.scn_min_txg,
+           scn->scn_phys.scn_max_txg);
 }

 /* ARGSUSED */

Unfortunately these changes didn't help: I still saw CHKSUM errors, just on another vdev. Later I found that all of the CHKSUM errors were related to blocks created in the same TXG that was currently syncing.

Finally I implemented the following solution:

diff --git a/usr/src/uts/common/fs/zfs/dsl_scan.c b/usr/src/uts/common/fs/zfs/dsl_scan.c
index f39afbb4cb..6950822846 100644
--- a/usr/src/uts/common/fs/zfs/dsl_scan.c
+++ b/usr/src/uts/common/fs/zfs/dsl_scan.c
@@ -428,6 +428,10 @@ dsl_scan_setup_check(void *arg, dmu_tx_t *tx)
        if (dsl_scan_is_running(scn))
                return (SET_ERROR(EBUSY));

+       if (dmu_tx_is_syncing(tx) && scn->scn_restart_txg != 0 &&
+           scn->scn_restart_txg > tx->tx_txg)
+               return (SET_ERROR(EAGAIN));
+
        return (0);
 }

With this change, a scrub started by a user is rescheduled while any vdev still requires resilvering.

With the last change, my test VM ran for more than 2 days without any CHKSUM errors.
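For long soak runs like this, a small helper that detects the first CKSUM error can save time. A sketch (the pool and ram-disk names match the reproducer in the issue description; the function names are made up):

```shell
# Sum the CKSUM column (5th field) for the ram-disk leaf vdevs of the
# test pool; a non-zero result means the bug reproduced.
check_cksum() {
    zpool status "$1" | awk '$1 ~ /^ram/ { sum += $5 } END { print sum + 0 }'
}

# Poll once a second and stop as soon as an error shows up.
watch_for_errors() {
    while [ "$(check_cksum "$1")" = "0" ]; do
        sleep 1
    done
    echo "CKSUM errors detected on pool $1"
}
```

Running `watch_for_errors rtest` alongside the three reproducer loops turns the manual `zpool status rtest 1` monitoring into an automatic stop condition.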

Any ideas or thoughts?

rstrlcpy avatar May 14 '19 20:05 rstrlcpy

@tonyhutter pointed out that I may have had a similar problem, first described on the mailing list. I am using CentOS 7.6 with kernel 3.10.0-957.12.1.el7.x86_64 and ZFS/SPL version 0.7.13-1.

Yesterday I replaced a failed disk at path pci-0000:02:00.1-ata-2.0. As this chassis is not hot-pluggable, I powered off the machine and replaced the disk. On restart, with autoreplace=on, zed immediately replaced the disk and a resilver started.

A zpool status issued just after booting showed 1.15K CKSUM errors on the replaced disk, pci-0000:02:00.1-ata-2.0, which was marked as DEGRADED. This surprised me: the disk was brand new, with absolutely no ZFS labels, so how/why did ZFS try to access it?

A subsequent scrub found no errors, but the CKSUM count remained at 1.15K until I issued a "zpool clear tank". It seems as if ZFS tried to read from the replaced disk (during pool import?), found it "corrupted" (i.e. containing no valid data) and marked it as degraded.

I failed to replicate the issue in a VM, though; maybe it is timing-sensitive (i.e. there is a race condition)? Thanks.
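The ram-disk reproducer from the top of this issue may be easier to drive in a VM than a physical replace, since the race window can be hit repeatedly. A bounded sketch of its detach/attach loop (pool and device names follow the original report; the function name is made up, and the dd and scrub loops from the issue description should run in parallel to widen the window):

```shell
# Run N detach/attach cycles against the ram-disk mirror, as in the
# original reproducer's third script, then stop instead of looping forever.
reattach_cycle() {
    pool="$1"; n="$2"
    i=0
    while [ "$i" -lt "$n" ]; do
        zpool detach "$pool" ram1 || return 1
        zpool attach "$pool" ram0 /dev/ram1 || return 1
        sleep 10
        i=$((i + 1))
    done
}
```

Something like `reattach_cycle rtest 100` gives a fixed-length run whose `zpool status` can be checked afterwards.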

shodanshok avatar May 16 '19 17:05 shodanshok

I'm seeing very similar behavior since upgrading to 0.8.2-1. One of my zpools, with 10TB drives less than a year old, reported checksum errors on all drives right after the upgrade. (No errors ever before 0.8.2-1.) I was also experiencing the restarting-resilver issue. Doing some digging, I saw a tip about stopping zed. Once I stopped zed, all my problems went away.

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Oct 25 21:28:17 2019
        72.8G scanned at 9.10G/s, 64.9G issued at 8.11G/s, 29.1T total
        322M resilvered, 0.22% done, 0 days 01:01:13 to go
config:

        NAME                                   STATE     READ WRITE CKSUM
        tank                                   DEGRADED     0     0     0
          raidz2-0                             DEGRADED     0     0     0
            ata-ST10000NM0086-2AA101_ZA28MD7E  DEGRADED     0     0   415  too many errors
            ata-ST10000NM0086-2AA101_ZA28MCNG  DEGRADED     0     0    64  too many errors
            ata-ST10000NM0086-2AA101_ZA25NSFA  DEGRADED     0     0   271  too many errors
            ata-ST10000NM0086-2AA101_ZA28MD0H  DEGRADED     0     0   704  too many errors  (resilvering)
            ata-ST10000NM0086-2AA101_ZA26F2MG  DEGRADED     0     0    34  too many errors  (resilvering)
            ata-ST10000NM0086-2AA101_ZA26E1X0  DEGRADED     0     0    52  too many errors  (resilvering)
            ata-ST10000NM0086-2AA101_ZA26E1SS  DEGRADED     0     0    57  too many errors
            ata-ST10000NM0086-2AA101_ZA28M62Z  ONLINE       0     0     4
            ata-ST10000NM0086-2AA101_ZA26DN39  DEGRADED     0     0    59  too many errors  (resilvering)
            ata-ST10000NM0086-2AA101_ZA26E2M3  DEGRADED     0     0    44  too many errors

CentOS 7.7, kernel 3.10.0-1062.4.1.el7.x86_64.

mscottie avatar Oct 27 '19 19:10 mscottie

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 28 '20 00:10 stale[bot]

anti-stale-bot-post

mscottie avatar Oct 28 '20 16:10 mscottie

Hi everyone.

I'm on Proxmox VE 6.3 with ZFS 0.8.5 and kernel 5.4.73, and I think I hit the same or a similar bug.

First I thought the RAM or the (brand-new) disks were bad, so I tested each component separately without finding any issue. I then decided to test on a different server to rule out errors related to the storage components, but I still got errors. I ran more tests on both servers with HDDs, SSDs, and RAM disks, and all showed the same problem.

~ # modprobe brd rd_nr=2 rd_size=10485760
~ # zpool create test2 mirror /dev/ram0 /dev/ram1

~ # zpool status
  pool: test2
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test2       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ram0    ONLINE       0     0     0
            ram1    ONLINE       0     0     0

errors: No known data errors


~ # zfs create -V 5G -o volblocksize=8k test2/testvol
~ # dd if=/dev/urandom of=/dev/zvol/test2/testvol bs=1M status=progress


~ # zpool status
  pool: test2
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test2       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ram0    ONLINE       0     0     0
            ram1    ONLINE       0     0     0

errors: No known data errors


~ # zpool scrub test2
~ # zpool detach test2 ram1
~ # zpool add test2 ram1
~ # zpool remove test2 ram0
~ # zpool attach test2 ram1 ram0

~ # zpool status
  pool: test2
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: resilvered 5.11G in 0 days 00:00:15 with 0 errors on Fri Nov 27 07:57:58 2020
remove: Removal of vdev 0 copied 5.05G in 0h0m, completed on Fri Nov 27 07:57:32 2020
    10.2K memory used for removed device mappings
config:

        NAME          STATE     READ WRITE CKSUM
        test2         DEGRADED     0     0     0
          mirror-1    DEGRADED     0     0     0
            ram1      ONLINE       0     0     0
            ram0      DEGRADED     0     0    18  too many errors

errors: No known data errors

Here are the zpool events from a third test, where it happened on about the second try (not sure anymore :) ). test2_ereport.txt

bang-uin avatar Nov 27 '20 07:11 bang-uin

I may have seen the same issue under Rocky Linux 8.4 and ZFS 2.1.0:

[root@zfsscratch ~]# zpool replace scratch 36000402009e01d5d415ae51700000000 36000402002e022454a69ead900000000
cannot open '/dev/mapper/36000402002e022454a69ead900000000': Device or resource busy
         mirror-40                              ONLINE       0     0     0
           36000402002e022454a69908c00000000    ONLINE       0     0     0
           spare-1                              ONLINE       0     0     0
             36000402009e01d5d415ae51700000000  ONLINE       2     0     1
             36000402002e022454a69ead900000000  ONLINE       0     0    97
...
       spares
...
         36000402002e022454a69ead900000000      INUSE     currently in use

The resilver finished and I am now running a zpool clear && zpool scrub.

stuartthebruce avatar Sep 13 '21 15:09 stuartthebruce

Hi,

I experienced the same checksum error report after attaching a brand-new disk to an existing single-disk pool. The system runs Debian 11 with ZFS 2.0.6.

An automatic scrub was running when I zpool attach'ed the device. I did not stop it manually, as the zpool-attach(8) manual says:

new_device begins to resilver immediately and any running scrub is cancelled.

The brand new disk is ata-HGST_HUS728T8TALE6L4_VRH4JTEK. Here is the command I ran (from zpool history):

2021-11-14.02:26:53 zpool attach tank ata-TOSHIBA_HDWG480_71Q0A07SFR0H /dev/disk/by-id/ata-HGST_HUS728T8TALE6L4_VRH4JTEK

I checked the status a few minutes after attaching the device; it was already reporting checksum errors:

# zpool status tank
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Nov 14 02:26:42 2021
        1.25T scanned at 5.17G/s, 33.9G issued at 140M/s, 2.43T total
        33.9G resilvered, 1.36% done, 04:59:16 to go
config:

        NAME                                   STATE     READ WRITE CKSUM
        tank                                   ONLINE       0     0     0
          mirror-0                             ONLINE       0     0     0
            ata-TOSHIBA_HDWG480_71Q0A07SFR0H   ONLINE       0     0     0
            ata-HGST_HUS728T8TALE6L4_VRH4JTEK  ONLINE       0     0 2.72K  (resilvering)

errors: No known data errors

The resilvering is now finished; the checksum error count did not change:

# zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 2.44T in 04:17:38 with 0 errors on Sun Nov 14 06:44:20 2021
config:

        NAME                                   STATE     READ WRITE CKSUM
        tank                                   ONLINE       0     0     0
          mirror-0                             ONLINE       0     0     0
            ata-TOSHIBA_HDWG480_71Q0A07SFR0H   ONLINE       0     0     0
            ata-HGST_HUS728T8TALE6L4_VRH4JTEK  ONLINE       0     0 2.72K

errors: No known data errors

The events:

Nov 14 2021 00:24:01.940729556 sysevent.fs.zfs.scrub_start
        version = 0x0
        class = "sysevent.fs.zfs.scrub_start"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        time = 0x61904911 0x381264d4 
        eid = 0xf

Nov 14 2021 00:24:01.940729556 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "larco"
        history_internal_str = "func=1 mintxg=0 maxtxg=307454"
        history_internal_name = "scan setup"
        history_txg = 0x4b0fe
        history_time = 0x61904911
        time = 0x61904911 0x381264d4 
        eid = 0x10

Nov 14 2021 02:26:35.458148864 sysevent.fs.zfs.vdev_attach
        version = 0x0
        class = "sysevent.fs.zfs.vdev_attach"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0xa35d7634b2634316
        vdev_state = "ONLINE" (0x7)
        vdev_path = "/dev/disk/by-id/ata-HGST_HUS728T8TALE6L4_VRH4JTEK-part1"
        vdev_devid = "ata-HGST_HUS728T8TALE6L4_VRH4JTEK-part1"
        time = 0x619065cb 0x1b4ecc00 
        eid = 0x11

Nov 14 2021 02:26:42.858314108 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "larco"
        history_internal_str = "errors=0"
        history_internal_name = "scan aborted, restarting"
        history_txg = 0x4b644
        history_time = 0x619065d2
        time = 0x619065d2 0x3328d57c
        eid = 0x12

Nov 14 2021 02:26:42.858314108 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "larco"
        history_internal_str = "errors=0"
        history_internal_name = "starting deferred resilver"
        history_txg = 0x4b644
        history_time = 0x619065d2
        time = 0x619065d2 0x3328d57c
        eid = 0x13

Nov 14 2021 02:26:42.858314108 sysevent.fs.zfs.resilver_start
        version = 0x0
        class = "sysevent.fs.zfs.resilver_start"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        resilver_type = "healing"
        time = 0x619065d2 0x3328d57c
        history_hostname = "larco"
        history_internal_str = "errors=0"
        history_internal_name = "starting deferred resilver"
        history_txg = 0x4b644
        history_time = 0x619065d2
        time = 0x619065d2 0x3328d57c
        eid = 0x13

Nov 14 2021 02:26:42.858314108 sysevent.fs.zfs.resilver_start
        version = 0x0
        class = "sysevent.fs.zfs.resilver_start"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        resilver_type = "healing"
        time = 0x619065d2 0x3328d57c

Nov 14 2021 02:26:42.858314108 sysevent.fs.zfs.resilver_start
        version = 0x0
        class = "sysevent.fs.zfs.resilver_start"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        resilver_type = "healing"
        time = 0x619065d2 0x3328d57c
        eid = 0x14

Nov 14 2021 02:26:42.858314108 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        eid = 0x14

Nov 14 2021 02:26:42.858314108 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "larco"
        history_internal_str = "func=2 mintxg=3 maxtxg=308804"
        history_internal_name = "scan setup"
        history_txg = 0x4b644
        history_time = 0x619065d2
        time = 0x619065d2 0x3328d57c
        eid = 0x15

Nov 14 2021 02:26:46.318391389 sysevent.fs.zfs.config_sync
        version = 0x0
        class = "sysevent.fs.zfs.config_sync"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        time = 0x619065d6 0x12fa445d
        eid = 0x16

Nov 14 2021 02:26:53.542552777 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "larco"
        history_internal_str = "attach vdev=/dev/disk/by-id/ata-HGST_HUS728T8TALE6L4_VRH4JTEK-part1 to vdev=/dev/disk/by-id/ata-TOSHIBA_HDWG480_71Q0A07SFR0H-part1"
        history_internal_name = "vdev attach"
        history_txg = 0x4b646
        history_time = 0x619065dd
        time = 0x619065dd 0x2056b2c9
        eid = 0x17

Nov 14 2021 06:44:20.973882171 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "larco"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x4c11f
        history_time = 0x6190a234
        time = 0x6190a234 0x3a0c433b
        eid = 0x18

Nov 14 2021 06:44:20.973882171 sysevent.fs.zfs.resilver_finish
        version = 0x0
        class = "sysevent.fs.zfs.resilver_finish"
        pool = "tank"
        pool_guid = 0xbb90af7e6f5976c2
        pool_state = 0x0
        pool_context = 0x0
        resilver_type = "healing"
        time = 0x6190a234 0x3a0c433b
        eid = 0x19

ioguix avatar Nov 14 '21 22:11 ioguix

For the record, I cleared and then scrubbed my pool. No errors reported:

# zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 04:02:52 with 0 errors on Mon Nov 15 04:13:53 2021
config:

	NAME                                   STATE     READ WRITE CKSUM
	tank                                   ONLINE       0     0     0
	  mirror-0                             ONLINE       0     0     0
	    ata-TOSHIBA_HDWG480_71Q0A07SFR0H   ONLINE       0     0     0
	    ata-HGST_HUS728T8TALE6L4_VRH4JTEK  ONLINE       0     0     0

errors: No known data errors

ioguix avatar Nov 15 '21 09:11 ioguix

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 22 '22 21:11 stale[bot]

I too am experiencing this error. A pool containing one disk had another disk attached to it (mirror), and both resilvering and scrubbing cause checksum errors on the new disk. I can repeat the scrub numerous times and it always finds new checksum errors on every scan.

I've reseated the drive in the backplane a few times with no change. SMART data shows absolutely nothing wrong with either disk, and previous disks connected to this backplane slot never caused errors. Running Debian 11, ZFS 2.1.4.

  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 16K in 00:22:01 with 0 errors on Fri Jan  6 00:07:10 2023
config:

        NAME                              STATE     READ WRITE CKSUM
        tank                              DEGRADED     0     0     0
          mirror-0                        DEGRADED     0     0     0
            wwn-0x55cd2e404c47175f-part3  ONLINE       0     0     0
            wwn-0x55cd2e404c471471-part3  DEGRADED     0     0   166  too many errors

errors: No known data errors

c89williams avatar Jan 06 '23 05:01 c89williams

@c89williams rather than re-seating the disk, try zpool clear tank; zpool scrub tank. If, after the scrub, no errors are shown, then your disk is OK.
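A hands-off sketch of that procedure (the pool name follows the thread; the function name is made up, and the wait loop simply polls `zpool status` rather than relying on any newer subcommand):

```shell
# Clear the counters, start a scrub, poll until it completes, then show
# the final status. Any CKSUM count that comes back after this points at
# the device itself rather than the attach-time race discussed in this issue.
clear_and_scrub() {
    pool="$1"
    zpool clear "$pool" || return 1
    zpool scrub "$pool" || return 1
    while zpool status "$pool" | grep -q 'scrub in progress'; do
        sleep 30
    done
    zpool status "$pool"
}
```

Usage would be `clear_and_scrub tank`, ideally from a screen/tmux session since a full scrub of a large pool can take hours.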

shodanshok avatar Jan 06 '23 08:01 shodanshok

@shodanshok I have tried what you suggested, clearing the error counters and re-scrubbing; however, checksum errors continue to occur.

c89williams avatar Jan 06 '23 08:01 c89williams

@c89williams in this case, I don't think you have the same issue as described here. Have you tried rebooting your machine and re-scrubbing the pool?

shodanshok avatar Jan 07 '23 20:01 shodanshok