zfs icon indicating copy to clipboard operation
zfs copied to clipboard

Data corruption with 519851122b1703b8 ("ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced()")

Open mjguzik opened this issue 1 year ago • 8 comments

This is on FreeBSD, but it should not matter.

I don't have a trivial testcase. The workload which runs into it is rather i/o heavy (package building) and the issue randomly pops up. Manifests itself mostly as strip(1) complaining about invalid file format in various ports, trying to build the same port the second time works just fine.

Note things are a little sketchy since there are 2 unrelated data corruption bugs, the other one being block_cloning (fixed since).

That said, grabbing the tree as of 519851122b1703b8 ("ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced()") results in the non-deterministic corruption. Going one commit below makes it go away. I verified a bunch of times by running the wokload for ~1h -- with the problematic commit first problems show up within ~15 minutes or so. Unfortunately due to the nature of the machinery doing the build it is not easy to grab the corrupted data, but this can be worked out.

At the same time, if you can't repro the problem, access to the affected machine can be arranged.

In the meantime seeing as the commit was supposed to be just an optimization it should be reverted.

The pool uses one drive with 0 magic, created with mere 'zpool create'.

mjguzik avatar Apr 15 '23 20:04 mjguzik

@ahrens

mjguzik avatar Apr 15 '23 20:04 mjguzik

I would suspect the various reproducers from #11900 would work here, if need be (which were, in fact, also package building and sparse copying.)

rincebrain avatar Apr 15 '23 22:04 rincebrain

Thanks, I was one of the main participants in https://github.com/openzfs/zfs/issues/11900 and have hit this again with package building (a handful of GCC's internal libraries like libitm.so were installed as garbage).

I'm going to queue a revert in Gentoo for now. Very much appreciate @mjguzik reporting this so quickly to save me tearing my hair out :)

thesamesam avatar Apr 16 '23 03:04 thesamesam

cc @behlendorf too as you fixed it last time around and hopefully still have some of this paged in to your brain

thesamesam avatar Apr 16 '23 03:04 thesamesam

i wonder why automated testing did not find , that this change causing corruption.

would it make sense to create some repro-case and improve zfs automated testing with such tool/method ?

devZer0 avatar Apr 16 '23 17:04 devZer0

I believe Brian added a test case in #12724; that said, it's a race condition, so unless you have the code instrumented to force lockstep ordering of a specific outcome, it's gonna be nondeterministic whether you hit it in a given test run.

rincebrain avatar Apr 16 '23 19:04 rincebrain

An earlier test we FreeBSD folks did was issuing cp -R on a sufficiently large directory tree to another location on the same dataset. It helped scope down what caused block_cloning corruption but not this, probably because cp -R operates sequentially. Package building on the other hand, especially with jobservers involved, does not necessarily operate sequentially.

vishwin avatar Apr 17 '23 02:04 vishwin

@tonyhutter I'm sure if I do this one or two more times you'll be cross, but uh, "we re-introduced a silent data corruption we previously fixed in 2.1.10" seems like it's worth pinging you and asking for a 2.1.11 reverting 4b3133e671b958fa2c915a4faf57812820124a7b.

rincebrain avatar Apr 17 '23 05:04 rincebrain

@rincebrain thanks, we'll do the revert in master (https://github.com/openzfs/zfs/pull/14761) and then pull it into a new release.

tonyhutter avatar Apr 17 '23 23:04 tonyhutter

will a scrub show this? is there another way to detect this kind of corruption?

psy0rz avatar Jun 05 '23 20:06 psy0rz

A scrub would not, because from ZFS's perspective, it wrote things correctly, it just was at read time that something tried to use the SEEK_HOLE/DATA information to read things and got incorrect results. (I think.)

rincebrain avatar Jun 05 '23 21:06 rincebrain

fixed here https://github.com/openzfs/zfs/pull/14761

mjguzik avatar Mar 30 '24 13:03 mjguzik