[Feature Request]: Expose certain `zdb` object information without super user privileges
Describe the feature you would like to see added to OpenZFS
Expose zdb object information, such as the underlying file checksums, without requiring super user privileges. Such information can currently be viewed by invoking something like: sudo zdb -vvvvvvv -O "$(httm -m=device -n ~/.histfile)" "$(httm -m=relative -n ~/.histfile)"
How will this feature improve OpenZFS?
This feature would make it easier to distinguish whether a file is unique, or is the same as another version, simply by comparing checksums. I wrote a tool, httm, which uses the crude method of distinguishing files on the basis of size and mtime. This works well enough for my purposes in most cases, but for files which are overwritten with the same contents but a different mtime, my tool will report false positives (imagine a file which is overwritten with an identical version when a package is updated, but which now has a different mtime).
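To make the false positive concrete, here is a minimal sketch of the size+mtime comparison (hypothetical, not httm's actual code; the file names are made up):

```rust
// Hypothetical sketch of the crude size+mtime comparison described above
// (not httm's actual code). Two versions are treated as identical iff
// their (length, mtime) keys match.
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

fn version_key(path: &Path) -> io::Result<(u64, SystemTime)> {
    let meta = fs::metadata(path)?;
    Ok((meta.len(), meta.modified()?))
}

fn same_version(a: &Path, b: &Path) -> io::Result<bool> {
    Ok(version_key(a)? == version_key(b)?)
}

fn main() -> io::Result<()> {
    fs::write("v1", b"identical contents")?;
    std::thread::sleep(std::time::Duration::from_millis(1100));
    fs::write("v2", b"identical contents")?; // same bytes, later mtime
    // False positive: reported as different versions despite identical contents.
    assert!(!same_version(Path::new("v1"), Path::new("v2"))?);
    Ok(())
}
```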
Other tools such as rsync use rolling checksums to compare files. As you are no doubt aware, this alternative is much slower than simply reading back file metadata.
This feature might also have the benefit of speeding up, or reducing the CPU burden of, rsync execution, should rsync choose to use the exposed checksums as well.
Additional context
AFAICT this information is not private or secret. It can all readily be obtained with stat and a checksum tool.
If libzfs or another library already exposes such a feature, I'd be pleased to use that library.
I'd like to see the output of zdb without sudo to actually see the error (and ideally without httm, since I don't have it in front of me), but I'm almost certain what I'll see:
$ sudo -u nobody /sbin/zdb -d lucy 0
zdb: can't open 'lucy': Permission denied
zdb actually does all its work in userspace against the raw block devices, without caring if the pool is imported or even if the kernel knows about ZFS at all. So I suspect this is just that your regular user doesn't have access to the block device nodes in /dev that have the pool on them.
My regular user does, so it's working fine:
$ id -u
1000
$ /sbin/zdb -d lucy 0
Dataset mos [META], ID 0, cr_txg 4, 350M, 7879 objects
Object lvl iblk dblk dsize dnsize lsize %full type
0 3 128K 16K 26.9M 512 80M 4.81 DMU dnode
I'm almost certain what I'll see:
This is correct.
So I suspect this is just that your regular user doesn't have access to the block device nodes in /dev that have the pool on them.
Ah. This makes sense. And I presume there is probably no other way to expose/reveal this through another system call or via libzfs (or that it would be a very tall order)?
Thanks.
You could expose it, that's not a hard problem conceptually, since ZFS itself knows that data, without zdb being involved.
There are a couple of reasons that might be a bad idea, though.
- fletcher4 is strong enough to notice bit flips reliably, but not strong enough to confirm that files are identical without further comparison, and it's the default, so people shouldn't be using it as a shortcut against hashing files themselves
- checksums are also against records, not whole files
- some checksums use a per-pool salt, so they aren't comparable across pools
- for native encryption, the checksums are of the encrypted data, and the encryption has a per-dataset key (even if they have a shared encryptionroot), so the same record won't necessarily have the same checksums stored
- the checksum on disk isn't necessarily the checksum of the data when you go to read it - sure, ZFS will throw an error if it's not, but you might get a nasty surprise if you remove one copy of your data in favor of one that turns out not to be intact.
I'd also find this useful, but there are a lot of caveats around leveraging it. Just FYI.
(This is not exhaustive, there are other caveats I didn't feel like writing an explanation of too.)
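To illustrate the first caveat, here is a minimal sketch of the fletcher4 algorithm (four 64-bit accumulators fed by 32-bit words; the real, optimized implementations in OpenZFS also handle byteswapping and SIMD, which this skips). It reliably notices a single flipped bit, but it is a plain running sum, not a collision-resistant hash:

```rust
// Minimal sketch of fletcher4: four 64-bit running sums over 32-bit
// little-endian words. ZFS checksums whole records, which are multiples
// of 4 bytes, so this sketch simply ignores any trailing partial word.
fn fletcher4(data: &[u8]) -> [u64; 4] {
    let (mut a, mut b, mut c, mut d) = (0u64, 0u64, 0u64, 0u64);
    for chunk in data.chunks_exact(4) {
        let w = u32::from_le_bytes(chunk.try_into().unwrap()) as u64;
        a = a.wrapping_add(w);
        b = b.wrapping_add(a);
        c = c.wrapping_add(b);
        d = d.wrapping_add(c);
    }
    [a, b, c, d]
}

fn main() {
    // A single flipped bit ('S' vs 's') changes all four checksum words:
    let x = fletcher4(b"Some record data");
    let y = fletcher4(b"some record data");
    assert_ne!(x, y);
    println!("{:016x?}\n{:016x?}", x, y);
}
```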
FWIW, I've opened #14539 to add a small note to the zdb manpage, since I've seen this confusion before. I don't know if it would have helped here, but at least it will be written down somewhere.
I appreciate @robn your efforts to clarify the situation in the zdb man page, and @rincebrain your discussion of some of the pitfalls.
And I can definitely understand how it might be a pain to implement, simply not worth the engineering effort, or a wart on the implementation. But, especially for my use case, which is, generally, distinguishing between snapshot versions on the most immediate local pool, I can only presume non-privileged access would be much faster than the alternative.
BTW, the alternative is to recalculate a cheap rolling checksum for each file and compare. For moderately sized files (400K), this is 50-75x slower than simply reading back and comparing the metadata, which makes it inadequate for interactive TUI use.
➜ hyperfine -w 3 "httm -n --unique=absolute /usr/bin/httm" "httm -n --unique=default /usr/bin/httm"
Benchmark 1: httm -n --unique=absolute /usr/bin/httm
Time (mean ± σ): 1.250 s ± 0.008 s [User: 0.222 s, System: 1.059 s]
Range (min … max): 1.235 s … 1.259 s 10 runs
Benchmark 2: httm -n --unique=default /usr/bin/httm
Time (mean ± σ): 16.3 ms ± 0.8 ms [User: 13.3 ms, System: 35.8 ms]
Range (min … max): 14.9 ms … 20.7 ms 161 runs
Summary
'httm -n --unique=default /usr/bin/httm' ran
76.64 ± 3.93 times faster than 'httm -n --unique=absolute /usr/bin/httm'
So -- yes, I really can understand how this feature wouldn't be a priority, but please let me add my voice to those who might find it useful. Perhaps it makes more sense in the context of another, more general libzfs feature. Just re: my own projects, I find it's useful to say "No, sorry..." initially, and then let it work in the back of my mind until it's "Maybe, if..."
Thank you both. Understand if you'd prefer to close.
PS: I'd note rsync is not the only other use case. I have wondered (and perhaps you have as well) exactly what happens with ZFS object storage pools, because backing up a ZFS pool to a blob store is currently non-trivial. As an alternative, I can imagine a blob storage solution (like borg or restic) which retains ZFS object information as metadata for each object in its store, and which could confirm, upon restoration, that an object has the same state it had when it was backed up, or even recreate snapshots within its store.
None of what you've said here addresses the fact that ZFS checksums are on records whereas rsync, httm, as well as many others operate on the filesystem-level, not on the dnode level.
This effectively means that two files (which were asynchronously written to disk, became part of the same dirty data buffer window, and were therefore written as a single transaction group) can return the same checksum - that's not behaviour that rsync, httm, or anything else that deals with filesystem-level I/O can (or should) be expected to deal with.
None of what you've said here addresses the fact that ZFS checksums are on records whereas rsync, httm, as well as many others operate on the filesystem-level, not on the dnode level.
Perhaps I misunderstand, but don't those records directly correspond to a filesystem object?
Consider an invocation of sudo zdb -vvvvvvv -O "$(httm -m=device -n ~/.histfile)" "$(httm -m=relative -n ~/.histfile)":
obj=66396 dataset=rpool/USERDATA/kimono_9vk812 path=/.histfile type=19 bonustype=44
Object lvl iblk dblk dsize dnsize lsize %full type
66396 2 128K 128K 448K 1K 1.25M 100.00 ZFS plain file (K=inherit) (Z=inherit=lz4)
176 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 9
uid 1000
gid 1000
atime Tue Feb 28 23:21:43 2023
mtime Wed Mar 1 07:38:53 2023
ctime Wed Mar 1 07:38:53 2023
crtime Tue Feb 28 23:21:43 2023
gen 2507817
mode 100644
size 1222745
parent 34
links 1
pflags 840800000004
Indirect blocks:
0 L1 0:90da984000:1000 20000L/1000P F=10 B=2514234/2514234 cksum=ae8f06325d:24cf5c0fa73b0:3e8be54d2df8efb:75f4a0463a8453b5
0 L0 0:992324a000:9000 20000L/9000P F=1 B=2507817/2507817 cksum=82a8a067799:997f2bee0ba577:463d6c7627d81d8d:6c25a30ae887041c
20000 L0 0:99b6625000:a000 20000L/a000P F=1 B=2507817/2507817 cksum=9399bed46a7:bdf932d5c1a234:fc578189b0333be0:b87ce7045449589
40000 L0 0:99b662f000:e000 20000L/e000P F=1 B=2507817/2507817 cksum=c24b2df0773:169214914288f34:875c3fdae6c01e5b:9f320ce4b561a0c3
60000 L0 0:9923253000:d000 20000L/d000P F=1 B=2507817/2507817 cksum=b55a6fb7964:13208ffa07d1549:f4d9e21f3e15800e:64e9b5792c44393
80000 L0 0:98fc392000:c000 20000L/c000P F=1 B=2507817/2507817 cksum=a79cb0fe372:10474348531f380:7a435ea56e57974e:eddfad71dac26685
a0000 L0 0:9923260000:b000 20000L/b000P F=1 B=2507817/2507817 cksum=9a7069d69ea:dcf3ebd502ae20:be2fc42874c38684:a341e24c20bceedb
c0000 L0 0:992326b000:b000 20000L/b000P F=1 B=2507817/2507817 cksum=99c1192fc44:d7259c74ab935a:7ff464c98b379d6d:86cd9ed775a5057e
e0000 L0 0:9923276000:d000 20000L/d000P F=1 B=2507817/2507817 cksum=b97c3d1b50c:138a7c01d48a2e8:4fa8b28353bb455a:847dcf6f7dd78e45
100000 L0 0:98fc3a2000:c000 20000L/c000P F=1 B=2507817/2507817 cksum=a6a6c894e5a:1097d515cc48dae:b552693317563303:6619656b0136ebb1
120000 L0 0:90da97f000:5000 20000L/5000P F=1 B=2514234/2514234 cksum=3ebdcd68733:2d93f4197efa66:384c0d3822de361f:51a7cfd0e42f7f84
segment [0000000000000000, 0000000000140000) size 1.25M
For my purposes, I'm loading my path data into a Rust BTreeMap to sort and exclude duplicates, and I only need to know if the file versions, that is, the snapshots, differ. Other data will help me do other things, but in this instance, on the same dataset, couldn't I simply compare the checksum records of one file version to the checksum records of another? Any API could also just merge the records for you.
Yes, it might be imperfect when comparing across datasets or with different underlying ZFS properties, but you could also include enough information in any API so that you could just fall back to different behavior.
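A sketch of what I have in mind (hypothetical, not my actual code): key each version by its list of record checksums, however obtained, and keep only the first path seen per key:

```rust
// Hypothetical dedup-by-checksum sketch: versions with identical record
// checksum lists collapse to a single entry; the BTreeMap keeps the
// surviving versions in sorted order.
use std::collections::BTreeMap;

fn unique_versions(versions: Vec<(String, Vec<String>)>) -> BTreeMap<Vec<String>, String> {
    let mut seen: BTreeMap<Vec<String>, String> = BTreeMap::new();
    for (path, cksums) in versions {
        // Keep only the first path observed for a given checksum list.
        seen.entry(cksums).or_insert(path);
    }
    seen
}

fn main() {
    let versions = vec![
        ("/snap1/.zshrc".to_string(), vec!["aaa:bbb".to_string()]),
        ("/snap2/.zshrc".to_string(), vec!["aaa:bbb".to_string()]), // duplicate
        ("/snap3/.zshrc".to_string(), vec!["ccc:ddd".to_string()]),
    ];
    assert_eq!(unique_versions(versions).len(), 2);
}
```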
This effectively means that two files (which were asynchronously written to disk, became part of the same dirty data buffer window, and were therefore written as a single transaction group) can return the same checksum - that's not behaviour that rsync, httm, or anything else that deals with filesystem-level I/O can (or should) be expected to deal with.
I'm not sure I understand your point. Could you further explain your thinking? Why exactly is it a problem that two files have the same checksum?
Let's say you do the following:
echo test1 > test1
echo test2 > test2
echo test1test2 > test3
You can then use sha256 (on FreeBSD at least, your Unix-like may have a different binary that accomplishes the same) to check the checksums of the files:
sha256 test1 test2 test3
These checksums shouldn't be the same if the files contain different information (with the possible exception of collisions resulting from non-cryptographically-secure hashes - but that's outside the scope).
Echoing text into a file (or writing via an editor like nano or vi) is an asynchronous write. This means that the kernel will not wait for the storage subsystem to write the file to disk (*).
ZFS has a mechanism by which it tries to group asynchronous writes, by using what's called a dirty data buffer. There's multiple reasons for doing this, but the biggest is that it ensures that all those individually-asynchronous writes get written sequentially to disk, in something called a transaction group.
This transaction group ends up on disk as what's called a record, which is a variable-length set of LBAs. At some point during the writing process, a checksum for the record is also computed. This checksum covers the entire contents of the record, which if we're looking at the above example would be somewhat-analogous to doing the third command in the above example.
It's not quite the same, but if you wanna know more you're gonna need to understand several books worth of information - all the same, it still means that you'll get a different checksum than the previous checksums.
However, if you were to use zdb to examine test1, test2, or test3, you'd find that, assuming they're part of the same record, they'd all return the same checksum - because they're part of the same record.
*: This behaviour is not unique to ZFS, any filesystem will work like this. Unfortunately it's more complicated than that, because some Unix-likes (including Linux) also implement block devices (in the mistaken belief that this makes things better, because it's faster) which introduce caching that make it impossible to correlate reads, writes, and their results for filesystems that aren't ZFS.
That's not how ZFS records work.
Things do get batched up in txgs, and checksums are of entire records, but the entire txg is not treated as a record.
That's not how ZFS records work.
Things do get batched up in txgs, and checksums are of entire records, but the entire txg is not treated as a record.
I thought about adding a caveat emptor that it's even more complex than described, then promptly forgot about it.
Point is, nobody can rely on the record checksums for the kind of thing they traditionally get used for.
@debdrup appreciate you further explaining your thinking, and @rincebrain you adding your thoughts.
Point is, nobody can rely on the record checksums for the kind of thing they traditionally get used for.
I think we need to think of this in very limited terms. The rsync and httm use case is simply: can we determine whether this file version is the same as, or different from, another file version using the available metadata? Understood, re: rsync, there are a dozen caveats, as the datasets are potentially very different.
And I think it's fair to scope out what I do and don't know, and what I would actually be getting with this API. I said:
Perhaps I misunderstand, but don't those records directly correspond to a filesystem object?
First question: Is this the case?
If the checksums presented by zdb ever do not directly correspond to a file, that might be an issue for my use case, but I'm actually not entirely sure.
For instance, if a 6K file is written to disk with a bunch of other small files in a single record, then overwritten with the same data, but now along with new different small files, and now the checksums are distinct, that would seem to be a false positive.
Second question: Can one mitigate?
Perhaps in combination with knowing the inode has changed for our small file, as well, one could fall back to different behavior?
Proving any of this with a test case, for my use case, seems difficult, as zdb seems to have some issue working on snapshot versions:
failed to lookup dataset=rpool/USERDATA/kimono_9vk812 path=/.zfs: No such file or directory
@debdrup: There are no persistent checksums tied to a transaction group's deltas in ZFS; it is not a log-structured filesystem. Just to call out some things that might be confused for the kind of identifier you're describing:
- The nearest thing is the per-txg ZIL entry and its checksum, but ZIL records are destroyed after they have successfully been played forward onto the (non-SLOG) storage. In many cases they barely exist at all.
- The second nearest thing is the (completed) txg's uberblock, but that, too, is destroyed once the uberblock ring rolls over, and it is a (tree) digest of the entire state of the pool, not merely the updates that took place in the txg that wrote it.
While old, http://www.giis.co.in/Zfs_ondiskformat.pdf continues to mostly describe the on-disk layout of ZFS. The most relevant bits are the following figures (and the discussion around them):
- Figure 6 describes the self-validating structure of an uberblock.
- Figure 8 describes the layout of a block pointer (of which there is one in every uberblock).
It could, in principle, be sensible to expose specifically the L0 block pointers' checksums to userspace, as those scope only over file data, but as pointed out above (https://github.com/openzfs/zfs/issues/14536#issuecomment-1447229400), ZFS will have done record splitting, salting, and encrypting in ways not exposed, and possibly difficult to expose safely, to userspace. Non-L0 block pointers' checksums will scope over pool-internal data, most notably DVAs (as part of the lists of block pointers to which non-L0 block pointers point), and so are probably not of interest to userspace.
The way to get zdb to work on a snapshot is to use the pool/fs@snap syntax, not the .zfs directory (which is a posix illusion created for convenience). I would also second what @nwf and @rincebrain said; @debdrup's description of how checksums work is not accurate.
That said, you could theoretically use the checksum of a block higher up in the dnode to determine if things had changed between one time and another, but it's easier to use the birth time for the blkptrs. That prevents having to worry about checksum collisions or checksum strength. Under the hood, this is how zfs send determines what to include in incremental sends. When combined with nopwrite, this also prevents the issue where mtimes change but the contents of a file stay the same.
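As a rough illustration of the birth-time idea (hypothetical; scraping zdb's text output is only for demonstration, there is no stable API for this): take the maximum birth txg across a file's block pointers, then compare it against the txg of an older version to decide whether anything was rewritten.

```rust
// Hypothetical sketch: extract the physical birth txg from the
// "B=<birth>/<logical birth>" field of `zdb -vvvvv` indirect-block lines
// and return the maximum for the file.
fn max_birth_txg(zdb_output: &str) -> Option<u64> {
    zdb_output
        .lines()
        .filter_map(|line| {
            let field = line.split_whitespace().find(|f| f.starts_with("B="))?;
            field[2..].split('/').next()?.parse::<u64>().ok()
        })
        .max()
}

fn main() {
    let out = "0 L1 0:90da984000:1000 20000L/1000P F=10 B=2514234/2514234 cksum=ae8f...\n\
               0 L0 0:992324a000:9000 20000L/9000P F=1 B=2507817/2507817 cksum=82a8...";
    // The newest block was born in txg 2514234, so the file changed after
    // any snapshot taken at an earlier txg.
    assert_eq!(max_birth_txg(out), Some(2514234));
}
```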
Hat tip to @pcd1193182 re: how to use zdb with snapshot datasets.
Perhaps I misunderstand, but don't those records directly correspond to a filesystem object?
First question: Is this the case?
Re: my modest testing, the checksums seem stable for my use case? I presume I'm missing something.
➜ ~ ls -al .zshrc
-rw-r--r-- 1 kimono kimono 5995 Mar 2 14:06 .zshrc
➜ ~ cp .zshrc zshrc_backup
➜ ~ httm -S .
httm took a snapshot named: rpool/USERDATA/kimono_9vk812@snap_2023-03-02-14:06:25_httmSnapFileMount
➜ ~ cp zshrc_backup .zshrc
➜ ~ httm -S .
httm took a snapshot named: rpool/USERDATA/kimono_9vk812@snap_2023-03-02-14:06:46_httmSnapFileMount
➜ ~ # "unique" below means all snaps which have a different size or mtime
➜ ~ httm --list-snaps=unique -n ~/.zshrc | xargs -I{} sudo zdb -vvvvvv -O "{}" .zshrc | grep cksum
0 L0 0:243be8000:1000 1000L/1000P F=1 B=10748/10748 cksum=14a416247c1:2bc6413cddfb8:3aab0c70e8bbef6:ad1e9fa799317375
0 L0 0:8a420c8000:1000 1800L/1000P F=1 B=48522/48522 cksum=1079408c2eb:241d7e1e32c9d:306332a3153ee63:4aeab7b2cab4042
0 L0 0:ace21b5000:1000 1800L/1000P F=1 B=606838/606838 cksum=1079408c2eb:241d7e1e32c9d:306332a3153ee63:4aeab7b2cab4042
0 L0 0:b9ee98f000:1000 1800L/1000P F=1 B=1899664/1899664 cksum=10f8eeb09e3:26b1d81b707fa:3588e78093a5024:6ab887218728628f
0 L0 0:4ef4de7000:1000 1800L/1000P F=1 B=2225628/2225628 cksum=118940edf93:26cc9a83f3257:34c5801da3fa14e:5707109b8bcccbbc
0 L0 0:54e58b2000:1000 1800L/1000P F=1 B=2246335/2246335 cksum=118940edf93:26cc9a83f3257:34c5801da3fa14e:5707109b8bcccbbc
0 L0 0:a44430c000:1000 1800L/1000P F=1 B=2537184/2537184 cksum=118940edf93:26cc9a83f3257:34c5801da3fa14e:5707109b8bcccbbc
0 L0 0:a44445b000:1000 1800L/1000P F=1 B=2537196/2537196 cksum=118940edf93:26cc9a83f3257:34c5801da3fa14e:5707109b8bcccbbc
don't those records directly correspond to a filesystem object?
Short answer is "no".
A record is a little like a disk block in conventional filesystems: a single addressable "unit" of storage that all other things are made of. For each record it stores, ZFS also computes a checksum and records it in that record's "block pointer", an object that includes one or more locations where the record can be found (the redundant copies), the checksum, and other metadata needed to access and process the record.
ZFS then constructs more complex objects that it wants stored. Ultimately, these get written into records. A single object can use multiple records. A POSIX file object is particularly complicated; there's a record or two devoted to file metadata, and as many data records as necessary to store the file data.
The thing is, you can't infer anything about how the larger object is spread out among those records, as records get split and joined, and various transformations like compression and encryption get applied that mean the stored data might be nothing like the original data, and might not have a 1:1 mapping.
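As a rough mental model of the above (field names are illustrative only; the real blkptr_t layout is in the ZFS on-disk format specification):

```rust
// Illustrative model only -- not the actual on-disk blkptr_t layout.
struct Dva {
    vdev: u32,   // which top-level device holds this copy
    offset: u64, // allocated location on that device
    asize: u64,  // allocated (physical) size
}

struct BlockPointer {
    dvas: Vec<Dva>,     // one or more redundant copies of the record
    birth_txg: u64,     // transaction group in which the record was written
    checksum: [u64; 4], // checksum of the stored (transformed) record
    logical_size: u64,  // record size before compression/encryption
    physical_size: u64, // record size as stored
}

fn main() {
    // A file is then a tree of these: L0 pointers reference data records,
    // higher levels reference arrays of block pointers.
    let bp = BlockPointer {
        dvas: vec![Dva { vdev: 0, offset: 0x5d7d_b4b000, asize: 0x20000 }],
        birth_txg: 1084972,
        checksum: [0x2c8fde42fa88, 0, 0, 0],
        logical_size: 0x20000,
        physical_size: 0x20000,
    };
    assert_eq!(bp.dvas.len(), 1);
    assert_eq!(bp.dvas[0].vdev, 0);
    assert_eq!(bp.logical_size, bp.physical_size); // uncompressed in this example
}
```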
Here's some examples.
I use ZFS native encryption. In this, every record gets encrypted, which modifies its length (the cryptographic signature and other metadata is added to the record). The inputs to the encryption algorithm include the location, so even the same data will come back with different checksums:
$ dd if=/dev/random of=randfile1 bs=1234567 count=1
1+0 records in
1+0 records out
1234567 bytes (1.2 MB, 1.2 MiB) copied, 0.00761135 s, 162 MB/s
$ cp randfile1 randfile2
$ sha1sum randfile1 randfile2
5dee8e24a32eabac7698f09b3524ac0c970a03bd randfile1
5dee8e24a32eabac7698f09b3524ac0c970a03bd randfile2
$ ~/code/zfs/zdb -K $(sudo cat /etc/zfs/lucy.key) -vvvvv -O lucy/home/robn randfile1 | perl -lne 'print $1 if /cksum=(.+)/'
0443a9157dfa2eca:cde1577f47e0bd97:955da35b253f541f:e50fe404e1c20a75
533fac801ad2f0b9:226dcbe81cd10102:9bc53a5dd5135456:296d686c99d55d16
9b9607dec18ae4e2:12ebbefcb2f66de2:a90eaa162204f601:e8937a4c6bac2554
f72036d5b6c29197:cd23258a49e2bb43:e735bf2562386378:31b142299e61557a
7bb4c0126d73722b:2a900aff1eb58373:96124153605088fc:336e13e0e5cd5a92
f4fe06dddfc76006:279e9e0807cac0c6:df00700da6ddf148:d363d73e88cd0da3
cbac89d937a562cd:82120ad293555bc0:f6e745b08c6e3870:34a0efbd2fc9bf83
c91c210994716a6a:b53c7157f97d5e33:043fb7d765407167:3637463d33cfe6c2
fd5216be74496078:9d1d0e62f1eb8262:9c50c9a725ab9698:ad6be28aaef769e5
b319f970fe4eefb2:2e93874d18023119:db06972a9e375144:e511282db6cafd88
79f6baef7b4da3a7:7dc525c316d13a3b:bf86f0ea514975e5:8d326bc8936ebd9b
$ zdb -K $(sudo cat /etc/zfs/lucy.key) -vvvvv -O lucy/home/robn randfile2 | perl -lne 'print $1 if /cksum=(.+)/'
043bded4d5ccc460:c74331542a4a722c:beb421aabb38d6e2:11b72603a92de807
da6dbd6710654b28:e5579951fa1da5e3:458ac8c3e15e41a9:0d204bf2ca52b890
a11c6efee6fbd08a:2d406249a206ce10:314e2ca098fa5a49:6519cbfbc76bb44f
c6aa6206f15f9e95:0fbfed5abb05ff08:7718c47de3615e20:15614763ade1cebe
416422e60a11f411:cb436c8a6c149772:764bd589a2e5ecb8:9184b788d17f4698
5464bd0e1ae73163:64dc0e8d796778e0:a50b7fb8dec6c934:203f8aeaec803567
0c0c97c636f7f1fc:46a32d62c2c89248:2977e1dc355e2efc:10b36d8ac31ab030
ee567dfea2de39f9:9f3366fc2fff98dd:b4d73d0cdd6ac6e6:1bd8a39582b62d7f
cd222e89c8581b2c:6c4a2d15d11c07da:f0cb6aceb2c6d061:fd193c980f0992d7
ed4ddc70bfeacf5e:f8585974dec6c880:58b9beadc7dcf0e7:bf5ced44dbf95efa
2da4612d43eb5090:a5d1efddfe442539:16ebbc4747575fca:0650fa4e434fd3fb
Even on a dataset without encryption, we can change the compression algorithm used for new records; old records will not be rewritten, so will be different:
# zfs create -o compression=lz4 garden/test
# dd if=/dev/random of=/garden/test/file1 bs=1234567 count=1
1+0 records in
1+0 records out
1234567 bytes transferred in 0.010657 secs (115844391 bytes/sec)
# zfs set compression=zstd garden/test
# cat /garden/test/file1 > /garden/test/file2
# sha1sum /garden/test/file1 /garden/test/file2
98fdb25b478519f05867b35048b5f590b820f307 /garden/test/file1
98fdb25b478519f05867b35048b5f590b820f307 /garden/test/file2
# zdb -vvvvv -O garden/test file1 | perl -lne 'print $1 if /cksum=(.+)/'
000000a9053dac01:000241b329e42601:03e15433234c6fd5:79bfc1aa2a6bc5d2
0000404bd73f9e97:1013e1cede1543ad:eb6dfbecdd355a01:02596661ea6c341e
00003ff12696991d:0ffe875dc75e7d0c:2d2e59b6376b5224:f29729df9806b504
00003f9a6f0064cf:0fe3e69e101b1db3:071ef29f1cc83231:8ceeb982c1e2662a
00003fe38c81f182:0ff5bd9225ee64c1:a9f02b90c1e95f3d:715ace6be7406733
00003ffbbb9ae036:0ffc94ccd530291e:920fd59aa3439291:8a91d3e4cf5794e8
00003fed5c838ba3:1006cf4412f80399:73a2f092ba68031a:10f1dba20d99a9dd
000040102317e57d:10028c2e441c9566:2f208804c776193a:96f3c7d2cecedd0b
0000402d01690d45:1009b49efcfa7129:6354b18941a6b3f5:35097e035127f9a0
00003fd3e9726f63:0ffa1e1c87eb1031:7082e2bd82d2974c:1696b735f40487f9
00001b7aa8db97dc:03170270c3645d01:d6348c5b1356808e:a00fb0070a078684
# zdb -vvvvv -O garden/test file2 | perl -lne 'print $1 if /cksum=(.+)/'
000000a8f68e2294:0002414e9c1270a9:03e0477593795c3c:78113dada6d275a6
0000404bd73f9e97:1013e1cede1543ad:eb6dfbecdd355a01:02596661ea6c341e
00003ff12696991d:0ffe875dc75e7d0c:2d2e59b6376b5224:f29729df9806b504
00003f9a6f0064cf:0fe3e69e101b1db3:071ef29f1cc83231:8ceeb982c1e2662a
00003fe38c81f182:0ff5bd9225ee64c1:a9f02b90c1e95f3d:715ace6be7406733
00003ffbbb9ae036:0ffc94ccd530291e:920fd59aa3439291:8a91d3e4cf5794e8
00003fed5c838ba3:1006cf4412f80399:73a2f092ba68031a:10f1dba20d99a9dd
000040102317e57d:10028c2e441c9566:2f208804c776193a:96f3c7d2cecedd0b
0000402d01690d45:1009b49efcfa7129:6354b18941a6b3f5:35097e035127f9a0
00003fd3e9726f63:0ffa1e1c87eb1031:7082e2bd82d2974c:1696b735f40487f9
00001adabe3f2fec:0311edcf8459fbd9:8b35da7d29c1587c:379d77c57e5acab6
Changing the checksum algorithm would produce a similar effect.
Even within the same dataset and without changing properties, we can manipulate things such that the "same" file has two different on-disk representations:
# zfs create -o compression=off -o encryption=off lucy/test
## copy a file, zero out a whole record
# dd if=/usr/share/dict/words of=/lucy/test/file1 bs=4K count=256
# dd if=/dev/zero of=/lucy/test/file1 bs=4K seek=32 count=32 conv=notrunc
## copy the first record of a file, then seek past where the second would be and copy the rest
# dd if=/usr/share/dict/words of=/lucy/test/file2 bs=4K count=32
# dd if=/usr/share/dict/words of=/lucy/test/file2 bs=4K skip=64 seek=64 count=192 conv=notrunc
# sha1sum /lucy/test/file1 /lucy/test/file2
c395fd6ae082379d73a7f99a64edd2d535cdb135 /lucy/test/file1
c395fd6ae082379d73a7f99a64edd2d535cdb135 /lucy/test/file2
# zdb -vvvvv -O lucy/test file1 | sed -n -e '/Indirect/,$p'
Indirect blocks:
0 L1 0:5d39bad000:1000 20000L/1000P F=8 B=1084974/1084974 cksum=000000a45e08722f:00023e9b60a7e88e:03f132f5a550e877:a35c3900d017395f
0 L0 0:5d7db4b000:20000 20000L/20000P F=1 B=1084972/1084972 cksum=00002c8fde42fa88:0b2810ce10228929:22a5bf049e05f7c7:70a68cdb09f7cbe1
20000 L0 0:5d7debd000:20000 20000L/20000P F=1 B=1084974/1084974 cksum=0000000000000000:0000000000000000:0000000000000000:0000000000000000
40000 L0 0:5d7dbce000:20000 20000L/20000P F=1 B=1084972/1084972 cksum=000030386190126c:0c0755aec8c88ff9:7e25b9ad0e4e0655:0ef4fcebca2c5140
60000 L0 0:5d7dc0e000:20000 20000L/20000P F=1 B=1084972/1084972 cksum=00002fedb9ec561a:0bff931a4a387933:e55bfe291f80180b:48e3f0bfa01eaae2
80000 L0 0:5d7dc55000:20000 20000L/20000P F=1 B=1084972/1084972 cksum=00003050eec317cc:0c1f1f57b132187c:cffa722ceb545fc9:2fb68c497edba488
a0000 L0 0:5d7ddd8000:20000 20000L/20000P F=1 B=1084972/1084972 cksum=000030b6839b7cee:0c3212a4303f4249:6d61192e2b18eb3a:98c320bd97d4bea1
c0000 L0 0:5d7de66000:20000 20000L/20000P F=1 B=1084972/1084972 cksum=00003096cd2ee5f7:0c18996100dec5b8:632177b4e3d9810c:c62b3d5c97518032
e0000 L0 0:5d7de9d000:20000 20000L/20000P F=1 B=1084972/1084972 cksum=000015d0bd92a82a:087ad6c505ef52ee:a0e4eb24cbe118da:2d8d644eed2e6aac
segment [0000000000000000, 0000000000100000) size 1M
# zdb -vvvvv -O lucy/test file2 | sed -n -e '/Indirect/,$p'
Indirect blocks:
0 L1 0:ebd086000:1000 20000L/1000P F=7 B=1084977/1084977 cksum=0000009cde4feaf0:00022460b71c249d:03c2e60fcd4f74fc:6c614925df4959da
0 L0 0:56ac518000:20000 20000L/20000P F=1 B=1084976/1084976 cksum=00002c8fde42fa88:0b2810ce10228929:22a5bf049e05f7c7:70a68cdb09f7cbe1
40000 L0 0:56ac57b000:20000 20000L/20000P F=1 B=1084977/1084977 cksum=000030386190126c:0c0755aec8c88ff9:7e25b9ad0e4e0655:0ef4fcebca2c5140
60000 L0 0:56ac538000:20000 20000L/20000P F=1 B=1084977/1084977 cksum=00002fedb9ec561a:0bff931a4a387933:e55bfe291f80180b:48e3f0bfa01eaae2
80000 L0 0:56ac59b000:20000 20000L/20000P F=1 B=1084977/1084977 cksum=00003050eec317cc:0c1f1f57b132187c:cffa722ceb545fc9:2fb68c497edba488
a0000 L0 0:56ac5db000:20000 20000L/20000P F=1 B=1084977/1084977 cksum=000030b6839b7cee:0c3212a4303f4249:6d61192e2b18eb3a:98c320bd97d4bea1
c0000 L0 0:56ac5bb000:20000 20000L/20000P F=1 B=1084977/1084977 cksum=00003096cd2ee5f7:0c18996100dec5b8:632177b4e3d9810c:c62b3d5c97518032
e0000 L0 0:56ac5fb000:20000 20000L/20000P F=1 B=1084977/1084977 cksum=000015d0bd92a82a:087ad6c505ef52ee:a0e4eb24cbe118da:2d8d644eed2e6aac
segment [0000000000000000, 0000000000020000) size 128K
segment [0000000000040000, 0000000000100000) size 768K
The first has a full record of all-zeroes; the second has a "hole" which the filesystem will inflate into a buffer of all-zeroes when a program asks for it.
This isn't all the possibilities, of course, just a handful that I could casually produce. The point of all this is just to show that the same apparent data at the filesystem level can yield very different record checksums under the hood. A difference in record checksums doesn't necessarily mean a difference in the file contents.
You can't even really assume the other way either, that matching checksums mean the same data. Most of the time the checksum will be the default fletcher4, which is not collision-resistant over two arbitrary inputs. That's fine for what it is; the whole point of record checksums is to notice bit flips from cosmic rays and confused disk controllers, and thus let ZFS know it should go and try another copy of the data that isn't broken. If you found two unrelated records with the same checksum on a healthy pool, it's pretty likely that they won't have the same data at all.
So .. yeah. I totally get what you're trying to do, and it's a pretty sensible thing to dream of too - I've wanted it myself for exactly the same reasons you do (efficient file-level backups). Alas, it's just not what record checksums are for.
@robn Appreciate your explanations. A few points:
- Although the rsync case is given as an example use case, it is an additional use case. My own use case is, primarily, searching for unique snapshot versions on the same dataset.
- I'm not certain the problems you identify re: the rsync case are actually problems, considered in context.
First, I agree that the fact there may be two files on the same dataset with the same contents, but which don't have the same checksum, at least, seems troubling. Yet, for my narrow use case, I am not sure we should particularly care. What I think we care most about is that we can identify a snapshot file version has changed as compared against another version of the same file. If we can't do that, then, yes, I think we have a problem.
See below re: example output of my program --
➜ httm ~/.zshrc
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Thu Oct 27 22:35:57 2022 3.8 KiB "/home/kimono/.zfs/snapshot/autosnap_2022-10-28_05:01:33_monthly/.zshrc"
Wed Nov 16 14:25:41 2022 5.9 KiB "/home/kimono/.zfs/snapshot/autosnap_2023-01-01_00:00:23_monthly/.zshrc"
Thu Jan 26 11:35:12 2023 5.9 KiB "/home/kimono/.zfs/snapshot/autosnap_2023-02-01_00:00:29_monthly/.zshrc"
Tue Feb 14 13:42:19 2023 5.9 KiB "/home/kimono/.zfs/snapshot/snap_2023-02-20-14:04:21_prepApt/.zshrc"
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Thu Mar 02 14:06:41 2023 5.9 KiB "/home/kimono/.zshrc"
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Second, re: the rsync use case, which is again -- not my circus: I think anyone trying to do something similar with rsync or blob storage could simply constrain their problem. Perhaps, when I use my rsync-like tool or blob storage, I just identify files which have changed on the source compared to a source snapshot, and leave the target up to the target, which, if you squint a little, is kinda how incremental snapshots work. And re: the mechanism, zfs diff already does something very similar?
I'm not saying this is something that should be high priority for ZFS. And I'm not saying that my use case is very exciting, or important. I'm saying it isn't completely ridiculous.
My feeling is "let a thousand flowers bloom" -- access to more ZFS internal information in user space via a library could be really interesting.
I was speaking to the specific question about the relationship between POSIX files and ZFS records, and in context, why ZFS record checksums aren't really related to a file's metadata or content. That was how I was understanding this thread: you asked, in essence, "can I have a nicer way to get the checksums?" and the answer has been "maybe, but it might not be what you want."
You mention zfs diff. If that is what you need, you might be in luck: there's already an ioctl, ZFS_IOC_OBJ_TO_STATS, to get that info for zfs diff. There's no public interface wrapping it, but it should be a stable interface, and I would expect we could make something for library authors to bind to (not really my call to make, but I can see it wouldn't be a huge lift).
https://github.com/openzfs/zfs/blob/master/include/sys/zfs_stat.h#L42 https://github.com/openzfs/zfs/blob/master/lib/libzfs/libzfs_diff.c#L67
Past that, if there's a specific piece of ZFS internals that would help write useful programs, I don't see any problem with exposing it in a sensible way, but I think a generic "more information please!" is probably not going to go very far, because we likely want to avoid exposing internals without a clear use case. Probably, nastily wrapping zdb is the way to prototype such things.
Just wanted to add a tl;dr: ZFS generates checksums from RAW on-disk data blocks, not from the original file's full data (compression, encryption, recordsize, etc. WILL change checksums).
ztour would be a nice example of exposing that information to people, if anyone ever actually released the prototype after writing it.
More generally, you might want something more like #12837 if you want to do that kind of diff information en masse, or you'll have the same crappy scaling behavior zfs diff does of needing to make multiple ioctls per object.
... it's easier to use the birth time for the blkptrs. That prevents having to worry about checksum collision or checksum strength. Under the hood, this is how zfs send determines what to include in incremental sends. When combined with nopwrite, this also prevents the issue where mtimes change but the contents of a file stay the same.
@pcd1193182 could you explain this? What are the birth times for the block pointers? Are you referring to zs_gen or zs_ctime? I've looked at both, and neither appears to be an effective substitute for checksums.
➜ httm git:(master) ✗ cat /sys/module/zfs/parameters/zfs_nopwrite_enabled
1
➜ httm git:(master) httm --list-snaps --unique=metadata -n ~/.zshrc | xargs -I{} sudo zdb -vvvvvv -O "{}" .zshrc | grep crtime
crtime Thu Oct 27 23:35:57 2022
crtime Thu Oct 27 23:35:57 2022
crtime Wed Nov 16 14:25:41 2022
crtime Wed Nov 16 14:25:41 2022
crtime Wed Nov 16 14:25:41 2022
crtime Wed Nov 16 14:25:41 2022
crtime Wed Nov 16 14:25:41 2022
crtime Wed Nov 16 14:25:41 2022
➜ httm git:(master) httm --list-snaps --unique=metadata -n ~/.zshrc | xargs -I{} sudo zdb -vvvvvv -O "{}" .zshrc | grep gen
gen 10748
gen 10748
gen 606838
gen 606838
gen 606838
gen 606838
gen 606838
gen 606838
➜ httm git:(master) httm --list-snaps --unique=metadata -n ~/.zshrc | xargs -I{} sudo zdb -vvvvvv -O "{}" .zshrc | grep cksum
0 L0 0:243be8000:1000 1000L/1000P F=1 B=10748/10748 cksum=14a416247c1:2bc6413cddfb8:3aab0c70e8bbef6:ad1e9fa799317375
0 L0 0:8a420c8000:1000 1800L/1000P F=1 B=48522/48522 cksum=1079408c2eb:241d7e1e32c9d:306332a3153ee63:4aeab7b2cab4042
0 L0 0:ace21b5000:1000 1800L/1000P F=1 B=606838/606838 cksum=1079408c2eb:241d7e1e32c9d:306332a3153ee63:4aeab7b2cab4042
0 L0 0:b9ee98f000:1000 1800L/1000P F=1 B=1899664/1899664 cksum=10f8eeb09e3:26b1d81b707fa:3588e78093a5024:6ab887218728628f
0 L0 0:4ef4de7000:1000 1800L/1000P F=1 B=2225628/2225628 cksum=118940edf93:26cc9a83f3257:34c5801da3fa14e:5707109b8bcccbbc
0 L0 0:54e58b2000:1000 1800L/1000P F=1 B=2246335/2246335 cksum=118940edf93:26cc9a83f3257:34c5801da3fa14e:5707109b8bcccbbc
0 L0 0:a44430c000:1000 1800L/1000P F=1 B=2537184/2537184 cksum=118940edf93:26cc9a83f3257:34c5801da3fa14e:5707109b8bcccbbc
0 L0 0:a44445b000:1000 1800L/1000P F=1 B=2537196/2537196 cksum=118940edf93:26cc9a83f3257:34c5801da3fa14e:5707109b8bcccbbc
➜ httm git:(master) httm --unique=metadata -n ~/.zshrc | xargs -I{} cksum "{}"
3137138837 3866 /home/kimono/.zfs/snapshot/autosnap_2022-10-28_05:01:33_monthly/.zshrc
1187566015 5996 /home/kimono/.zfs/snapshot/autosnap_2022-11-01_00:01:38_monthly/.zshrc
1187566015 5996 /home/kimono/.zfs/snapshot/autosnap_2023-01-01_00:00:23_monthly/.zshrc
1847129262 5997 /home/kimono/.zfs/snapshot/autosnap_2023-02-01_00:00:29_monthly/.zshrc
1854486599 5995 /home/kimono/.zfs/snapshot/autosnap_2023-02-14_00:00:17_weekly/.zshrc
1854486599 5995 /home/kimono/.zfs/snapshot/snap_df89becc_prepApt/.zshrc
1854486599 5995 /home/kimono/.zfs/snapshot/snap_2023-03-02-14:06:25_httmSnapFileMount/.zshrc
1854486599 5995 /home/kimono/.zfs/snapshot/autosnap_2023-03-03_02:00:39_hourly/.zshrc
1854486599 5995 /home/kimono/.zshrc
Notwithstanding the above, I'd be very pleased if I/we could get this done with a statx call, or simply by exposing an interface that already exists, like zfs_stat.
First, nopwrite isn't going to matter if you're using fletcher4, it only applies for stronger checksums.
Second, the B= for the top-level block in the tree there is probably closest to the value you want, as if that changed, something underneath it changed. crtime is the creation time of the file object, so for a given filename it will not change no matter what happens to the contents, unless you do, say, cp file1 file2 && mv file2 file1; but the birth time for the root object of the tree of meta+data for the file is always going to change if any of the contents does.
(Top here meaning not the L0 data records, but the first one listed in the output of records, the metadata blocks describing where those data blocks are.)
(That would, notably, for example, not catch things like twiddling the mtime on the file, or xattrs, but I guess it's a question of what kind of change you want to capture...)
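As a rough illustration of using B= rather than checksums, a hedged sketch that scrapes the birth txg out of zdb block-pointer lines follows. Field positions are assumed from the zdb output shown earlier in this thread, and zdb's text output is not a stable interface, so treat this as a prototype only:

```shell
#!/bin/sh
# Reads `zdb -vvvvvv -O pool/ds path` output on stdin and prints
# "<level> <logical-birth-txg>" for each block-pointer line by pulling
# apart the B=logical/physical field. If the top-level (highest L)
# birth differs between two file versions, something under it changed.
zdb_births() {
    awk 'match($0, /B=[0-9]+\/[0-9]+/) {
        birth = substr($0, RSTART + 2, RLENGTH - 2)  # "logical/physical"
        split(birth, t, "/")
        print $2, t[1]  # block level (e.g. L0) and logical birth txg
    }'
}
```

Usage would be something like `sudo zdb -vvvvvv -O "$dataset" "$file" | zdb_births | sort -u` for each version, then comparing the results.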
@rincebrain is right to point to: https://github.com/openzfs/zfs/pull/12837
I think the performance described there, and what I've experienced using zdb, would be a blocker. I have experimented with just using zdb with super user privileges, and for 1) my pathological case of a 4MB file with several same-sized versions, and 2) a multi-GB file with two same-sized versions, the comparison operation for both took about 3.5 seconds, whereas I've been able to get the brute-force search (read back and hash all file versions) operation down to 300ms for case 1). FWIW, a zdb lookup on an object is about 250ms per object, and a typical soup-to-nuts execution of httm searching for unique file versions with formatted output, using only metadata, is sub-15ms.
And it seems apparent that ZFS is what is taking so long to return those stats in this use case. zfs diff seems pretty fast, but I assume that's because it's just iterating through a list of objects. If I knew more about C, I'd help review that PR!
I have also taken a look at the interfaces @robn suggested:
https://github.com/openzfs/zfs/blob/master/include/sys/zfs_stat.h#L42 https://github.com/openzfs/zfs/blob/master/lib/libzfs/libzfs_diff.c#L67
I suppose exposing something like the interface above, with some sort of access to the hashes, would be the best path forward. TBC, I really have no idea what type of interface, or what way of exposing such information, would be most appropriate, other than that I'd like it to be accessible through a library, and, if you/the project absolutely must gate access to such information as privileged, that you allow access to a program executed by a user with zfs allow privileges.
I hope https://github.com/openzfs/zfs/pull/12837 gets resolved soon. Thanks for both of your help.
For the record, the B= segment in a zdb line is the logical and physical birth times of the block in question. zfs diff operates in basically the same way as zfs send, using these to only explore the parts of the block tree that have been changed. If you want more details about how send (and diff) work, you can check out some OpenZFS Dev Summit talks on the subject; this link is a few years old at this point, but the basic function of incremental send hasn't changed since then, and that's the logic we're talking about here.
I've realized that my thinking about this issue had become a little inflexible.
I'm pretty certain that all the same things I've outlined as benefits achieved by knowing the underlying checksums could probably also be achieved by knowing that all the block pointers (?) for one object point to the same data on disk as another object's.
So -- it would seem something like FIEMAP support (?) might resolve this issue, if the information is similar to that provided by filefrag:
sudo filefrag -v ./file
Filesystem type is: ef53
File size of ./file (153600 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 51200 2048
1 2048 237441 53247 384
2 2432 76160 237824 512
3 3072 76800 76671 1152
4 4224 53376 77951 2048
5 6272 77952 55423 1920
6 8192 67456 79871 128
7 8320 55424 67583 1028
8 9348 230981 56451 380
9 9728 56832 231360 160
10 9888 34424 56991 92
...
See also: https://github.com/openzfs/zfs/pull/9554, and https://github.com/openzfs/zfs/issues/264.
Am I missing something? Can I obtain this information for ZFS through other means?
Thanks.
I don't think that information is sufficient for you - imagine if I overwrote section 128k-256k with something that happened to compress identically (say, by not compressing at all) but was different data: you would see the same output from something like filefrag -v, but different content.
I agree that DVA information would probably suffice for knowing whether the same region in two different things is the same data (though not different copies of the same data), but what filefrag is reporting is not. (BPs also include the checksum data, so...)
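To make the DVA-comparison idea concrete, here is a hedged sketch, again scraping zdb's unstable text output; it assumes, per the zdb lines earlier in this thread, that the third field of each record line is the vdev:offset:asize DVA:

```shell
#!/bin/sh
# Prints "<file-offset> <dva>" for each L0 data record in
# `zdb -vvvvvv -O` output. Two file versions whose offset->DVA maps
# are identical literally share the same on-disk blocks (as with
# snapshots of unchanged files, or nopwritten data). Differing DVAs
# say nothing either way about the *contents*.
zdb_dvas() {
    awk '$2 == "L0" && $3 ~ /^[0-9a-f]+:[0-9a-f]+:[0-9a-f]+$/ { print $1, $3 }'
}
# Example comparison of two captured zdb dumps:
#   zdb_dvas < version1.zdb > a; zdb_dvas < version2.zdb > b; cmp -s a b
```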