Add TXG timestamp database
Motivation and Context
This feature enables tracking of when TXGs are committed to disk, providing an estimated timestamp for each TXG.
With this information, it becomes possible to perform scrubs based on specific date ranges, improving the granularity of data management and recovery operations.
Description
To achieve this, we implemented a round-robin database that records when TXGs were committed. The tracking is split into minute, day, and year resolutions, which we believe gives a good balance of granularity across time scales. The feature does not record the exact time of each transaction group (TXG); it provides an estimate. The TXG database can also be used in other scenarios where mapping dates to transaction groups is required.
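For illustration, the intended command-line usage is roughly the following sketch (the date format shown is only an example, and it assumes -S/-E denote the start and end of the range):

```sh
# Illustrative sketch; the accepted date format is an assumption.
# Scrub only the data written between the two points in time, based on
# the TXG timestamp database (range passed via -S and -E).
zpool scrub -S "2025-06-01" -E "2025-06-10" testpool
```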
How Has This Been Tested?
- Create a pool
- Write data
- Wait some time
- Write data again
- Wait some time
- Try scrubbing different time ranges (see the sketch below)
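A rough sketch of that manual flow (pool name, vdev path, wait times, and dates are illustrative):

```sh
# Manual test sketch; names, sizes, and dates are illustrative.
truncate -s 1G /var/tmp/vdev_a
zpool create testpool /var/tmp/vdev_a

dd if=/dev/urandom of=/testpool/file_0 bs=1M count=1   # first write
sleep 120                                              # wait some time
dd if=/dev/urandom of=/testpool/file_1 bs=1M count=1   # second write
sleep 120

# Scrub different time ranges and inspect what was scanned.
zpool scrub -S "2025-06-01" -E "2025-06-10" testpool
zpool status testpool
```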
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Performance enhancement (non-breaking change which improves efficiency)
- [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
- [ ] Documentation (a change to man pages or other documentation)
Checklist:
- [x] My code follows the OpenZFS code style requirements.
- [x] I have updated the documentation accordingly.
- [x] I have read the contributing document.
- [ ] I have added tests to cover my changes.
- [ ] I have run the ZFS Test Suite with this change applied.
- [x] All commit messages are properly formatted and contain `Signed-off-by`.
It crashes on VERIFY(!dmu_objset_is_dirty(dp->dp_meta_objset, txg)).
This reminds me that we recently added ddp_class_start to the new dedup table entry format so that the DDT can be pruned based on time. I wonder whether we could have saved some space if we had had this mechanism back then.
Forgot to mention this earlier - can you add a test case to exercise zpool scrub -S|-E? Please include all weird edge cases, like invalid dates/ranges, setting timezones forward/backwards, and testing -S|-E against pools where the feature isn't enabled.
Unfortunately, I don't have an idea how to add such a test: to test it we would need to wait for the rrd (round-robin database) entries to be created, which would make the test very long. Do you have any suggestions?
The test case could temporarily set the system clock forward to simulate the passage of time.
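Roughly along these lines (a sketch only, assuming the test runs as root and nothing re-syncs the clock mid-test):

```sh
# Sketch: simulate elapsed time by moving the clock, then restore it.
# Assumes root and that no NTP daemon corrects the clock mid-test.
saved_date=$(date "+%Y-%m-%d %H:%M:%S")

date --set="+10 days"                                   # jump forward
dd if=/dev/urandom of=/testpool/file_1 bs=1M count=1    # write "newer" data
zpool sync testpool

date --set="$saved_date"                                # put the clock back
```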
Can we re-run the tests? It seems they have timed out. I don’t see any indication of an error, at least for now.
@oshogbo Many of them actually crashed on the same assertion:
[ 6543.510793] VERIFY0(spa->spa_checkpoint_txg) failed (0 == 15)
[ 6543.511146] PANIC at spa.c:5224:spa_ld_read_checkpoint_txg()
[ 6543.511407] Showing stack for process 450801
[ 6543.512011] CPU: 0 PID: 450801 Comm: zpool Kdump: loaded Tainted: P OE ------- --- 5.14.0-503.35.1.el9_5.x86_64 #1
[ 6543.512296] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 6543.512736] Call Trace:
[ 6543.513202] <TASK>
[ 6543.513683] dump_stack_lvl+0x34/0x48
[ 6543.515313] spl_panic+0xd1/0xe9 [spl]
[ 6543.516916] ? allocate_cgrp_cset_links+0x89/0xa0
[ 6543.520874] ? spl_kmem_alloc_impl+0xb0/0xd0 [spl]
[ 6543.521241] ? spl_kmem_alloc_impl+0xb0/0xd0 [spl]
[ 6543.521597] ? srso_alias_return_thunk+0x5/0xfbef5
[ 6543.522182] ? __kmalloc_node+0x4e/0x140
[ 6543.522598] ? srso_alias_return_thunk+0x5/0xfbef5
[ 6543.522856] ? spl_kmem_alloc_impl+0xb0/0xd0 [spl]
[ 6543.523118] ? srso_alias_return_thunk+0x5/0xfbef5
[ 6543.523358] ? __list_add+0x12/0x30 [spl]
[ 6543.523650] ? __dprintf+0x120/0x190 [zfs]
[ 6543.539257] spa_ld_read_checkpoint_txg+0x194/0x1d0 [zfs]
[ 6543.539724] ? srso_alias_return_thunk+0x5/0xfbef5
[ 6543.539858] ? srso_alias_return_thunk+0x5/0xfbef5
[ 6543.539988] ? spa_import_progress_set_notes_impl+0x103/0x200 [zfs]
[ 6543.540410] ? srso_alias_return_thunk+0x5/0xfbef5
[ 6543.540569] ? srso_alias_return_thunk+0x5/0xfbef5
[ 6543.540709] ? spa_import_progress_set_notes+0x5b/0x80 [zfs]
[ 6543.541124] spa_load_impl.constprop.0+0x10d/0x720 [zfs]
[ 6543.541558] spa_load+0x76/0x140 [zfs]
[ 6543.542288] spa_load_best+0x138/0x2c0 [zfs]
[ 6543.542930] spa_import+0x28a/0x780 [zfs]
[ 6543.543526] ? free_unref_page+0xf2/0x130
[ 6543.543762] zfs_ioc_pool_import+0x140/0x160 [zfs]
[ 6543.544348] zfsdev_ioctl_common+0x690/0x760 [zfs]
[ 6543.544959] ? srso_alias_return_thunk+0x5/0xfbef5
[ 6543.545230] ? _copy_from_user+0x27/0x60
[ 6543.545593] zfsdev_ioctl+0x53/0xe0 [zfs]
[ 6543.546157] __x64_sys_ioctl+0x8a/0xc0
[ 6543.546526] do_syscall_64+0x5f/0xf0
[ 6543.546826] ? srso_alias_return_thunk+0x5/0xfbef5
[ 6543.547096] ? exc_page_fault+0x62/0x150
[ 6543.547411] entry_SYSCALL_64_after_hwframe+0x78/0x80
[ 6543.547966] RIP: 0033:0x7ff05670313b
[ 6543.550135] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ad 4c 0f 00 f7 d8 64 89 01 48
[ 6543.550491] RSP: 002b:00007ffe5bd417f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 6543.550779] RAX: ffffffffffffffda RBX: 000056329756ce10 RCX: 00007ff05670313b
[ 6543.551046] RDX: 00007ffe5bd42960 RSI: 0000000000005a02 RDI: 0000000000000003
[ 6543.551290] RBP: 00007ffe5bd45f60 R08: 0000000000000003 R09: 0000000000000000
[ 6543.551529] R10: 0000000010000000 R11: 0000000000000246 R12: 00007ffe5bd41960
[ 6543.551779] R13: 000056329755c2e0 R14: 00007ffe5bd42960 R15: 00007ff050002da8
[ 6543.552096] </TASK>
I think everything should be fixed now.
I have addressed the feedback.
We need some more eyes here.
Also the test probably needs to be added to one or more runfiles?
Also, it looks like the test doesn't work in at least some contexts. date --set doesn't work on any of the Ubuntu VMs I have to play with.
What is the user experience like here if their clock changes? Because of the linear search, we will always find the first time interval in each DB that matches, so theoretically, if your clock goes backwards and you then have a problem, the date range you specify will match the first time the clock hit that point rather than the second. So you might end up not scrubbing the period in question. We might want to put a note about that somewhere in the docs for this feature.
I have added the note to the man page.
> Also, it looks like the test doesn't work in at least some contexts. date --set doesn't work on any of the Ubuntu VMs I have to play with.
I have used an Ubuntu VM and it works fine:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.1 LTS
Release: 22.04
Codename: jammy
$ date --version
date (GNU coreutils) 8.32
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by David MacKenzie.
> Also the test probably needs to be added to one or more runfiles?
I had missed this one, but now it should be addressed.
It seems that ntpd is correcting the time in CI/CD:
Tue Jun 10 17:16:05 UTC 2025
SUCCESS: date --set=-10 days
SUCCESS: zpool create -o failmode=continue testpool2 /var/tmp/testdir/vdev_a
SUCCESS: dd if=/dev/random of=/testpool2/0_file bs=1M count=1
SUCCESS: zpool export testpool2
SUCCESS: zpool import -d /var/tmp/testdir testpool2
Sat Jun 21 17:16:06 UTC 2025
SUCCESS: date --set=+1 days
Because of that, the tests are failing. I had the same thing on my machine; disabling the ntpd daemon helps here.
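If it is acceptable in CI, the test could pause time synchronization for its duration; a rough sketch (which service is active, ntpd, chronyd, or systemd-timesyncd, depends on the host):

```sh
# Sketch: keep NTP from re-correcting the clock while the test manipulates it.
timedatectl set-ntp false      # or: systemctl stop ntpd (or chronyd)

# ... run the date-manipulating test steps here ...

timedatectl set-ntp true       # restore synchronization afterwards
```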
Let me know what you want me to do. For now, I will remove the test from the runfiles.
@oshogbo zpool_scrub/zpool_scrub_date_range_001 test failed on almalinux8.
I think the problem with almalinux8 is finally solved. It was caused by using /dev/random instead of /dev/urandom, which resulted in empty files. The zinject tool injects a data error, but because there was no data, the second file wasn't detected as corrupted.
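In other words, the relevant part of the test now looks roughly like this (the zinject invocation is illustrative; the actual test uses the suite's helpers):

```sh
# Use /dev/urandom so dd never blocks or produces a short/empty file.
dd if=/dev/urandom of=/testpool2/0_file bs=1M count=1
zpool sync testpool2

# Inject errors into the file's data so the scrub has something to detect.
# Illustrative invocation only.
zinject -t data -e checksum -f 100 -a /testpool2/0_file
zpool scrub testpool2
```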
I also found an issue with timezone calculation, which has now been fixed.
Additionally, I changed the way we select the final time. Since we have three different groups of timestamps, we can't simply select the smallest TXG as the start date - doing so would always pick the one from the "lowest frequency group" (monthly). So instead, we still floor each group, but we now select the time that is closest overall. Hope that makes sense.
It looks like we had some unexpected failures in the centos-stream builders as well, which need to be looked at:
https://github.com/openzfs/zfs/actions/runs/16548417061/job/46985751681?pr=16853
Tests with results other than PASS that are unexpected:
FAIL cli_root/zpool_scrub/zpool_scrub_date_range_001 (expected PASS)