validation error: invalid replica, checksum mismatch
I set validation-interval: 12h, and I noticed that the most recent validation failed even though the one 12 hours earlier succeeded.
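For context, the relevant replica settings in my litestream.yml look roughly like this (the bucket URL is a placeholder; the database path matches the logs below):

dbs:
  - path: /app/data/store.db
    replicas:
      - url: s3://my-bucket/store.db
        validation-interval: 12h

Here are the log lines around the failure: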
2021-12-06T05:47:55.634Z,262d,,/app/data/store.db(s3): validator: status=ok db=65547bc672840f1a replica=65547bc672840f1a pos=9ed130e36a449e5a/00000014:0
...
2021-12-06T17:48:13.745Z,262d,,/app/data/store.db(s3): validator: status=mismatch db=7b6620fd7258d76e replica=34df88fca9fc85d1 pos=9ed130e36a449e5a/00000023:0
2021-12-06T17:48:13.745Z,262d,,"/app/data/store.db(s3): validation error: invalid replica, checksum mismatch"
Full logs: https://gist.github.com/mtlynch/be76d417116408e0d9eeba56b31e8439
This is running Litestream v0.3.7.
I can restore a snapshot with litestream on a separate machine, and both snapshots and generations look fine.
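For reference, the restore was along these lines; the output path is arbitrary, and the sqlite3 integrity check is just an extra sanity check on the restored copy:

$ litestream restore -config litestream.yml -o /tmp/store-restored.db data/store.db
$ sqlite3 /tmp/store-restored.db "PRAGMA integrity_check;"

And the listings: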
$ litestream snapshots -config litestream.yml data/store.db
replica  generation        index  size     created
s3       12529930f7626a77  0      1402513  2021-12-05T12:12:25Z
s3       1f969177165606d9  0      1399832  2021-12-04T18:16:24Z
s3       302db80a656c537a  0      1402442  2021-12-05T13:55:19Z
s3       5dfeb4aa03232553  24     1389251  2021-11-30T04:38:34Z
s3       60d81d9210b967b4  0      1401843  2021-12-04T17:54:40Z
s3       611fa8ec071bae4a  0      1401380  2021-12-04T11:40:02Z
s3       62cdebbae5ab4622  0      1401382  2021-12-04T11:37:55Z
s3       65abfd11d26de878  0      1402513  2021-12-05T11:26:51Z
s3       6fac9e84335e9f46  0      1401473  2021-12-04T19:15:28Z
s3       94b25e3853df6122  0      1402442  2021-12-05T13:14:15Z
s3       97f42f22de601e05  0      1401743  2021-12-04T18:49:00Z
s3       9cedd81af21fad10  0      1402513  2021-12-05T12:55:28Z
s3       9ec3b9e119197011  0      1402442  2021-12-05T14:32:43Z
s3       9ed130e36a449e5a  0      1401846  2021-12-05T17:47:20Z
s3       9ed130e36a449e5a  35     1401664  2021-12-06T17:47:22Z
s3       acc9b6b2e31d5781  0      1402098  2021-12-04T17:18:11Z
s3       ca9cf860627dfcd5  0      1401382  2021-12-04T11:21:01Z
s3       da232cf04e966748  0      1402195  2021-12-04T17:13:13Z
s3       f654f36a2cdf4953  0      1402098  2021-12-04T17:17:30Z
$ litestream generations -config litestream.yml data/store.db
name  generation        lag         start                 end
s3    12529930f7626a77  5h19m28s    2021-12-05T12:12:25Z  2021-12-05T12:12:26Z
s3    1f969177165606d9  22h54m16s   2021-12-04T18:16:24Z  2021-12-04T18:37:38Z
s3    302db80a656c537a  3h36m34s    2021-12-05T13:55:19Z  2021-12-05T13:55:20Z
s3    5dfeb4aa03232553  113h45m1s   2021-11-30T03:40:32Z  2021-11-30T23:46:53Z
s3    60d81d9210b967b4  23h18m43s   2021-12-04T17:54:40Z  2021-12-04T18:13:11Z
s3    611fa8ec071bae4a  24h21m9s    2021-12-04T11:40:02Z  2021-12-04T17:10:45Z
s3    62cdebbae5ab4622  29h52m56s   2021-12-04T11:37:55Z  2021-12-04T11:38:58Z
s3    65abfd11d26de878  6h5m0s      2021-12-05T11:26:51Z  2021-12-05T11:26:54Z
s3    6fac9e84335e9f46  6h37m52s    2021-12-04T19:15:28Z  2021-12-05T10:54:02Z
s3    94b25e3853df6122  4h17m37s    2021-12-05T13:14:15Z  2021-12-05T13:14:17Z
s3    97f42f22de601e05  22h22m56s   2021-12-04T18:49:00Z  2021-12-04T19:08:58Z
s3    9cedd81af21fad10  4h20m31s    2021-12-05T12:55:28Z  2021-12-05T13:11:23Z
s3    9ec3b9e119197011  4m47s       2021-12-05T14:32:43Z  2021-12-05T17:27:07Z
s3    9ed130e36a449e5a  -31h27m24s  2021-12-05T17:47:20Z  2021-12-07T00:59:19Z
s3    acc9b6b2e31d5781  23h58m28s   2021-12-04T17:18:11Z  2021-12-04T17:33:26Z
s3    ca9cf860627dfcd5  30h10m44s   2021-12-04T11:21:01Z  2021-12-04T11:21:10Z
s3    da232cf04e966748  24h18m40s   2021-12-04T17:13:13Z  2021-12-04T17:13:14Z
s3    f654f36a2cdf4953  24h14m23s   2021-12-04T17:17:30Z  2021-12-04T17:17:31Z
Is the negative lag on generation 9ed130e36a449e5a expected behavior?
I'll send you the cloud storage backup privately.
@mtlynch Thanks for the bug report. I can take a look at it more in depth tomorrow, but I believe it's a known issue with validation-interval. There seems to be a timing issue where running validation and snapshotting at the same time causes validation to report a mismatch, even though restoring the database doesn't show any corruption.
It looks like the snapshot occurred right at the same time as the validation:
2021-12-06T17:47:28.485Z,262d,,/app/data/store.db(s3): snapshot written 9ed130e36a449e5a/00000023
2021-12-06T17:47:28.485Z,262d,,/app/data/store.db(s3): restoring snapshot 9ed130e36a449e5a/00000000 to /tmp/416827390-litestream/replica.tmp
Validation performs a byte-for-byte comparison of the current local database against a fully restored copy, so it's strict about the restored snapshot matching exactly. I believe the timing issue is just that it compares against slightly different points in time, not that the data is actually corrupted. I'd be curious whether you continue to see validation errors for the same generation afterward. As a workaround, I typically offset the validation interval from the snapshot interval so it's something like 12h13m. Not a great solution, but it works OK for testing.
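In config terms, the workaround is just a slightly skewed validation-interval in litestream.yml; a minimal sketch (the bucket URL is again a placeholder):

dbs:
  - path: /app/data/store.db
    replicas:
      - url: s3://my-bucket/store.db
        validation-interval: 12h13m  # offset so validation doesn't coincide with snapshotting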
Also, I'd add that validation can get quite expensive, as S3 tends to charge for egress, so just watch your costs when using it.
Ah! That's a relief.
> As a workaround, I typically offset the validation interval from the snapshot interval so it's something like 12h13m. Not a great solution, but it works OK for testing.
Ah, that's a good idea. I'll add that.
> Also, I'd add that validation can get quite expensive, as S3 tends to charge for egress, so just watch your costs when using it.
Yep, I saw the warning in the docs, but I think I should be okay. I'm backing up to Backblaze, which includes 1 GB of free egress per day, and the generations in my bucket tend to be under 100 MB, so even two validation restores per day should stay well within the free tier.