validation error: invalid replica, checksum mismatch

Open: mtlynch opened this issue on Dec 07 '21 • 2 comments

I set validation-interval: 12h, and I noticed that the most recent validation failed even though the one 12 hours earlier succeeded:

2021-12-06T05:47:55.634Z,262d,,/app/data/store.db(s3): validator: status=ok db=65547bc672840f1a replica=65547bc672840f1a pos=9ed130e36a449e5a/00000014:0
...
2021-12-06T17:48:13.745Z,262d,,/app/data/store.db(s3): validator: status=mismatch db=7b6620fd7258d76e replica=34df88fca9fc85d1 pos=9ed130e36a449e5a/00000023:0
2021-12-06T17:48:13.745Z,262d,,"/app/data/store.db(s3): validation error: invalid replica, checksum mismatch"

Full logs: https://gist.github.com/mtlynch/be76d417116408e0d9eeba56b31e8439

This is running Litestream v0.3.7.
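
For reference, the relevant part of my litestream.yml looks roughly like this (the bucket name below is a placeholder, not my real value; only the validation-related settings are shown):

dbs:
  - path: /app/data/store.db
    replicas:
      - type: s3
        bucket: example-backup-bucket   # placeholder
        path: store.db
        validation-interval: 12h        # restore a copy and compare it to the local DB every 12h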

I can restore a snapshot with litestream on a separate machine, and both snapshots and generations look fine:

$ litestream snapshots -config litestream.yml data/store.db
replica  generation        index  size     created
s3       12529930f7626a77  0      1402513  2021-12-05T12:12:25Z
s3       1f969177165606d9  0      1399832  2021-12-04T18:16:24Z
s3       302db80a656c537a  0      1402442  2021-12-05T13:55:19Z
s3       5dfeb4aa03232553  24     1389251  2021-11-30T04:38:34Z
s3       60d81d9210b967b4  0      1401843  2021-12-04T17:54:40Z
s3       611fa8ec071bae4a  0      1401380  2021-12-04T11:40:02Z
s3       62cdebbae5ab4622  0      1401382  2021-12-04T11:37:55Z
s3       65abfd11d26de878  0      1402513  2021-12-05T11:26:51Z
s3       6fac9e84335e9f46  0      1401473  2021-12-04T19:15:28Z
s3       94b25e3853df6122  0      1402442  2021-12-05T13:14:15Z
s3       97f42f22de601e05  0      1401743  2021-12-04T18:49:00Z
s3       9cedd81af21fad10  0      1402513  2021-12-05T12:55:28Z
s3       9ec3b9e119197011  0      1402442  2021-12-05T14:32:43Z
s3       9ed130e36a449e5a  0      1401846  2021-12-05T17:47:20Z
s3       9ed130e36a449e5a  35     1401664  2021-12-06T17:47:22Z
s3       acc9b6b2e31d5781  0      1402098  2021-12-04T17:18:11Z
s3       ca9cf860627dfcd5  0      1401382  2021-12-04T11:21:01Z
s3       da232cf04e966748  0      1402195  2021-12-04T17:13:13Z
s3       f654f36a2cdf4953  0      1402098  2021-12-04T17:17:30Z

$ litestream generations -config litestream.yml data/store.db
name  generation        lag         start                 end
s3    12529930f7626a77  5h19m28s    2021-12-05T12:12:25Z  2021-12-05T12:12:26Z
s3    1f969177165606d9  22h54m16s   2021-12-04T18:16:24Z  2021-12-04T18:37:38Z
s3    302db80a656c537a  3h36m34s    2021-12-05T13:55:19Z  2021-12-05T13:55:20Z
s3    5dfeb4aa03232553  113h45m1s   2021-11-30T03:40:32Z  2021-11-30T23:46:53Z
s3    60d81d9210b967b4  23h18m43s   2021-12-04T17:54:40Z  2021-12-04T18:13:11Z
s3    611fa8ec071bae4a  24h21m9s    2021-12-04T11:40:02Z  2021-12-04T17:10:45Z
s3    62cdebbae5ab4622  29h52m56s   2021-12-04T11:37:55Z  2021-12-04T11:38:58Z
s3    65abfd11d26de878  6h5m0s      2021-12-05T11:26:51Z  2021-12-05T11:26:54Z
s3    6fac9e84335e9f46  6h37m52s    2021-12-04T19:15:28Z  2021-12-05T10:54:02Z
s3    94b25e3853df6122  4h17m37s    2021-12-05T13:14:15Z  2021-12-05T13:14:17Z
s3    97f42f22de601e05  22h22m56s   2021-12-04T18:49:00Z  2021-12-04T19:08:58Z
s3    9cedd81af21fad10  4h20m31s    2021-12-05T12:55:28Z  2021-12-05T13:11:23Z
s3    9ec3b9e119197011  4m47s       2021-12-05T14:32:43Z  2021-12-05T17:27:07Z
s3    9ed130e36a449e5a  -31h27m24s  2021-12-05T17:47:20Z  2021-12-07T00:59:19Z
s3    acc9b6b2e31d5781  23h58m28s   2021-12-04T17:18:11Z  2021-12-04T17:33:26Z
s3    ca9cf860627dfcd5  30h10m44s   2021-12-04T11:21:01Z  2021-12-04T11:21:10Z
s3    da232cf04e966748  24h18m40s   2021-12-04T17:13:13Z  2021-12-04T17:13:14Z
s3    f654f36a2cdf4953  24h14m23s   2021-12-04T17:17:30Z  2021-12-04T17:17:31Z
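
For completeness, this is roughly how I verified the restore on the other machine (the output path is just a scratch location, and the integrity check is only a quick sanity check):

$ litestream restore -config litestream.yml -o /tmp/store-restored.db data/store.db
$ sqlite3 /tmp/store-restored.db "PRAGMA integrity_check;"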

Is the negative lag on generation 9ed130e36a449e5a expected behavior?

I'll send you the cloud storage backup privately.

mtlynch • Dec 07 '21 01:12

@mtlynch Thanks for the bug report. I can take a look at it in more depth tomorrow, but I believe it's a known issue with validation-interval. There seems to be a timing issue where running validation and snapshotting at the same time causes validation to report a mismatch, even though restoring the database doesn't show any corruption.

It looks like the snapshot occurred right at the same time as the validation:

2021-12-06T17:47:28.485Z,262d,,/app/data/store.db(s3): snapshot written 9ed130e36a449e5a/00000023
2021-12-06T17:47:28.485Z,262d,,/app/data/store.db(s3): restoring snapshot 9ed130e36a449e5a/00000000 to /tmp/416827390-litestream/replica.tmp

Validation performs a byte-for-byte comparison of the current local database against a fully restored copy, so it is very strict about the two matching exactly. I believe the timing issue is just that the two sides are captured at slightly different points in time, not that the data is actually corrupted. I'd be curious whether you continue to see additional validation errors for the same generation afterward. As a workaround, I typically offset the validation interval from the snapshot interval so it's something like 12h13m. Not a great solution but works ok for testing.
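
In config terms that just means making sure the two intervals don't line up exactly; roughly something like this (the values and bucket name here are only an example):

dbs:
  - path: /app/data/store.db
    replicas:
      - type: s3
        bucket: example-backup-bucket     # example bucket name
        path: store.db
        snapshot-interval: 12h
        validation-interval: 12h13m       # deliberately offset so validation never coincides with a snapshot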

Also, I'd add that validation can get quite expensive, since S3 tends to charge for egress, so just watch your costs when using it.
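
To put rough numbers on it: each validation restores a full copy of the database from the replica, so egress is roughly database size × number of validations. A 1 GB database validated twice a day works out to around 60 GB of downloads per month; a database your size (snapshots around 1.4 MB) is negligible.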

benbjohnson • Dec 07 '21 01:12

Ah! That's a relief.

As a workaround, I typically offset the validation interval from the snapshot interval so it's something like 12h13m. Not a great solution but works ok for testing.

Ah, that's a good idea. I'll add that.

Also, I'd add that validation can get quite expensive, since S3 tends to charge for egress, so just watch your costs when using it.

Yep, I saw the warning in the docs, but I think I should be okay. I'm backing up to Backblaze, which includes 1 GB of free egress per day, and the generations in my bucket tend to be under 100 MB, so two validation restores a day should stay well within that limit.

mtlynch • Dec 07 '21 02:12