Read-replica VFS issues.
Environment
Litestream version: 0.5.0
Operating system & version: Debian GNU/Linux forky
Installation method: binary download, built from source
Storage backend: file
Bug Description
This may be way too early, but I was trying to "port" the (just released) read-replica VFS to my SQLite driver, and I've bumped into a few issues that may cause you pain down the road, so I decided to open this issue to discuss them.
If it is too early, feel free to ignore me and close the issue, I genuinely just wanted to help. Hopefully this reads "helpful" and not "demanding".
VFS Open
Open tries to handle all file types using the ReplicaClient. This may seem to mostly work but, depending on settings, SQLite will eventually try to use the VFS to open temporary files.
IIUC, for Litestream, it should be sufficient to open SQLITE_OPEN_MAIN_DB files and refuse everything else.
However, unless the user is instructed to configure SQLite to keep all temporary files in memory, SQLite will eventually try to open at least a SQLITE_OPEN_TEMP_JOURNAL file to handle large sorts (such as an ORDER BY over millions of records). Small tests won't surface this, though.
Also, Open should probably OR SQLITE_OPEN_READONLY into the flags it returns.
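To make that concrete, here's a minimal sketch of such an Open, assuming an interface in the spirit of SQLite's xOpen. The flag values are SQLite's standard constants, but the VFS/readOnlyFile types and errCantOpen are placeholders of mine, not Litestream's actual API:

```go
package vfs

import "errors"

// Standard SQLite open-flag values (from sqlite3.h).
const (
	openReadOnly = 0x00000001 // SQLITE_OPEN_READONLY
	openMainDB   = 0x00000100 // SQLITE_OPEN_MAIN_DB
)

var errCantOpen = errors.New("read-replica vfs: only main database files are supported")

// VFS and readOnlyFile are stand-ins for the real types.
type VFS struct{}

type readOnlyFile struct {
	vfs  *VFS
	name string
}

// Open refuses everything except the main database file; temp files,
// journals and WALs are not served from the replica. Users should set
// PRAGMA temp_store = MEMORY so SQLite never needs the VFS for temp files.
func (v *VFS) Open(name string, flags int) (*readOnlyFile, int, error) {
	if flags&openMainDB == 0 {
		return nil, 0, errCantOpen
	}
	// The replica is read-only, so report that back to SQLite.
	return &readOnlyFile{vfs: v, name: name}, flags | openReadOnly, nil
}
```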
Cancel polling replication
This is probably an oversight: wg, ctx, and cancel are all there to handle this, but they're unused.
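For what it's worth, a sketch of how those three could be wired up so Close actually stops the poller (the VFSFile struct and method names here are placeholders, not the real vfs.go):

```go
package vfs

import (
	"context"
	"sync"
	"time"
)

// VFSFile is a stand-in for the real file type; only the fields needed
// to manage the background poller are shown.
type VFSFile struct {
	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

func (f *VFSFile) pollReplicaClient(ctx context.Context) error { return nil } // placeholder

// startPolling runs pollReplicaClient periodically until Close is called.
func (f *VFSFile) startPolling(interval time.Duration) {
	f.ctx, f.cancel = context.WithCancel(context.Background())
	f.wg.Add(1)
	go func() {
		defer f.wg.Done()
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-f.ctx.Done():
				return
			case <-t.C:
				_ = f.pollReplicaClient(f.ctx)
			}
		}
	}()
}

// Close cancels the context and waits for the poller to exit.
func (f *VFSFile) Close() error {
	f.cancel()
	f.wg.Wait()
	return nil
}
```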
Handle sub-page reads
Since you're not returning SQLITE_IOCAP_SUBPAGE_READ, SQLite 3.48.0 and above will mostly avoid sub-page reads (otherwise, overflow BLOBs would use them). However, SQLite will still use sub-page reads for the database header in the first page, and your code here is buggy.
SQLite will read 100 bytes at offset 0 to load the entire header and, when in rollback mode, will read 16 bytes at offset 24 to check for changes.
When reading the first page of the database, you're setting the file format version numbers to 1, which forces rollback mode; that may well be one way of handling a read-only VFS that gets updates from other writers.
However, when SQLite reads the 16 bytes at offset 24, you're returning the wrong data. This line doesn't copy the correct bytes: https://github.com/benbjohnson/litestream/blob/4e7ca3d82c2c018ac20a4af385af3a3fd9e52183/vfs.go#L193
You'd need to update it to n = copy(p, data[off%4096:]) to "fix" that, but if you try, you'll start missing updates, because the file change counter is not updated on every transaction in WAL mode (whereas in rollback mode it is expected to be).
Correctly handling updates
My suggestion for a simple fix for the above issue is this:
```go
// Update the first page to pretend like we are in journal mode,
// and there were changes to the database.
if pgno == 1 {
	data[18], data[19] = 0x01, 0x01
	rand.Read(data[24:28])
}
n = copy(p, data[off%4096:])
f.logger.Info("data read", "n", n, "data", len(data))
```
However, this means SQLite will drop its entire page cache at the start of every transaction, which is sub-optimal if there are no changes. You should probably only update the counter when there are actual changes.
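As a rough sketch of that idea (the position and counter fields are assumptions about how one might track it, not Litestream's code), only bump bytes 24..27 when the poll actually advanced the replica:

```go
package vfs

import "encoding/binary"

type replicaFile struct {
	pos         int64  // replica position as of the last poll
	lastSeenPos int64  // position when the counter was last bumped
	changeCount uint32 // synthetic file change counter
}

// fixupPage1 patches the header of page 1 before returning it to SQLite.
func (f *replicaFile) fixupPage1(data []byte) {
	// Pretend to be in rollback-journal mode so SQLite re-checks the
	// change counter (bytes 24..27) at the start of every transaction.
	data[18], data[19] = 0x01, 0x01

	// Only bump the counter when the replica actually advanced, so an
	// unchanged database doesn't invalidate SQLite's page cache.
	if f.pos != f.lastSeenPos {
		f.changeCount++
		f.lastSeenPos = f.pos
	}
	binary.BigEndian.PutUint32(data[24:28], f.changeCount)
}
```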
Correctly handling locking, "concurrent" writes
There's another issue where I probably don't understand Litestream and LTX well enough to help much: how do you plan to implement locking?
IIUC, with the above fix things will probably work with small tests (which update a single page), but will start failing at anything more complicated.
In theory the correct way to do things is "snapshot" the database when Lock(LockShared) is called, and ensure all reads are satisfied from that "snapshot" until Unlock(LockNone) is called. Between those, maybe you don't even need to "poll" the replica?
I tried to only call pollReplicaClient() on Lock(LockShared) (and removed the background goroutine) and it seems to work fine. Then I can also update the change counter only when itr.Next() returns something, remove mutexes, etc.
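Roughly what I mean, as a sketch (the lock levels and fields are mine, mirroring SQLite's locking states, not the actual implementation):

```go
package vfs

import "context"

// Lock levels, mirroring SQLite's NONE/SHARED/... locking states.
type lockLevel int

const (
	lockNone lockLevel = iota
	lockShared
)

type replicaFile struct {
	lock     lockLevel
	maxTXID  uint64 // highest TXID known to the index
	readTXID uint64 // TXID the current read transaction is pinned to
}

func (f *replicaFile) pollReplicaClient(ctx context.Context) error { return nil } // placeholder

// Lock refreshes the index exactly once, when a read transaction starts,
// and pins the TXID that all reads must be served from.
func (f *replicaFile) Lock(level lockLevel) error {
	if f.lock == lockNone && level >= lockShared {
		if err := f.pollReplicaClient(context.Background()); err != nil {
			return err
		}
		f.readTXID = f.maxTXID
	}
	f.lock = level
	return nil
}

// Unlock ends the read transaction; no polling is needed in between,
// because reads must stay consistent with readTXID anyway.
func (f *replicaFile) Unlock(level lockLevel) error {
	f.lock = level
	return nil
}
```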
I applied the above changes to my version, ~and everything seems to work fine, but it needs more serious testing~.
Actually, I'm having problems with compactions: after a compaction I get "missing LTX files" errors, and it seems I need to rebuild the index from scratch to recover.
I also tried to avoid assuming 4096 as the page size, and read it from the database header instead, which seems to work. But litestream 0.5 seems broken for page sizes other than 4096 (the database doesn't replicate at all with the CLI). Is that expected?
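For reference, reading the page size from the standard database header looks roughly like this (a sketch, not my exact code): it's a big-endian uint16 at offset 16, with the value 1 encoding 65536.

```go
package vfs

import "encoding/binary"

// pageSize extracts the page size from a SQLite database header.
func pageSize(hdr []byte) int {
	ps := int(binary.BigEndian.Uint16(hdr[16:18]))
	if ps == 1 {
		return 65536 // the header encodes 65536 as 1
	}
	return ps
}
```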
The current approach seems fundamentally problematic when used to open a database backed by a continuously updating snapshot.
The v0.5.0 code tries to update the index in the background, and always reads the most recent version of a page. This won't corrupt your data because it's a read-only process, but it's a great way to run into the issues described for immutable in any non-trivial usage:
Caution: Setting the immutable property on a database file that does in fact change can result in incorrect query results and/or SQLITE_CORRUPT errors. See also: SQLITE_IOCAP_IMMUTABLE.
Changing the logic to only update the index when a transaction starts fixes this, but the way the index is being updated in pollReplicaClient seems flawed under compaction. You quickly get "missing LTX files" errors that you can't recover from.
Instead, I'm rebuilding the index from scratch whenever there are updates (I'm not sure what the optimal way of doing this is), but even then I'm not sure this is enough.
The only solution, IMO, is to decide how long you want to allow transactions to last (e.g. 10s) and then ensure you keep L0 files for at least that long. You can either choose to keep them a bit longer after they're compacted, or you can decide never to compact the most recent few files. Both are probably OK. But deleting a file that was created a couple of seconds ago because it's already been compacted is not OK.
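A sketch of the first option, keeping compacted L0 files around for a grace period (the names and the fileInfo type are made up, not Litestream's):

```go
package retention

import "time"

// maxTxDuration is how long a read transaction is allowed to last; L0
// LTX files younger than this are never deleted, even once compacted.
const maxTxDuration = 10 * time.Second

type fileInfo struct {
	CreatedAt time.Time
	Compacted bool // already merged into L1
}

// canDeleteL0 reports whether a compacted L0 file is safe to delete:
// it must be compacted *and* older than the grace period, so no
// in-flight transaction can still be pinned to it.
func canDeleteL0(info fileInfo, now time.Time) bool {
	return info.Compacted && now.Sub(info.CreatedAt) >= maxTxDuration
}
```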
Expanding a bit on the issue: the retention policy, as implemented, is at odds with the VFS in (at least) two ways.
DB.Compact deletes L0 LTX files as soon as it performs an L0->L1 compaction:
https://github.com/benbjohnson/litestream/blob/4e7ca3d82c2c018ac20a4af385af3a3fd9e52183/db.go#L1382-L1390
A possible solution, IIUC, is to only delete up to minTXID (which matches the comment, BTW).
Store.EnforceSnapshotRetention deletes L1->LN for all snapshots that are kept:
https://github.com/benbjohnson/litestream/blob/4e7ca3d82c2c018ac20a4af385af3a3fd9e52183/store.go#L240-L262
Again, the solution here is that DB.EnforceSnapshotRetention should return something else for minSnapshotTXID. This is a bit harder to figure out, but IIUC you probably want the highest info.MinTXID kept (not the lowest info.MaxTXID kept, as you're doing)?
To have a rolling “24 hours” of transactions to restore to, ISTM you want to keep the previous day, not throw away everything but the current day. This retains more files, but is fundamental to avoiding race conditions.
When SQLite starts a transaction, it wants all reads to go to a consistent “view” of the database, in LTX terms, against a TXID. Obviously, a transaction can't last forever, but even an arbitrarily short 1s (or less) transaction can be started against a TXID that lands in the middle of an L0->L1 compaction, and even if you updated the index in the middle of the transaction, reading the compacted L1 pages must fail. Similarly, if you start your transaction just before a snapshot is made, you may lose your LN TXID.
This also seems to cause issues with restores, as you start with a snapshot and then “append” lower level LTXs to it, but if you compact/snapshot after producing the list but before reading the pages, they may be gone already.
@ncruces Thanks for submitting the report. The VFS is very much a POC and is not ready for prime time yet. There's a lot to be done around managing retention enforcement and refetching new files as they become available.
@benbjohnson I understand that. My intent was only to help that along, and share my experience, as I've created a few reasonably sophisticated SQLite VFSes. Not complaining/demanding in any way!
My version is here, and mostly works fine. There are things to improve (around multiple connections, and potentially a page cache), but it should be correct.
However, the VFS will experience transient failures because of how Litestream currently handles retention (it deletes files too early), which I think also affects restoring a replica while a master is live. #786 should mostly fix that, but it would also help if two snapshots were kept (instead of just one). Without those two changes, ISTM the VFS will always experience rare transient read issues.
Feel free to close this at your discretion!
If I could make an additional request, @benbjohnson, it'd be nice if the code the VFS needs didn't live in a package that blank-imports your SQLite driver.
Anyone who uses either my driver, or even a cgo driver (like mattn's), will end up importing all of modernc and registering that driver at runtime, because the litestream package does this. Even if I copy the code from litestream (which I can easily do), the same will happen if I import litestream/s3 (etc.), because those packages import litestream.
Otherwise, my only option is to fork the project and create a minimal library that only restores (hopefully that doesn't depend on any SQLite driver/bindings at all).
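To illustrate the mechanism: a blank import like the one below, anywhere the VFS code can reach, forces every consumer to compile and link modernc.org/sqlite and runs its init(), which registers the "sqlite" database/sql driver at startup (this snippet is illustrative, not a quote from litestream):

```go
// Illustrative only.
package example

import _ "modernc.org/sqlite" // registers the "sqlite" driver with database/sql at init time
```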