sqlite3.dart
sqlite3.dart copied to clipboard
Web implementation: SqliteException(10): while executing statement, disk I/O error, disk I/O error (code 10)
When our users use multiple tabs at the same time, we get lots of error reports like this:
SqliteException(10): while executing statement, disk I/O error, disk I/O error (code 10)
coinciding with hangs in our web app, in the form of DB queries not completing (and most if not all of our users are using the opfsLock implementation).
Do you know how we can investigate the root cause of this?
(we're using the latest Drift version)
Sorry for the slow response. To diagnose this further, it would be good to confirm whether this appears to corrupt the database in any way (e.g. if you close the other tabs, is it functional again or did we mess up locks causing permanent problems).
I will also try to reproduce this independently, my best guess at the moment is that there's an issue with locks possibly being held to long (but that should indicate a locked error instead of a generic disk I/O error).
No problem at all.
We'll investigate on our end to get a clearer picture of what makes the DB work again and what doesn't, since up to now people have been doing all sorts of things to mitigate, like clearing all cache, or restarting the browser.
Also getting these sorts of errors from time to time:
SqliteException(11): while selecting from statement, database disk image is malformed, database disk image is malformed (code 11)
Causing statement: SELECT * FROM "inbox_data" WHERE "discussion_id" = ?;
On a client whose logs included
Using WasmStorageImplementation.sharedIndexedDb
Even more intriguingly, the sort of query that this is means that the database has successfully opened before, and the app was just doing it's normal operations. Also worthy of note, to detect these sorts of errors beforehand, right after opening the database and before signaling it as ready to be used, I perform a SELECT * FROM X LIMIT 1 on each table to make sure the database is really readable, and delete+recreate if that fails. This means that the first read worked on the table, and after some time, it stopped working, saying it's malformed.
I'm beginning to wonder how safe sharedIndexedDB is to multi-tab usage (we've had to move away from opfsLock because of issues the security headers were causing). Do I have to do a coordination layer to prevent multiple tabs from using the same database at the same time?
I'm quite a bit perplexed, because a non-trivial number of users are reporting that the app becomes unresponsive (with corresponding error reports related to SQLite) and I'm having a hard time understanding how I can make our app resilient to IO failures at that level.
I'm beginning to wonder how safe sharedIndexedDB is to multi-tab usage (we've had to move away from opfsLock because of issues the security headers were causing)
That shouldn't be necessary, sharedIndexedDb will host the database in a single shared worker that all tabs connect to. So the database is not actually opened multiple times.
One caveat with IndexedDB is that it's asynchronous, whereas sqlite3 expects a synchronous file system. To work around this issue, we're loading the entire database into memory when opening the database and we then start issuing writes asynchronously after updating a chunk in-memory. This generally works, but is really unsafe if any other tab is opening the database outside of the shared worker - but if they're all using sharedIndexedDb, that shouldn't be an issue unless you're sometimes opening the database in a different way.
I'm also not understanding the corruption issue (especially since you're checking the table on startup) - since the entire copy is in-memory, it seems unlikely that a corruption is possible there. Even if that took place due to writes from different tabs, they'd only see the corruption after re-opening the database, not in the middle of a connection.
I wonder how it works when 2 tabs are on the same web page on Android Chrome where shared workers are not supported . I guess we could expect that only one tab is active at a time but is there a way to 'reload' in memory for other workers if the content is modified in one tab and loaded in another?
For Chrome on Android it's a problem indeed, but sharedIndexedDb wouldn't get chosen there, it would be unsafeIndexedDb.
There's no reload functionality at the moment, but if you use an explicit synchronization wrapper e.g. with the web locks API, you could explicitly close and re-open the database to make it fetch the data again. That's probably not any less efficient that a full reload would be either way.
Just to add more info, I'm also seeing:
SqliteException(26): while executing, file is not a database, file is not a database (code 26)
Causing statement: CREATE INDEX IF NOT EXISTS processed_operations_per_db_idx ON processed_operations(operation_id DESC, db_name);, parameters:
I have the same error.
Do you have some suggestion how to work around that?
From what I had read above, when all tabs would use sharedIndexedDb it would be ok, but I cannot find a way to specify this.
@Musta-Pollo How are you opening the database on the web?
I am opening it like this. I decided to use new DB file for each tab. That is enough for my current demo:
static QueryExecutor _openConnection() {
// `driftDatabase` from `package:drift_flutter` stores the database in
// `getApplicationDocumentsDirectory()`.
// Currently having multiple tabs open in the same browser is not supported.
// So it is important to have unique database names for each tab.
final dbName = UniversalPlatform.isWeb
? 'db_name ${const Uuid().v4()}'
: 'my_database2';
return driftDatabase(
name: dbName,
web: DriftWebOptions(
sqlite3Wasm: Uri.parse('sqlite3.wasm'),
driftWorker: Uri.parse('drift_worker.js'),
onResult: (result) {
if (result.missingFeatures.isNotEmpty) {
debugPrint(
'Using ${result.chosenImplementation} due to unsupported '
'browser features: ${result.missingFeatures}',
);
}
},
),
);
}
But is there something else I can do?
Viz this:
https://github.com/Musta-Pollo/custom_supabase_drift_doc_sync/blob/e184110ac0c016548d0f9cfe7cf6fff3c3596532/lib/db/database.dart#L108C3-L133C4
Do you see what result.chosenImplementation is when the error happens?
Here is the chosenImplementation:
Using WasmStorageImplementation.opfsLocks due to missing browser features: {MissingBrowserFeature.dedicatedWorkersInSharedWorkers}
I am using Arc Browser, which is based on Chromium.
I can reliably reproduce this using opfsLocks by having two tabs simultaneously try to write at the same time.
Enabling debug logs in the worker indicates the cause is lock contention
21:17:08.709 Could not obtain sync handle (attempt 1 / 6) sqlite.worker.dart.js:4116:78
21:17:08.709 Could not obtain sync handle (attempt 2 / 6) sqlite.worker.dart.js:4116:78
21:17:08.709 Could not obtain sync handle (attempt 3 / 6) sqlite.worker.dart.js:4116:78
21:17:08.709 Could not obtain sync handle (attempt 4 / 6) sqlite.worker.dart.js:4116:78
21:17:08.709 Could not obtain sync handle (attempt 5 / 6) sqlite.worker.dart.js:4116:78
21:17:08.710 Caught VfsException(10) while handling WorkerOperation.xWrite(Instance of 'minified:J')
Looking at the code this seems like an inevitability if two instances perform writes at the same time, they will both race to perform the write and one of them will fail to acquire the file. Ultimately this seems like an inherent limitation of the opfsLocks approach, as file handles are held by idle workers for up to 150ms, it is highly unlikely that the retry loop will retry for long enough.
I can see a few solutions, although I'm far from an expert and so really just spitballing here:
Backoff And Retry
The worker could sleep for 150ms before trying again, this is simple and would improve the likelihood of success, but ultimately doesn't really fix the issue, just makes it more unlikely
Navigator Locks
More of an application-level workaround, but applications could use Navigator.locks to ensure only one instance of the application is "active" at a given time, potentially with a BroadcastChannel to provide the ability to make the currently focused pane "active" - a la WhatsApp Web.
In theory this library could add an AccessMode.Exclusive that just opens the file handle in a dedicated worker and accepts that this will error if the same database is opened in multiple tabs concurrently.
Shared Worker Coordination
The model that has a shared worker spawn a dedicated worker that owns the database works flawlessly. Unfortunately it only works on Firefox. However, I think there is a way we can get something similar on all bar chrome for android (which doesn't currently support SharedWorker).
The basic idea is for each tab to communicate with a shared worker, in much the same way as the aforementioned throughSharedWorker, however, rather than having the shared worker spin up a dedicated worker, instead every tab spins up their own dedicated worker that can host databases and forwards the port to the sharedWorker.
Each dedicated worker then sits on Navigator.locks to elect a leader. On acquiring the lock, the dedicated worker notifies the shared worker that it is the leader, and the shared worker routes messages to it.
The big problem with this approach is what happens if the leader worker is terminated whilst a request is in-flight from another tab. I don't think the sharedWorker can blindly retry as we have no way of knowing if the operation completed or not, and so I think it will have to return an error to the calling application which will need to retry the transaction.
Edit: The new read-write-unsafe mode added in https://github.com/simolus3/sqlite3.dart/commit/8e998befee7d289cc1bb14051b4a261f9b99a2ab#diff-75ac4334395ade2885cad3dd011d6285050d7a515a644e8360ed0e03ef7df8e0 also avoids this issue, although this doesn't help safari.
Thanks for the investigation! Are you using package:sqlite3 directly? In the latest version of package:sqlite3_web, I've added feature-detection to use readwrite-unsafe if it's available. That version also adds navigator locks around using the database for both rw-unsafe (you really shouldn't use rw-unsafe without external locks, we don't implement xLock on the web, so races could lead to corruption) and the SharedArrayBuffer implementation.
We can't use navigator locks in package:sqlite3 itself because access to the database is represented through synchronous APIs in that package.
I can confirm in the latest version my reproducer no longer works 🎉
That being said I am struggling to see precisely why, whilst it does now wrap the individual operations in locks, AFAICT nothing stops the VFS from continuing to hold the file open once the lock has been released. Is it possible it is just the latency of WebLocks that is masking the issue?
AFAICT nothing stops the VFS from continuing to hold the file open once the lock has been released
We do that in _releaseImplicitLocks(), called at the end of some operations or if the worker times out waiting for a request. So it does look like the WebLocks delay is helping here, but it's still surprising given that the 150ms timeout is quite long.
having two tabs simultaneously try to write at the same time
To be fair, SQLite also does an xUnlock at the end of the write and we'll close the handle there. So I kind of understand why this fixes the issue with concurrent writes, but I imagine it could still be an issue with reads (where we have an implicit lock that only gets closed asynchronously). So I think adding a short delay between attempts to open the sync access handle would still be an improvement.