Improve management of iterator returned by `collect` (`database table is locked` error)
Action Items
- [ ] We should check why read-only iterator locks the data for other data chains (it fails on creating tables for UDF for example, or removing tables on cleanup - so, on updating system tables on SQLite). Can we make these read-only stuff a non-blocker in the first place
- [ ] We need to make it a context manager (so that it close itself and release database underneath)
- [ ] We can see if we can improve the error message (database is locked -> "iterator is open" or something)
Description
Related to https://github.com/iterative/datachain-examples/issues/15
Simple way to reproduce:
from datachain import DataChain
ds = DataChain.from_values(fib=[1, 1, 2, 3, 5, 8])
itr = ds.collect("fib")
print(next(itr))
ds = DataChain.from_values(fib=[1, 1, 2, 3, 5, 8])
ds.show()
Causes something like (when you run it within a test suite at least, since it is causing DB cleanup, but the same can be imitated by let's say making the second part a bit more complicated - with UDF):
> result = self.db.execute(*self.compile_to_args(query))
E sqlite3.OperationalError: database table is locked: sqlite_master
/Users/ivan/Projects/dvcx/src/datachain/data_storage/sqlite.py:199: OperationalError
It's caused (I think) by us using:
with super().select(*db_signals).as_iterable() as rows:
...
yield
It is an open connection / cursor staying underneath. It affects probably also other DataChains since we are reusing the same DB.
Workaround
Obvious workaround - use limit (+ offset if needed) to make sure that it's a single (or whatever number of elements is needed) and / or list(...) or iterate through them.
We should check why read-only iterator locks the data for other data chains (it fails on creating tables for UDF for example, or removing tables on cleanup - so, on updating system tables on SQLite). Can we make these read-only stuff a non-blocker in the first place
It doesn't seem like any SQLite connections are opened in read-only mode. They all take both the read and write locks.
I had a look at introducing a read method in the SQLiteDatabaseEngine that would look something like this:
@retry_sqlite_locks
def read(self, query):
with sqlite3.connect(
"file::memory:?mode=ro&nolock=1&cache=shared",
uri=True,
detect_types=DETECT_TYPES,
) as connection:
return connection.execute(*self.compile_to_args(query))
Unfortunately, that doesn't seem to work (at least running as a test), perhaps because you can't open the in-memory database in read-only mode.
We need to make it a context manager (so that it close itself and release database underneath)
Doesn't this go against the decision that we made earlier to avoid using a session as a context manager?
Doesn't this go against the decision that we made earlier to avoid using a session as a context manager?
it's not about session, it's only about this Iterator returned by collect - it should be closable if we can't make DB work in a mode where there is no lock.