
Race condition between query store and cache clearing

Open jc-harrison opened this issue 3 years ago • 13 comments

Describe the bug Query.store() constructs the query SQL ~when the query is submitted to the queue~ (via _make_query()). This takes into account the stored state of dependencies. If a dependency is removed from the cache before the queued query is executed, the query will attempt to access a no-longer-existent cache table.

Edit: I was mistaken - _make_query() is already called just before executing the returned SQL (in write_query_to_cache()), not when the query is submitted to the queue. So the race condition will occur only if the dependency is removed from cache after the call to dependency.get_query() (within parent._make_query()), but before the SQL returned by parent._make_query() is executed.

Product flowmachine

Version 1.17.1

Expected behavior A query should not attempt to use results that have been removed from cache. ~Presumably this means we need to call _make_query() in write_query_to_cache(), not in Query.to_sql().~

Additional context I've also found that queries sometimes get stuck in "executing" state with no SQL actually running (and no error, either in flowmachine or flowdb), and this appears to happen at the same time as a cache-clearing task runs. I suspect this is a related issue, but I haven't managed to pin it down yet.

jc-harrison avatar Sep 20 '22 08:09 jc-harrison

Smarter queue would be good really, because you probably don't want to be chucking stuff out of the cache if there's something about to use it. I suppose one could grant eviction immunity to anything currently in the queue as well.

greenape avatar Sep 27 '22 12:09 greenape

I suppose one could grant eviction immunity to anything currently in the queue as well.

I think that's probably a good idea, although it won't directly resolve this issue - dependencies will be out of the queue before queries that depend on them are executed.

jc-harrison avatar Sep 30 '22 10:09 jc-harrison

Ah, this is going to be tricky, isn't it, because we rely on get_query being blocking. So you can imagine a situation where there are two calls to get_query: one returns, we block on the second, and we block for sufficiently long that the first query has actually been evicted from cache, so the resulting SQL is now wrong. Hmm.

greenape avatar Sep 30 '22 10:09 greenape

So you need to track everything currently in use, and prevent it from being evicted.

greenape avatar Sep 30 '22 10:09 greenape

Ah yes, you're right. From a quick glance I initially thought we were calling self._make_query() when the query is enqueued rather than when it's executed, but that's not the case. The actual issue is more subtle - as you say, one get_query might block long enough that a previous one gets evicted from cache.

I think there are two (separate but related) things we'd like here:

  1. (Essential): A query must not be evicted from cache if something else has called get_query on it but not yet executed the resulting SQL
  2. (Desirable): A query should not be evicted from cache if another query in the queue would use its result

I think the first could be achieved by taking out a "do not evict" lock on a query when getting its SQL (I expect the fiddly bit will be ensuring the lock is cleared once the SQL is no longer required). The second could perhaps be achieved through a similar mechanism when enqueuing a query, but I haven't fully got my head around the consequences of that - I think I need to draw a diagram.
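The first mechanism could be sketched as a reference-counted "do not evict" registry, with the lock released in a finally block so the fiddly cleanup happens even on failure. This is a minimal illustration only, not FlowKit code; the names protect_from_eviction, evictable, and _in_use are invented:

```python
import threading
from contextlib import contextmanager

# Hypothetical "do not evict" registry: query_id -> reference count.
_in_use = {}
_in_use_lock = threading.Lock()

@contextmanager
def protect_from_eviction(query_id):
    """Mark a query's cache table as in use until the SQL built
    from it has been executed."""
    with _in_use_lock:
        _in_use[query_id] = _in_use.get(query_id, 0) + 1
    try:
        yield
    finally:
        # The fiddly bit: this must run even if execution fails,
        # or the table would be immune from eviction forever.
        with _in_use_lock:
            _in_use[query_id] -= 1
            if _in_use[query_id] == 0:
                del _in_use[query_id]

def evictable(query_id):
    """Cache cleanup would consult this before dropping a table."""
    with _in_use_lock:
        return query_id not in _in_use
```

Reference counting matters because two concurrent get_query callers may protect the same dependency; the table only becomes evictable again once both have finished.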

jc-harrison avatar Sep 30 '22 11:09 jc-harrison

I think #2 isn't actually guaranteed to be a good idea tho, right? The queue might be very long, it might not be sensible to hold on to the cache.

I feel like, absent a smarter queue, what you want is to have a lock emanating from cache cleanup, so that you can't construct a query while it is running, and to construct the sql immediately on executing, which would let PG take care of ensuring the tables will all still exist because they'll get locked.

greenape avatar Sep 30 '22 11:09 greenape

I think #2 isn't actually guaranteed to be a good idea tho, right? The queue might be very long, it might not be sensible to hold on to the cache.

I suppose that's true. Although the converse is also true - if we store a query because it's a dependency of another query that ends up a long way behind in the queue, and then kick it out of cache before anything uses it, then we've wasted effort running a query for no reason. Ideally we want a smarter store_dependencies that only caches a dependency if doing so is expected to make the overall calculation more efficient - ~if we had that, then I think we'd always want to make sure dependencies don't get kicked out of cache before they're used by their enqueued parents~.

(After writing that I realised it's not necessarily true - if two queries a long way apart in the queue share a dependency, it may sometimes be preferable to store the dependency once, use it in the first query, then remove it from cache before storing it again to use in the second query. But then we'd still ideally want to make sure that (a) a query doesn't get evicted before at least one of the things that asked for it has used it, and (b) the evicted dependency gets re-stored before the next query tries to use it.)

We've been discussing moving from the current system - considering only the individual query when it comes to cache eviction - to considering the full dependency tree of queries in the queue, but I think the ideal solution would separately consider the dependency tree for each call to Query.store. That's a lot more complicated, though, and prone to unexpected tripping-up.

I feel like, absent a smarter queue, what you want is to have a lock emanating from cache cleanup, so that you can't construct a query while it is running, and to construct the sql immediately on executing, which would let PG take care of ensuring the tables will all still exist because they'll get locked.

A lock emanating from cache cleanup won't help if get_query has already been called (but the resulting SQL not yet executed) before the cache cleanup starts, though. I think a lock emanating from get_query is probably cleaner (although if we can find a way to do things a bit differently such that postgres automatically handles that locking, that would be preferable).

jc-harrison avatar Sep 30 '22 12:09 jc-harrison

We can actually lock tables ourselves, which is a possible solution. My worry with locking from get_query is getting things stuck if the managing instance goes away for any reason. Locking from the cache cleanup is viable only if you can do the get_query and execute as an atomic op, which I think may not be doable for us.
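As a sketch of the explicit-locking idea: a transaction that takes ACCESS SHARE locks on the dependency cache tables will block any concurrent DROP TABLE (which needs ACCESS EXCLUSIVE) until the transaction commits. A hypothetical helper that just builds the LOCK statements to run at the start of the executing transaction - the helper name and table names are illustrative, not part of flowmachine:

```python
def locking_preamble(dependency_tables):
    """Build LOCK TABLE statements to run inside the transaction
    that executes a query's SQL. ACCESS SHARE conflicts with the
    ACCESS EXCLUSIVE lock that DROP TABLE needs, so cache cleanup
    blocks until the transaction commits rather than dropping a
    table that is in use. (Sketch only.)"""
    return [
        f"LOCK TABLE {tbl} IN ACCESS SHARE MODE;"
        for tbl in dependency_tables
    ]
```

Note this only protects tables for the duration of one transaction, which is why it requires constructing the SQL and executing it in the same transaction - it does nothing for SQL built earlier and executed later.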

greenape avatar Sep 30 '22 13:09 greenape

I was looking at this again - the window is actually much smaller than I'd initially thought, because _make_sql isn't called until https://github.com/Flowminder/FlowKit/blob/master/flowmachine/flowmachine/core/cache.py#L191

greenape avatar Oct 14 '22 14:10 greenape

I was looking at this again - the window is actually much smaller than I'd initially thought, because _make_sql isn't called until https://github.com/Flowminder/FlowKit/blob/master/flowmachine/flowmachine/core/cache.py#L191

Indeed. The window could potentially still be quite long though, if _make_sql blocks on two query.get_query() calls in succession and the second one takes a long time to complete.

jc-harrison avatar Oct 17 '22 09:10 jc-harrison

Additional context I've also found that queries sometimes get stuck in "executing" state with no SQL actually running (and no error, either in flowmachine or flowdb), and this appears to happen at the same time as a cache-clearing task runs. I suspect this is a related issue, but I haven't managed to pin it down yet.

I believe this issue is due to a deadlock that should be resolved by #5520. I think the deadlock occurs like this:

  1. Query A is stored
  2. Query B (which has query A as a sub-query) begins executing
  3. Independently, a cache-clearing task begins and tries to drop the cache table corresponding to query A. State of query A changes to "resetting"
  4. Cache-clearing task is blocked by a pg lock on query A's cache table (due to the in-progress "store query B" transaction currently reading from this table), so is forced to wait for the query-store transaction to finish
  5. Query B finishes executing, and moves onto writing the cache metadata (transaction is still open)
  6. The SQL string for query B (to be written into the cache record) is reconstructed by calling _make_query again (with the DB transaction still open, idle in transaction). This calls get_query() on query A, which blocks until query A finishes resetting.
  7. But query A cannot finish resetting until the "store query B" transaction closes, so we're deadlocked.

By removing the blocking call to _make_query within the transaction, #5520 should resolve this. But this doesn't resolve the race condition described in this issue.
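Steps 4-7 above form a classic wait-for cycle: the "store query B" transaction waits on query A's reset, while the reset waits on the transaction's table lock. A toy wait-for-graph sketch (node names are illustrative labels for the two waiters, not real FlowKit objects) makes the deadlock condition explicit:

```python
def find_cycle(wait_for):
    """Return True if the wait-for graph (mapping each waiter to
    the things it waits on) contains a cycle, i.e. a deadlock."""
    def visit(node, seen):
        if node in seen:
            return True
        return any(visit(nxt, seen | {node}) for nxt in wait_for.get(node, []))
    return any(visit(n, set()) for n in wait_for)

waits = {
    # Step 6: rebuilding query B's SQL calls get_query() on A,
    # which blocks until A finishes resetting.
    "store_query_B_txn": ["reset_query_A"],
    # Step 4: dropping A's table is blocked by the pg lock held
    # by the open "store query B" transaction.
    "reset_query_A": ["store_query_B_txn"],
}
```

Removing the first edge - which is what taking the blocking _make_query call out of the transaction does - breaks the cycle.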

jc-harrison avatar Oct 17 '22 10:10 jc-harrison

I was looking at this again - the window is actually much smaller than I'd initially thought, because _make_sql isn't called until https://github.com/Flowminder/FlowKit/blob/master/flowmachine/flowmachine/core/cache.py#L191

Indeed. The window could potentially still be quite long though, if _make_sql blocks on two query.get_query() calls in succession and the second one takes a long time to complete.

Potentially also exacerbated by a small connection pool - if _make_sql returns the SQL but then there's a long wait to obtain a connection to actually run the SQL, that long wait increases the window for a cache-drop task to jump in and drop some required cache tables.

jc-harrison avatar Oct 17 '22 10:10 jc-harrison

I feel like, absent a smarter queue, what you want is to have a lock emanating from cache cleanup, so that you can't construct a query while it is running, and to construct the sql immediately on executing, which would let PG take care of ensuring the tables will all still exist because they'll get locked.

Coming back to this: we essentially already have such a lock (albeit not in postgres), in that the query state is changed to resetting before removing from cache, and query.get_query() will wait_until_complete() which blocks until the cache drop finishes. So if we could make "get query and execute" atomic, this problem should go away.
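Absent true atomicity, one way to approximate it is optimistic retry: execute the SQL, and if a dependency's cache table vanished in the gap, rebuild the SQL against the new cache state and try again. This is a swapped-in technique rather than something proposed in the thread, and all the names here (get_query, run_sql, UndefinedTable) are stand-ins for the real API:

```python
class UndefinedTable(Exception):
    """Stand-in for the error raised when a referenced table
    no longer exists (cf. psycopg2.errors.UndefinedTable)."""

def get_and_execute(query, run_sql, max_retries=3):
    """Approximate an atomic "get SQL and execute" by rebuilding
    the SQL whenever a dependency table was dropped in between."""
    for _ in range(max_retries):
        sql = query.get_query()   # reflects the current cache state
        try:
            return run_sql(sql)   # fails if a cache table was dropped
        except UndefinedTable:
            continue              # rebuild against the new state
    raise RuntimeError("cache state kept changing; giving up")
```

Because get_query blocks while a dependency is "resetting", each retry naturally waits for any in-progress cache drop to finish before rebuilding.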

jc-harrison avatar Mar 03 '23 13:03 jc-harrison