FlowKit icon indicating copy to clipboard operation
FlowKit copied to clipboard

Cache shrink fails

Open jc-harrison opened this issue 2 years ago • 4 comments

For a while I've been occasionally seeing that the flowmachine cache-shrink task fails for no apparent reason. I've now found an exception corresponding to the failure:

Task exception was never retrieved
future: <Task finished name='Task-2' coro=<watch_and_shrink_cache() done, defined at /root/.local/share/virtualenvs/flowmachine-PkgRmmIU/lib/python3.8/site-packages/flowmachine-1.18.
1-py3.8.egg/flowmachine/core/cache.py:872> exception=TypeError("unsupported operand type(s) for -=: 'float' and 'NoneType'")>
Traceback (most recent call last):
  File "/root/.local/share/virtualenvs/flowmachine-PkgRmmIU/lib/python3.8/site-packages/flowmachine-1.18.1-py3.8.egg/flowmachine/core/cache.py", line 923, in watch_and_shrink_cache
    await asyncio.wait_for(
  File "/usr/local/lib/python3.8/asyncio/tasks.py", line 494, in wait_for
    return fut.result()
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/.local/share/virtualenvs/flowmachine-PkgRmmIU/lib/python3.8/site-packages/flowmachine-1.18.1-py3.8.egg/flowmachine/core/cache.py", line 618, in shrink_below_size
    current_cache_size -= cache_reduction
TypeError: unsupported operand type(s) for -=: 'float' and 'NoneType'

jc-harrison avatar May 03 '23 17:05 jc-harrison

Logs immediately preceding the exception:

{"submodule":"flowmachine.core.cache","dry_run":false,"query_id":"07efe1909cb534288193cc5b89cae74b","table":"cache.x07efe1909cb534288193cc5b89cae74b","table_size":null,"event":"Remove cache record.","logger":"flowmachine.debug","level":"info","timestamp":"2023-04-27T00:56:24.531560Z"}
{"submodule":"flowmachine.core.query","query_id":"07efe1909cb534288193cc5b89cae74b","action":"invalidate_db_cache","event":"Resetting state machine.","logger":"flowmachine.debug","level":"debug","timestamp":"2023-04-27T00:56:24.532496Z"}
{"submodule":"flowmachine.core.query","query_id":"07efe1909cb534288193cc5b89cae74b","action":"invalidate_db_cache","event":"Reset state machine.","logger":"flowmachine.debug","level":"debug","timestamp":"2023-04-27T00:56:24.533059Z"}
{"submodule":"flowmachine.core.query","query_id":"07efe1909cb534288193cc5b89cae74b","action":"invalidate_db_cache","event":"Getting table reference.","logger":"flowmachine.debug","level":"debug","timestamp":"2023-04-27T00:56:24.533185Z"}
{"submodule":"flowmachine.core.query","query_id":"07efe1909cb534288193cc5b89cae74b","action":"invalidate_db_cache","event":"Query not stored - no table..","logger":"flowmachine.debug","level":"debug","timestamp":"2023-04-27T00:56:24.535776Z"}
{"submodule":"flowmachine.core.query","query_id":"07efe1909cb534288193cc5b89cae74b","action":"invalidate_db_cache","table_name":"cache.x07efe1909cb534288193cc5b89cae74b","event":"Deleted cache record.","logger":"flowmachine.debug","level":"debug","timestamp":"2023-04-27T00:56:24.543376Z"}
{"submodule":"flowmachine.core.query","query_id":"07efe1909cb534288193cc5b89cae74b","action":"invalidate_db_cache","table_name":"cache.x07efe1909cb534288193cc5b89cae74b","event":"Dropped cache table.","logger":"flowmachine.debug","level":"debug","timestamp":"2023-04-27T00:56:24.544117Z"}
{"submodule":"flowmachine.core.query","query_id":"07efe1909cb534288193cc5b89cae74b","action":"invalidate_db_cache","table_name":"cache.x07efe1909cb534288193cc5b89cae74b","event":"Not cascading to dependents.","logger":"flowmachine.debug","level":"debug","timestamp":"2023-04-27T00:56:24.544927Z"}
{"submodule":"flowmachine.core.query","query_id":"07efe1909cb534288193cc5b89cae74b","action":"invalidate_db_cache","table_name":null,"event":"Dropping table outside cache schema.","logger":"flowmachine.debug","level":"debug","timestamp":"2023-04-27T00:56:24.545170Z"}
{"submodule":"flowmachine.core.cache","dry_run":false,"query_id":"07efe1909cb534288193cc5b89cae74b","table":"cache.x07efe1909cb534288193cc5b89cae74b","table_size":null,"event":"Removed cache record.","logger":"flowmachine.debug","level":"info","timestamp":"2023-04-27T00:56:24.547361Z"}

jc-harrison avatar May 03 '23 17:05 jc-harrison

So seems like the issue then is that we're getting a null from pg_total_relation_size, which means that the oid no longer exists, i.e. the actual table no longer exists?

greenape avatar May 31 '23 10:05 greenape

Hmm. I wonder whether it could be related to https://github.com/Flowminder/FlowKit/issues/5926?

jc-harrison avatar May 31 '23 10:05 jc-harrison

A fix would presumably be to return 0 for the size when that happens, but it slightly begs the question of how one arrived at a situation where there was a cache record but no corresponding table, which may as you say be a result of #5926.

I think probably exacerbated by:

CREATE OR REPLACE FUNCTION table_size(IN tablename TEXT, IN table_schema TEXT)
	RETURNS float AS
$$
  DECLARE table_size bigint;
  BEGIN
  SELECT pg_total_relation_size(c.oid) INTO table_size
              FROM pg_class c
              LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
              WHERE relkind = 'r' AND relname=tablename AND nspname=table_schema;
  RETURN table_size;
  END
$$ LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = public, pg_temp;

Because that's actually going to give you a null for a table that doesn't exist instead of giving you an error, as one would get if just doing pg_total_relation_size('<schema>.<table>').

greenape avatar May 31 '23 10:05 greenape