
Postgres Connection Timeout Crash

Open · omad opened this issue 2 months ago · 9 comments

I'm hunting a crash for some users running ODC code on the NCI, somewhere in the database connection handling. A unique quirk of the NCI environment is that idle TCP connections are killed after 5 minutes, whereas in most environments the default is 2 hours. This is probably relevant, but I'm not sure.

An example stack trace:
  File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/dass_logs.py", line 60, in decorated
    raise ex
  File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/dass_logs.py", line 52, in decorated
    result = fn(*args, **kwargs)
  File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/ard_scene_select.py", line 851, in scene_select
    l1_count, uuids2archive = l1_scenes_to_process(
  File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/ard_scene_select.py", line 588, in l1_scenes_to_process
    files2process, uuids2archive, duplicates = l1_filter(
  File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/ard_scene_select.py", line 446, in l1_filter
    for l1_dataset in dc.index.datasets.search(product=l1_product, time=month_as_range(year, month)):
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_datasets.py", line 548, in search
    for product, datasets in self._do_search_by_product(query,
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_datasets.py", line 691, in _do_search_by_product
    with self._db_connection() as connection:
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_transaction.py", line 60, in _db_connection
    with self._index._active_connection(transaction=transaction) as conn:
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/index.py", line 157, in _active_connection
    with self._db._connect() as conn:
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/drivers/postgres/_connections.py", line 231, in _connect
    connection.close()
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1251, in close
    conn.close()
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1166, in close
    self._checkin()
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1008, in _checkin
    _finalize_fairy(
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 800, in _finalize_fairy
    connection_record.checkin()
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 542, in checkin
    finalizer(connection)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 677, in _reset_characteristics
    characteristic.reset_characteristic(self, dbapi_connection)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/characteristics.py", line 50, in reset_characteristic
    dialect.reset_isolation_level(dbapi_conn)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 748, in reset_isolation_level
    self.set_isolation_level(dbapi_conn, self.default_isolation_level)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 851, in set_isolation_level
    connection.set_isolation_level(level)
psycopg2.OperationalError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

I can reliably cause exceptions using plain psycopg2 by attempting to use a connection that's been idle for more than 5 minutes.
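Roughly, the plain-psycopg2 reproduction looks like this (a minimal sketch; the connection string is a placeholder):

    import time
    import psycopg2

    conn = psycopg2.connect("dbname=datacube host=dbhost user=me")

    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchone())

    # Leave the connection idle past the NCI's 5-minute TCP kill.
    time.sleep(330)

    with conn.cursor() as cur:
        # On the NCI this raises psycopg2.OperationalError:
        # "server closed the connection unexpectedly"
        cur.execute("SELECT 1")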

I haven't been able to reproduce the exception through the ODC Index API. I've been inserting time.sleep(330) between calls, which was all it took to trigger the error with plain psycopg2.
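The shape of what I've been trying through the Index API is roughly this (a sketch; the product name is a placeholder):

    import time
    import datacube

    dc = datacube.Datacube(app="idle-connection-test")

    # First query checks a connection out of the pool and back in.
    print(dc.index.datasets.count(product="example_product"))

    # Idle for longer than the 5-minute TCP timeout.
    time.sleep(330)

    # Plain psycopg2 fails at this point; via the Index API it hasn't.
    print(dc.index.datasets.count(product="example_product"))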

I suspect that ODC is more resilient due to setting pool_recycle=60 on the SQLAlchemy connection pool. https://github.com/opendatacube/datacube-core/blob/4f1e636f8969e553d12835c73a6fedb28844f065/datacube/drivers/postgres/_connections.py#L131-L134
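For context, pool_recycle behaves roughly like this (a standalone SQLAlchemy sketch, not ODC's actual engine setup; the URL is a placeholder):

    from sqlalchemy import create_engine, text

    engine = create_engine(
        "postgresql+psycopg2://user:password@dbhost/datacube",
        pool_recycle=60,  # replace pooled connections more than 60 seconds old at checkout
    )

    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
    # The connection goes back to the pool here. If it is more than 60 seconds
    # old when it's next checked out, the pool closes it and opens a fresh one,
    # which would sidestep the 5-minute idle-TCP kill.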

One line in the traceback that I'm suspicious of is line 231 of _connections.py, since I think exiting the context would already close the connection, making the manual .close() call unnecessary. https://github.com/opendatacube/datacube-core/blob/4f1e636f8969e553d12835c73a6fedb28844f065/datacube/drivers/postgres/_connections.py#L226-L231
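What I mean is roughly this pattern (a simplified illustration, not the actual ODC code): exiting the with block already returns the connection to the pool, and the traceback shows it's that close/checkin step where SQLAlchemy resets the isolation level and trips over the dead socket.

    from contextlib import contextmanager

    @contextmanager
    def _connect(engine):
        with engine.connect() as connection:
            try:
                yield connection
            finally:
                connection.close()  # the enclosing `with` block would do this anyway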

Summary

  • The script in the traceback doesn't crash every run.
  • This isn't the only time we've seen crashes that look like this on the NCI.
  • I've run out of steam investigating this for now, since I can't get a small reproducible example.
  • I really want something reproducible before attempting a fix.
  • The connection handling code in ODC is far more complicated than I would have expected, and I'm not convinced it needs to be. Maybe it does; I'm just not sure. :)

omad · Jun 27 '24 03:06