cylc-flow public database not closed correctly - consider persisting the connection

Progress:

[x] Public DB not closed correctly (#4222)
[ ] Persist connections to the private and possibly public databases during lifetime of scheduler?

The Issue:

Cylc is not closing the public database file (despite logic appearing to close it correctly). We also seem to be hanging onto the canonical log dir.

Once Python exits the file descriptor terminates so this hasn't proven symptomatic before, however, with #3654 we can run multiple schedulers in the same event loop, this bug keeps the public database open until Python exits which prevents reliably housekeeping.

Connect - Close:

At the moment we connect to and close the database very often, too often. According to the code comments this was an early form of filesystem integrity checking which caused Cylc to exit if the run directory had been emptied.

We now have a separate check for this (which can be improved if necessary) so should be able to dispense with the (highly inefficient) connection/closing of the DB.

Apply this diff then look at the logs. Warning it might scare you, sometimes ignorance is bliss:

diff --git a/cylc/flow/rundb.py b/cylc/flow/rundb.py
index 37f3789ad..19dda6ed1 100644
--- a/cylc/flow/rundb.py
+++ b/cylc/flow/rundb.py
@@ -367,7 +367,9 @@ class CylcSuiteDAO(object):
 
     def close(self):
         """Explicitly close the connection."""
+        LOG.warning('# close')
         if self.conn is not None:
+            LOG.warning('$ close')
             try:
                 self.conn.close()
             except sqlite3.Error:
@@ -376,7 +378,9 @@ class CylcSuiteDAO(object):
 
     def connect(self):
         """Connect to the database."""
+        LOG.warning('# connect')
         if self.conn is None:
+            LOG.warning('$ re-connect')
             self.conn = sqlite3.connect(self.db_file_name, self.CONN_TIMEOUT)
         return self.conn

The Evidence:

This diff reveals that the public database (which should be closed during schd.shutdown() appears to remain open until Python exits.

diff --git a/cylc/flow/scheduler.py b/cylc/flow/scheduler.py
index 5eb285f8d..44c0e8369 100644
--- a/cylc/flow/scheduler.py
+++ b/cylc/flow/scheduler.py
@@ -121,6 +121,12 @@ class SchedulerUUID(object):
         return self.value
 
 
+def log_open_files():
+    import psutil
+    proc = psutil.Process()
+    LOG.warning(f'# {proc.open_files()}')
+
+
 class Scheduler(object):
     """Cylc scheduler server."""
 
@@ -355,6 +361,7 @@ class Scheduler(object):
 
     def close_logs(self):
         """Close the Cylc logger."""
+        log_open_files()
         LOG.info("DONE")  # main thread exit
         self.profiler.stop()
         for handler in LOG.handlers:

2020-06-30T09:07:18+01:00 WARNING - # [popenfile(path=~/cylc-run/one/log/suite/log.20200630T090714+01', fd=6),
	popenfile(path='~/cylc-run/one/log/db', fd=35)]

Alternative approach, inserting a sleep then using lsof | grep <pid> yields the same result:

diff --git a/cylc/flow/scheduler.py b/cylc/flow/scheduler.py
index 5eb285f8d..e6f07f27b 100644
--- a/cylc/flow/scheduler.py
+++ b/cylc/flow/scheduler.py
@@ -355,6 +363,7 @@ class Scheduler(object):
 
     def close_logs(self):
         """Close the Cylc logger."""
+        log_open_files()
         LOG.info("DONE")  # main thread exit
         self.profiler.stop()
         for handler in LOG.handlers:

Jun 30 '20 08:06 oliver-sanders

This is causing the integration tests' log directories to stick around during cleanup. Quite strange, I've done some debugging and I can't figure out what part of the code is still maintaining a connection to the public db after the Scheduler.shutdown() method

May 19 '21 18:05 MetRonnie

I think I've found the source of the issue. It doesn't seem to be a maintained sqlite3 connection, in fact. I think it's because the WorkflowDatabaseManager.copy_pri_to_pub() method isn't letting go of the pub_dao file for some reason

May 20 '21 10:05 MetRonnie

Hmmm. It's just setting self.pub_dao.conn = None, it isn't calling conn.close(), might be related.

May 20 '21 10:05 oliver-sanders

I think I've solved it, will put up a PR after lunch

May 20 '21 11:05 MetRonnie

Just to note: there's some suggestions in https://github.com/cylc/cylc-flow/pull/4222#discussion_r636771404 which I started working on a while ago and will revisit at some point

Aug 24 '21 16:08 MetRonnie

(removed the bug label since the bug portion of this issue has been resolved)

Aug 02 '22 15:08 oliver-sanders