cylc clean's advice on which database file to remove to fix issues could be better
Description
Sometimes we end up with corrupted databases. Running cylc clean on an affected workflow gives this error:
[user@host Suite]$ cylc clean workflow_pp
WARNING - This database is either corrupted or not compatible
with this version of "cylc clean".
Try using the version of Cylc the workflow was last ran with to
remove it.
Otherwise please delete the database file.
CylcError: Clean failed:
Workflow: workflow_pp
Error: Cannot clean workflow_pp - no such table: task_jobs
Deleting $HOME/cylc-run/workflow_pp/runN/log/db doesn't fix the error.
Delving into the code shows that it specifically references $HOME/cylc-run/workflow_pp/runN/.service/db. Deleting this db resolves the issue and the workflow is cleaned up.
It appears there are two (usually identical?) copies of the database. These definitely aren't the same file:
[user@host ~]$ if [ "$(stat -L -c %d:%i $HOME/cylc-run/workflow_pp/runN/log/db)" = "$(stat -L -c %d:%i $HOME/cylc-run/workflow_pp/runN/.service/db)" ]; then
> echo "FILE1 and FILE2 refer to a single file, with one inode, on one device."
> else
> echo "no match"
> fi
no match
Deliberately corrupting $HOME/cylc-run/workflow_pp/runN/.service/db suggests that I can "recover" the database by copying $HOME/cylc-run/workflow_pp/runN/log/db over the top of $HOME/cylc-run/workflow_pp/runN/.service/db.
[user@host ~]$ > cylc-run/workflow_pp/runN/.service/db # truncate to empty file
[user@host ~]$ cylc clean workflow_pp
WARNING - This database is either corrupted or not compatible
with this version of "cylc clean".
Try using the version of Cylc the workflow was last ran with to
remove it.
Otherwise please delete the database file.
CylcError: Clean failed:
Workflow: workflow_pp
Error: Cannot clean workflow_pp - no such table: task_jobs
[user@host ~]$ cp cylc-run/workflow_pp/run1/log/db cylc-run/workflow_pp/run1/.service/
[user@host ~]$ cylc clean workflow_pp
Would clean the following workflows:
workflow_pp/run1
Remove these workflows (y/n): y
INFO - Cleaning workflow_pp/run1 on install target: user:c:kit
INFO - [user:c:kit]
INFO - Removing symlink and its target directory: ...
INFO - Removing symlink and its target directory: ...
INFO - Removing directory: ...
INFO - Removing directory: ...
INFO - Removing directory: ...
INFO - Removing directory: ...
INFO - Removing directory: ...
INFO - Removing directory: ...
INFO - Removing directory: ...
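A note on why the truncation test reports "no such table: task_jobs" rather than an error about an unreadable file: SQLite treats a zero-byte file as a valid, empty database, so the truncated db opens cleanly and only fails once a table is queried. The sketch below is a minimal pre-flight check in Python. The helper name has_table is hypothetical and not part of Cylc; only the task_jobs table name is taken from the error output above.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def has_table(db_path, table="task_jobs"):
    """Return True if the SQLite file at db_path contains the named table.

    Hypothetical helper, not part of Cylc. The file is opened read-only
    so the check cannot create or modify anything.
    """
    path = Path(db_path)
    if not path.is_file():
        return False
    # Build a file: URI so SQLite can be asked for read-only access.
    uri = path.resolve().as_uri() + "?mode=ro"
    try:
        with closing(sqlite3.connect(uri, uri=True)) as conn:
            row = conn.execute(
                "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
                (table,),
            ).fetchone()
    except sqlite3.DatabaseError:
        # Raised, for example, when the file is not a SQLite database.
        return False
    return row is not None
```

Calling has_table on both .service/db and log/db would show which copy is missing the table before running cylc clean.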
Reproducible Example
- Truncate .service/db: > $HOME/cylc-run/workflow_pp/runN/.service/db
- cylc clean workflow_pp # Note it fails
- Remove "obvious" database: rm $HOME/cylc-run/workflow_pp/runN/log/db
- cylc clean workflow_pp # Note it continues to fail
Expected Behaviour
The error message would state that $HOME/cylc-run/workflow_pp/runN/.service/db, specifically, is the file that needs to be deleted. For example:
WARNING - This database is either corrupted or not compatible
with this version of "cylc clean".
Try using the version of Cylc the workflow was last ran with to
remove it.
Otherwise please delete the database file: workflow_pp/runN/.service/db
CylcError: Clean failed:
Workflow: workflow_pp
Error: Cannot clean workflow_pp - no such table: task_jobs
Sometimes we end up with corrupted databases
Bigger question: why are you getting corrupted databases? This should not happen and hints at a deeper problem.
Filesystem locks and the SQLite implementation should make this impossible. This page outlines the circumstances under which corruption can happen: https://sqlite.org/howtocorrupt.html
It appears there are two (usually identical?) copies of the database
Yes, Cylc maintains two databases, a "private" database in the .service directory for use by the Cylc scheduler and a "public" database in the log directory which may be used by downstream services such as cylc review, rose_prune, fcm_make, etc.
Due to the nature of sqlite, parallel access can potentially result in DB locking issues. The Cylc scheduler will detect a locked public database and recover it from the private database. So the public database serves to isolate the scheduler's database from external interference.
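For illustration only, here is one way a SQLite database file can be replaced with a consistent copy of another, using Python's sqlite3 backup API. This is a sketch under assumptions, not Cylc's recovery code: the helper name copy_sqlite_db is hypothetical, and nothing in this thread confirms how the scheduler performs its own recovery.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def copy_sqlite_db(src_path, dest_path):
    """Replace dest_path with a consistent copy of the database at src_path.

    Hypothetical helper, not Cylc's implementation. The copy is written to
    a temporary file next to dest_path and then moved into place, so an
    unreadable file at dest_path is never opened.
    """
    if not os.path.isfile(src_path):
        raise FileNotFoundError(src_path)
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    os.close(fd)
    try:
        with closing(sqlite3.connect(src_path)) as src, \
                closing(sqlite3.connect(tmp_path)) as tmp:
            src.backup(tmp)  # SQLite online backup API (Python 3.7+)
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Do not leave a partial copy behind if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The helper is direction-agnostic: the source can be either copy, for example log/db when .service/db is the damaged one, as in the cp shown in the issue description. Unlike a plain file copy, the backup API reads the source through SQLite.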
The cylc clean command uses the workflow's database to determine which remote filesystems Cylc has installed the workflow onto, so that it can locate and remove these installations. Sadly, if the database has become corrupted, cylc clean cannot do this. The only thing you can do is get Cylc to remove the local files (cylc clean --local) and then remove the remote files manually. Removing this database has a similar effect (make sure you clean up those remote files!).
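To make the dependency concrete, the sketch below shows the kind of lookup that fails when the table is missing. It is illustrative only: the function name job_platforms is hypothetical, the platform_name column is an assumption about the schema that this thread does not confirm, and this is not Cylc's own query. Only the task_jobs table name comes from the error message.

```python
import sqlite3
from contextlib import closing


def job_platforms(db_path):
    """Return the distinct platform names recorded in the task_jobs table.

    Illustrative only: the platform_name column is an assumption about the
    schema, and this is not Cylc's own query.
    """
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT DISTINCT platform_name FROM task_jobs"
        ).fetchall()
    # Ignore NULL or empty entries.
    return {name for (name,) in rows if name}
```

Against a truncated (zero-byte) file this raises sqlite3.OperationalError with the message "no such table: task_jobs", which matches the error text reported by cylc clean above.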
Closed by #6234?
@jarich, we've changed the error message slightly as requested.
However, this isn't something that should be possible; corrupted databases are a cause for concern. I'll close this issue now, but feel free to follow up on the database issue.