Toil tries to use caching by default even when it's not appropriate
Toil version 5.6.0: running cactus on SLURM fails with the error sqlite3.OperationalError: database is locked:
[2022-09-01T14:24:28-0500] [MainThread] [W] [toil.leader] The job seems to have left a log file, indicating failure: 'LastzRepeatMaskJob' kind-LastzRepeatMaskJob/Z/instance-tk5kjaea v2
[2022-09-01T14:24:28-0500] [MainThread] [W] [toil.leader] Log from job "kind-LastzRepeatMaskJob/Z/instance-tk5kjaea" follows:
=========>
[2022-09-01T14:22:32-0500] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
[2022-09-01T14:22:32-0500] [MainThread] [I] [toil] Running Toil version 5.6.0-c34146a6437e4407a61e946e968bcce67a0ebbca on host cpu-19-16.
[2022-09-01T14:22:32-0500] [MainThread] [I] [toil.worker] Working on job 'LastzRepeatMaskJob' kind-LastzRepeatMaskJob/Z/instance-tk5kjaea v1
[2022-09-01T14:22:33-0500] [MainThread] [I] [toil.worker] Loaded body Job('LastzRepeatMaskJob' kind-LastzRepeatMaskJob/Z/instance-tk5kjaea v1) from description 'LastzRepeatMaskJob' kind-LastzRepeatMaskJob/Z/instance-tk5kjaea v1
Traceback (most recent call last):
  File "/lustre/work/mhoyosro/software/cactus3/cactus-bin-v2.1.1/cactus_env/lib/python3.8/site-packages/toil/worker.py", line 392, in workerScript
    with fileStore.open(job):
  File "/home/mhoyosro/conda/envs/cactus/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/lustre/work/mhoyosro/software/cactus3/cactus-bin-v2.1.1/cactus_env/lib/python3.8/site-packages/toil/fileStores/cachingFileStore.py", line 992, in open
    self._allocateSpaceForJob(self.jobDiskBytes)
  File "/lustre/work/mhoyosro/software/cactus3/cactus-bin-v2.1.1/cactus_env/lib/python3.8/site-packages/toil/fileStores/cachingFileStore.py", line 803, in _allocateSpaceForJob
    available = self.getCacheAvailable()
  File "/lustre/work/mhoyosro/software/cactus3/cactus-bin-v2.1.1/cactus_env/lib/python3.8/site-packages/toil/fileStores/cachingFileStore.py", line 454, in getCacheAvailable
    if self.cachingIsFree():
  File "/lustre/work/mhoyosro/software/cactus3/cactus-bin-v2.1.1/cactus_env/lib/python3.8/site-packages/toil/fileStores/cachingFileStore.py", line 560, in cachingIsFree
    for row in self.cur.execute('SELECT value FROM properties WHERE name = ?', ('freeCaching',)):
sqlite3.OperationalError: database is locked
[2022-09-01T14:24:27-0500] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host cpu-19-16
<=========
The workaround here would be to turn off Toil's caching system: --disableCaching=true.
I think this problem happens when the caching database (and also the Toil --workDir work directory, which defaults to $TMPDIR) isn't actually on storage local to each machine, but is instead on NFS or other shared storage. Then every job in the workflow tries to use the same caching database, instead of just the much smaller number of jobs on a single node that are supposed to share one, and that, combined with how slow file locks are on distributed filesystems, means some jobs can't get a lock on the caching database when they need one, and they fail.
If you do want caching on, --workDir really needs to be on local, non-shared storage. Otherwise the caching is mostly useless, unless your job store is even more remote than your shared filesystem.
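As a concrete illustration (a sketch only: the job store, seqFile, output name, and scratch path below are placeholders, not values from this issue), this is roughly where those two options go on a cactus command line:

```python
# Sketch only: the job store, seqFile, output HAL, and scratch directory are
# placeholder values, not taken from this issue.
import subprocess

node_local_scratch = "/tmp/toil-scratch"  # assumed to be per-node local disk, not NFS/Lustre

subprocess.run([
    "cactus",
    "./jobstore",               # Toil job store (placeholder)
    "evolverMammals.txt",       # cactus seqFile (placeholder)
    "evolverMammals.hal",       # output HAL (placeholder)
    "--batchSystem", "slurm",
    "--disableCaching=true",    # the workaround: turn off Toil's caching layer
    "--workDir", node_local_scratch,  # keep scratch (and the cache DB, if caching is on) off shared storage
], check=True)
```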
To avoid this problem in the future, we could make the default for the caching feature more complex: We could turn it off by default when using the file job store, or we could turn it off by default (and maybe warn?) if the caching database would be on something that looks like a shared filesystem.
This is related to #3769, and maybe some of what I was talking about in https://github.com/DataBiosphere/toil/issues/4122#issuecomment-1146047488
If the cache is host-local, then always putting it in a sub-directory that includes the host name as part of the path will prevent collisions.
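Just as a sketch of that idea (not Toil's actual cache-path code), something along these lines would give every host its own cache directory even on a shared mount:

```python
# Sketch of the idea only; Toil's real cache layout code is not shown here.
import os
import socket

def per_host_cache_dir(base_cache_dir: str) -> str:
    """Return a cache directory unique to this host, so hosts sharing a
    filesystem never open each other's cache database."""
    host_dir = os.path.join(base_cache_dir, socket.gethostname())
    os.makedirs(host_dir, exist_ok=True)
    return host_dir
```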
A warning will most likely not be noticed.
I am not sure how one detects a local vs a network file system. mount -l is not portable.
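One Linux-only heuristic (a sketch, assuming /proc/self/mounts is available and that the filesystem types listed below cover the interesting network filesystems) is to look up the filesystem type of the mount a path lives on:

```python
# Linux-only heuristic sketch; the set of "network" filesystem types is an
# assumption, not an exhaustive or authoritative list.
import os

NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "smbfs", "lustre", "gpfs",
                    "glusterfs", "ceph", "fuse.sshfs"}

def looks_like_network_fs(path: str) -> bool:
    """Guess whether `path` is on a network/shared filesystem by finding the
    longest mount point in /proc/self/mounts that is a prefix of the path."""
    path = os.path.realpath(path)
    best_mount, best_type = "", ""
    with open("/proc/self/mounts") as mounts:
        for line in mounts:
            fields = line.split()
            mount_point, fs_type = fields[1], fields[2]
            if (path == mount_point or path.startswith(mount_point.rstrip("/") + "/")) \
                    and len(mount_point) > len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type in NETWORK_FS_TYPES
```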
If we make each node use its own node-specific subdirectory in the cache directory, we could run into problems with space accounting: every worker sees 10 GB of free cache space, every worker tries to download and keep around 10 GB of cached files, some of them run out of disk because other workers are writing to the same mount, and none of them knows when to evict from its cache to free space up.
Though when the cache filesystem is shared, it is probably also shared with other users, so there's no reason to expect that all the space that looks free at the start of the run will actually stay available during the run anyway...
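To make the over-commit concrete (the numbers are made up to match the example above):

```python
# Made-up numbers illustrating the over-commit described above.
free_space_gb = 10                 # what every worker sees as free on the shared mount
workers_on_mount = 8               # workers on different nodes sharing that mount
planned_cache_gb = free_space_gb   # each worker independently plans to cache up to that much

total_demand_gb = workers_on_mount * planned_cache_gb
print(f"planned cache use: {total_demand_gb} GB, actually free: {free_space_gb} GB")
```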