If the disk where a FileJobStore is stored fills up, Toil can get into a state where the root job does not exist, but neither does the ultimate source for its promised return value
An inconsistent state after running out of disk space might be the cause of this issue, which wasted a month of compute:
https://github.com/ComparativeGenomicsToolkit/cactus/issues/909
Issue is synchronized with Jira story TOIL-1270.
The log from trying to restart from the offending job store looks like this:
root@2db0c9c3c2ce:/data# cactus --restart jobStore/ cactus_20DEC2022.config marsupials_20DEC2022.hal
[2023-01-23T21:28:52+0000] [MainThread] [I] [toil.statsAndLogging] Enabling realtime logging in Toil
[2023-01-23T21:28:52+0000] [MainThread] [I] [toil.statsAndLogging] Cactus Command: /home/cactus/cactus_env/bin/cactus --restart jobStore/ cactus_20DEC2022.config marsupials_20DEC2022.hal
[2023-01-23T21:28:52+0000] [MainThread] [I] [toil.statsAndLogging] Cactus Commit: b699f6bfb4fe88b7876e533dede24e48bbfeb700
[2023-01-23T21:28:52+0000] [MainThread] [W] [toil.common] Requested restart but the workflow has already been completed; allowing exports to rerun.
Traceback (most recent call last):
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 989, in restart
self._jobStore.load_root_job()
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/jobStores/abstractJobStore.py", line 266, in load_root_job
raise JobException("The root job '%s' doesn't exist. Either the Toil workflow "
toil.job.JobException: The root job 'kind-progressive_workflow/instance-rz5ckgcx' doesn't exist. Either the Toil workflow is finished or has never been started
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/cactus/cactus_env/bin/cactus", line 8, in <module>
sys.exit(main())
File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/progressive/cactus_progressive.py", line 365, in main
hal_id = toil.restart()
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 993, in restart
return self._jobStore.get_root_job_return_value()
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/jobStores/abstractJobStore.py", line 300, in get_root_job_return_value
return safeUnpickleFromStream(fH)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1687, in safeUnpickleFromStream
return pickle.loads(string)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2911, in __new__
return cls._resolve(*args)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2923, in _resolve
value = safeUnpickleFromStream(fileHandle)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1687, in safeUnpickleFromStream
return pickle.loads(string)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2911, in __new__
return cls._resolve(*args)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2923, in _resolve
value = safeUnpickleFromStream(fileHandle)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1687, in safeUnpickleFromStream
return pickle.loads(string)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 3008, in __setstate__
raise RuntimeError(
RuntimeError: This job was passed promise files/no-job/file-a3084765d34e4412820383ce74e9ef53/stream that wasn't yet resolved when it ran. The job 'export_hal' kind-export_hal/instance-2j6g929e v0 that fulfills this promise hasn't yet finished. This means that there aren't enough constraints to ensure the current job always runs after 'export_hal' kind-export_hal/instance-2j6g929e v0. Consider adding a follow-on indirection between this job and its parent, or adding this job as a child/follow-on of 'export_hal' kind-export_hal/instance-2j6g929e v0.
It looks like the root job is missing, either because it finished and was deleted, or because we attempted to replace it and lost it. When we go to get the root job's return value, we follow a chain of promises for return values until we get to one that is missing.
So either the root job got deleted before the workflow finished, or a promised return value file went missing, maybe after the workflow finished.
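To make that failure shape concrete, here's a minimal self-contained sketch of what the restart path is effectively doing; `PromiseRef` and `resolve_return_value` are invented names for illustration, not Toil's real classes:

```python
# Illustrative sketch only; a dict stands in for the job store's files.
import pickle
from dataclasses import dataclass

@dataclass
class PromiseRef:
    """Stands in for a pickled promise that points at another value file."""
    file_id: str

def resolve_return_value(files, file_id):
    """Follow a chain of pickled promise references until a concrete value appears."""
    value = pickle.loads(files[file_id])
    while isinstance(value, PromiseRef):
        if value.file_id not in files:
            # This is the shape of the failure in the traceback: the file the
            # promise points at is gone, so the chain can't be resolved.
            raise RuntimeError(f"promise file {value.file_id} was never resolved")
        value = pickle.loads(files[value.file_id])
    return value

# Works while every link in the chain exists...
files = {
    "root-return-value": pickle.dumps(PromiseRef("export_hal-return-value")),
    "export_hal-return-value": pickle.dumps("alignment.hal"),
}
assert resolve_return_value(files, "root-return-value") == "alignment.hal"

# ...and reproduces the reported error once the last link goes missing.
del files["export_hal-return-value"]
try:
    resolve_return_value(files, "root-return-value")
except RuntimeError as e:
    print(e)
```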
We do indeed make sure to replace jobs only with renames, which we depend on being atomic, so I'm not sure how the root job could have gone unexpectedly missing unless the filesystem doesn't work the way we depend on:
https://github.com/DataBiosphere/toil/blob/82c43671ec8cd08df2810f8852887ee1b09d12dd/src/toil/jobStores/fileJobStore.py#L255-L265
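For reference, the pattern relied on there is roughly this (a simplified sketch assuming a POSIX local filesystem, not the actual fileJobStore code):

```python
# Simplified sketch of the rename-over pattern; the real code is at the link above.
import os
import tempfile

def atomic_replace(path, data):
    """Write data to a temp file in the same directory, then rename it over path.

    On a local POSIX filesystem, os.rename() within one filesystem is atomic, so
    a reader opening `path` sees either the old contents or the new contents,
    never a missing or half-written file.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the new contents hit the disk first
        os.rename(tmp_path, path)   # the step we depend on being atomic
    except BaseException:
        os.unlink(tmp_path)
        raise
```

The whole point of writing to a temp file and renaming over the old one is that a reader should see either the old job file or the new one, never neither, which is exactly the guarantee that seems to have been violated here.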
I'm more suspicious of the promise logic: we seem to add promises' value files to a Promise-class-scoped list of files to delete when we load them:
https://github.com/DataBiosphere/toil/blob/82c43671ec8cd08df2810f8852887ee1b09d12dd/src/toil/job.py#L3245
These get attached to whatever job next has an _executor context manager entered:
https://github.com/DataBiosphere/toil/blob/1e94d76640be0ee58c5c746eb03fe98863fdc869/src/toil/job.py#L2661-L2672
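Paraphrasing the two linked snippets into one self-contained sketch (invented names and a dict standing in for the job store, not Toil's actual API), the coupling that worries me looks like this:

```python
import pickle
from contextlib import contextmanager

class Promise:
    # Class-scoped: shared by every promise resolved anywhere in this process.
    filesToDelete = set()

    @classmethod
    def _resolve(cls, files, file_id):
        # Loading a promised value registers its backing file for deletion as a
        # side effect, in process-wide state rather than on a specific job.
        cls.filesToDelete.add(file_id)
        return pickle.loads(files[file_id])

@contextmanager
def executor(job_name):
    # Stand-in for the _executor context manager: whichever job enters this
    # next inherits the accumulated promise value files and cleans them up.
    try:
        yield
    finally:
        to_delete, Promise.filesToDelete = Promise.filesToDelete, set()
        print(f"{job_name} cleans up promise files: {sorted(to_delete)}")

# One process loads a promised value...
files = {"promise-value-1": pickle.dumps(42)}
value = Promise._resolve(files, "promise-value-1")

# ...and the deletion gets attributed to whatever job happens to run next in
# that process, which is the coupling the questions above are about.
with executor("some_unrelated_job"):
    pass
```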
I'm not convinced of the correctness of this logic just by reading the code; are we sure that promise value loading always and only happens at the right time in the right processes for this to make sense? Are we sure nothing relied on here changed with e.g. #4259 when we moved some cleanup into the leader?
> We do indeed make sure to replace jobs only with renames, which we depend on being atomic, so I'm not sure how the root job could have gone unexpectedly missing unless the filesystem doesn't work the way we depend on.
Maybe they are using NFS :-( :-(
Apparently there are a couple ways you can get a rename-over to make it look like the file doesn't exist in NFS:
- You look up the file by the name, but it gets clobbered before you finish opening it, so you get an ENOENT.
- The rename in NFS is really a delete-and-recreate dance and you look between the delete and the recreate.
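A toy simulation of the second case, just to make the window concrete (deliberately broken code, not how any real filesystem client is implemented):

```python
import os
import tempfile

def non_atomic_rename(src, dst):
    """Deliberately broken 'rename': delete the destination, then recreate it."""
    if os.path.exists(dst):
        os.unlink(dst)
    # A reader looking up dst right here gets ENOENT, and if the next step fails
    # (say, because a quota is hit), dst stays missing for good.
    os.rename(src, dst)

with tempfile.TemporaryDirectory() as d:
    old = os.path.join(d, "job")
    new = os.path.join(d, "job.new")
    with open(old, "w") as f:
        f.write("old job description")
    with open(new, "w") as f:
        f.write("new job description")
    non_atomic_rename(new, old)
    with open(old) as f:
        print(f.read())  # "new job description" if nothing went wrong in the gap
```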
But it's not clear whether NFS can fail in a way that leaves you stuck in one of these states, or whether e.g. the delete will roll back if there somehow isn't disk for the recreate. NFS isn't the real backing filesystem on disk, and a rename failing in such a way that the file is just gone would violate "The RENAME operation must be atomic to the client.", since the client issuing the RENAME would eventually see that the file had simply vanished.
Running out of disk in the middle of a delete-and-recreate dance is a candidate cause for this issue, since we think it is disk space quota related. Maybe adding the quota on (this?) NFS can break rename atomicity for the issuing client?
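For what it's worth, a retry on ENOENT would only cover the transient flavor of this; a hedged sketch (nothing like this exists in Toil, and the helper name is made up):

```python
import time

def read_with_enoent_retry(path, attempts=5, delay=0.1):
    """Read a file, retrying briefly if it transiently appears not to exist.

    This can paper over a lookup racing a rename-over on NFS (the first case
    above). It cannot help if the delete-and-recreate dance stops halfway and
    the file really is gone, which is what the traceback looks like.
    """
    for attempt in range(attempts):
        try:
            with open(path, "rb") as f:
                return f.read()
        except FileNotFoundError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)  # give a concurrent rename-over time to land
```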