toil
The scheduler and file system can race
┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-1054
The problem here can manifest as in https://github.com/DataBiosphere/toil/issues/3758#issuecomment-906310128, which @Guigzai experienced. If the scheduler/batch system tells us that a job is done before all of its writes are visible on disk, we can get into trouble.
Dealing with missing files might just involve waiting for them to appear, but dealing with files that were supposed to have been replaced/modified but weren't is going to be harder.
@adamnovak is this being considered in the future development plan? Kindly help, as this recurs often and impacts many of the pipelines we work on.
Hello,
We have updated Toil from 5.7.1 to 5.11.0 and run our tests.
The update fixes the recurring bug described in #3758 and #4092.
Thank you for your work.
➤ Adam Novak commented:
We should add a clock to the job description in the file job store, so we can look at it and know whether or not the writes (or job description deletion) from any particular invocation of the job are visible. Then the leader can poll until the writes are visible, and then proceed with re-scheduling the job or scheduling the successors.
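A minimal sketch of that idea, not Toil's actual file job store code. It assumes the job description is a JSON file carrying an integer `clock` field, and that the leader knows which clock value a given invocation is supposed to leave behind (or that it expects the file to be deleted). The names `write_job_description`, `writes_visible`, and `wait_for_visibility` are hypothetical.

```python
import json
import os
import tempfile
import time
from typing import Optional


def write_job_description(path: str, clock: int, body: dict) -> None:
    """Write a job description stamped with a clock value (hypothetical format)."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump({"clock": clock, "body": body}, handle)
        handle.flush()
        os.fsync(handle.fileno())
    # Rename into place so readers never see a half-written file locally.
    os.replace(tmp_path, path)


def writes_visible(path: str, expected_clock: Optional[int]) -> bool:
    """Check whether the expected state of the job description is visible.

    expected_clock=None means the invocation should have deleted the file.
    """
    try:
        with open(path) as handle:
            seen = json.load(handle).get("clock")
    except FileNotFoundError:
        return expected_clock is None
    except ValueError:
        # Unparseable content: treat as not yet visible.
        return False
    if expected_clock is None:
        return False
    return isinstance(seen, int) and seen >= expected_clock


def wait_for_visibility(path: str, expected_clock: Optional[int],
                        timeout: float = 60.0,
                        poll_interval: float = 0.5) -> bool:
    """Poll until the expected state is visible; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if writes_visible(path, expected_clock):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

The leader would call `wait_for_visibility` after the batch system reports the job finished, and only then re-schedule the job or schedule its successors. A stale file with an older clock is distinguishable from an updated one, which is the case plain existence checks cannot handle.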
➤ Adam Novak commented:
We should maybe just steal Snakemake’s idea of having a configurable wait time for the filesystem to settle.
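Roughly what a configurable settle time could look like, as a sketch only. The parameter name `latency_wait` is borrowed from Snakemake's `--latency-wait` option; the function `wait_for_files` and its defaults are hypothetical and not part of Toil.

```python
import os
import time
from typing import Iterable, List


def wait_for_files(paths: Iterable[str],
                   latency_wait: float = 5.0,
                   poll_interval: float = 0.2) -> List[str]:
    """Wait up to latency_wait seconds for every path to exist.

    Returns the paths that are still missing when the wait ends
    (an empty list means everything showed up).
    """
    deadline = time.monotonic() + latency_wait
    missing = [p for p in paths if not os.path.exists(p)]
    while missing and time.monotonic() < deadline:
        time.sleep(poll_interval)
        # Only re-check the paths that were missing last time.
        missing = [p for p in missing if not os.path.exists(p)]
    return missing
```

This only covers files that are missing. It cannot tell whether a file that was supposed to be replaced or modified is still the old version, so it would not help with that harder case.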