
The scheduler and file system can race

Open unito-bot opened this issue 3 years ago • 5 comments

┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-1054

unito-bot avatar Oct 12 '21 16:10 unito-bot

The problem here can manifest as https://github.com/DataBiosphere/toil/issues/3758#issuecomment-906310128 which @Guigzai experienced. If we hear back from the scheduler/batch system that a job is done but not all its writes are visible on disk, we can get into trouble.

Dealing with missing files might just involve waiting for them to appear, but dealing with files that were supposed to have been replaced/modified but weren't is going to be harder.

adamnovak avatar Oct 13 '21 14:10 adamnovak

@adamnovak is this being considered in the future development plan? Kindly help, as this recurs often and affects many of the pipelines we work on.

rohith-bs avatar May 24 '23 11:05 rohith-bs

Hello,

We have updated Toil from 5.7.1 to 5.11.0 and run our tests.

The update fixes a recurring bug (see #3758 and #4092).

Thank you for your work.

Guigzai avatar Jun 16 '23 15:06 Guigzai

➤ Adam Novak commented:

We should add a clock to the job description in the file job store, so we can look at it and know whether or not the writes (or job description deletion) from any particular invocation of the job are visible. Then the leader can poll until the writes are visible, and then proceed with re-scheduling the job or scheduling the successors.
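A minimal sketch of that idea, assuming the job description is a JSON file carrying a hypothetical integer `clock` field that each invocation bumps when it writes. The names `write_job_description` and `wait_for_invocation` are illustrative and not part of Toil's API. The sketch also assumes the file existed before the job ran, so a missing file means the deletion is visible.

```python
import json
import os
import tempfile
import time


def write_job_description(path: str, clock: int, payload: dict) -> None:
    """Worker side (hypothetical): write the description stamped with a clock.

    Writes to a temporary file and renames it into place so a reader never
    sees a half-written description on a local file system.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump({"clock": clock, "payload": payload}, handle)
    os.replace(tmp_path, path)


def wait_for_invocation(path: str, expected_clock: int,
                        timeout: float = 60.0, interval: float = 0.5) -> str:
    """Leader side (hypothetical): poll until this invocation's writes show up.

    Returns "updated" once the on-disk clock reaches expected_clock, or
    "deleted" once the description is gone. Raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path) as handle:
                clock = json.load(handle)["clock"]
        except FileNotFoundError:
            return "deleted"
        except (ValueError, KeyError):
            clock = -1  # unreadable or partially visible; treat as stale
        if clock >= expected_clock:
            return "updated"
        if time.monotonic() >= deadline:
            raise TimeoutError(f"clock {expected_clock} not visible at {path}")
        time.sleep(interval)
```

The leader would call `wait_for_invocation` with the clock value it expects from the invocation that the batch system just reported as finished, and only then re-schedule the job or schedule its successors.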

unito-bot avatar Sep 19 '23 17:09 unito-bot

➤ Adam Novak commented:

We should maybe just steal Snakemake’s idea of having a configurable wait time for the filesystem to settle.
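For reference, Snakemake exposes this as its `--latency-wait` option. A rough sketch of the same approach, where `wait_for_files` and its parameters are hypothetical names and not an existing Toil setting:

```python
import os
import time
from typing import Iterable, List


def wait_for_files(paths: Iterable[str], latency_wait: float = 5.0,
                   interval: float = 0.25) -> List[str]:
    """Wait up to latency_wait seconds for every path to become visible.

    Returns the paths that are still missing when the wait ends; an empty
    list means the file system has settled as far as this check can tell.
    """
    deadline = time.monotonic() + latency_wait
    missing = [p for p in paths if not os.path.exists(p)]
    while missing and time.monotonic() < deadline:
        time.sleep(interval)
        missing = [p for p in missing if not os.path.exists(p)]
    return missing
```

This only covers files that have not appeared yet. As noted earlier in the thread, files that were supposed to be replaced or modified but still show old contents would not be caught by an existence check.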

unito-bot avatar Feb 06 '24 18:02 unito-bot