
Solve issue with losing track of SLURM jobs using cephfs

Open riederd opened this issue 8 months ago • 5 comments

The changes solve an issue with losing track of SLURM jobs on systems using cephfs when the number of jobs is high.

Test case: running a pipeline that creates thousands of jobs on a SLURM cluster with a queueSize of 600 and cephfs (reef 18.2.2) as the shared filesystem for the Nextflow work directories.

There are existing open issues and mentions of this problem on the issue tracker as well, including #2695, #5630 and #5650.

As @burcarjo already tried in #5650, adding ceph as a shared filesystem does not solve the issue because, at least in our tests, execution never reaches the relevant code that should trigger a metadata refresh.

The proposed fix does a metadata refresh in multiple locations of the code. In our tests we were able to run a pipeline with > 21k jobs without any hang. Without the proposed changes there was no way to get this pipeline to completion in a single run.

As the proposed changes refresh the metadata nearly once per job in GridTaskHandler.groovy, we also tried without the refresh in this part; however, it resulted in stalling again.
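
To illustrate the general idea of a metadata refresh through a directory listing, here is a minimal Java sketch. It is not the code from this PR: the class name MetadataRefresh, both method names, and the use of a .exitcode file as the completion marker are assumptions made for illustration only. Whether a plain listing is enough to invalidate cached metadata on a given cephfs client is something the sketch cannot demonstrate; it only shows the order of operations (list the directory first, then check for the file).

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class MetadataRefresh {

    /**
     * Hypothetical helper: iterates over all entries of dir so that the
     * filesystem client has to read the directory again.
     * Returns the number of entries seen.
     */
    public static int refreshByListing(Path dir) throws IOException {
        int count = 0;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path ignored : stream) {
                count++;
            }
        }
        return count;
    }

    /**
     * Hypothetical helper: lists the task work directory first, then checks
     * whether the exit file is visible. The file name is passed in by the
     * caller and is an assumption of this sketch.
     */
    public static boolean exitFileVisible(Path workDir, String exitFileName) throws IOException {
        refreshByListing(workDir);
        return Files.exists(workDir.resolve(exitFileName));
    }

    public static void main(String[] args) throws IOException {
        // A local temp directory stands in for a task work directory on the shared filesystem.
        Path workDir = Files.createTempDirectory("task-workdir");
        System.out.println(exitFileVisible(workDir, ".exitcode"));

        // Simulate the job writing its exit status.
        Files.writeString(workDir.resolve(".exitcode"), "0");
        System.out.println(exitFileVisible(workDir, ".exitcode"));
    }
}
```

Calling such a helper once per status check is what makes the number of listings scale with the number of jobs, which is the cost being weighed in this PR.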

riederd avatar Apr 15 '25 06:04 riederd

Deploy Preview for nextflow-docs-staging canceled.

Name Link
Latest commit 8a8a3d32fa88c959b64a14a5cfa49d392acae565
Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/67ffb0ad8590030008ddacb4

netlify[bot] avatar Apr 15 '25 06:04 netlify[bot]

So you did an ablation analysis and found that all of these refreshes were required to prevent the jobs from hanging?

bentsherman avatar Apr 16 '25 13:04 bentsherman

Yeah basically yes, I did run an ablation analysis and looked at the -trace output to identify possible places where we can try to refresh the metadata.

riederd avatar Apr 16 '25 14:04 riederd

I'm very hesitant regarding this PR. With a large number of jobs, continuously syncing the file system can lead to network congestion.

pditommaso avatar Apr 17 '25 08:04 pditommaso

I understand. In my tests the number of syncs/directory listings roughly corresponded to the number of submitted jobs (21k jobs ~ 20k refreshes), and we did not see any network issues.

I can try to omit the sync call, though, and only do the directory listing.

riederd avatar Apr 17 '25 09:04 riederd