Solve issue with losing track of SLURM jobs when using CephFS
The changes solve an issue with losing track of SLURM jobs on systems using CephFS when the number of jobs is high.
Test case: running a pipeline that creates thousands of jobs on a SLURM cluster with a queue size of 600 and CephFS (Reef 18.2.2) as the shared filesystem for the Nextflow workDirs.
There are existing open issues and mentions of this problem in the GitHub issue tracker as well, e.g. #2695, #5630, #5650.
As @burcarjo already tried in #5650, adding ceph as a shared filesystem does not solve the issue: at least in our tests, the relevant code that should trigger a metadata refresh is never reached.
The proposed fix performs a metadata refresh in multiple locations in the code. With these changes we were able to run a pipeline with more than 21k jobs without any hang; without them, there was no way to get this pipeline to complete in a single run.
Since the proposed changes refresh the metadata nearly once per job in GridTaskHandler.groovy, we also tried running without the refresh in that location, but the pipeline stalled again.
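For context, the refresh essentially boils down to listing the task's work directory before checking for its output files, so that the CephFS client revalidates its cached metadata instead of relying on a stale cache entry. A simplified sketch of the idea (illustrative names, not the actual patch):

```groovy
import java.nio.file.Files
import java.nio.file.Path

/*
 * Illustrative sketch only -- not the committed code. The idea: before
 * checking for the task's .exitcode file, enumerate the entries of the
 * task work directory so the CephFS client revalidates its metadata
 * cache and the file becomes visible to the polling process.
 */
class MetadataRefreshSketch {

    /** Force a metadata refresh by listing the directory entries. */
    static void refresh(Path dir) {
        if( !Files.isDirectory(dir) )
            return
        Files.newDirectoryStream(dir).withCloseable { entries ->
            // simply iterating the entries is enough to trigger revalidation
            entries.each { }
        }
    }

    /** Check for the exit file only after refreshing its parent directory. */
    static boolean exitFileVisible(Path workDir) {
        refresh(workDir)
        return Files.exists(workDir.resolve('.exitcode'))
    }
}
```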
So you did an ablation analysis and found that all of these refreshes were required to prevent the jobs from hanging?
Yeah, basically yes. I ran an ablation analysis and looked at the -trace output to identify possible places where we could try to refresh the metadata.
I'm very hesitant regarding this PR: with a large number of jobs, continuously syncing the file system can lead to network congestion.
I understand. In my tests the number of syncs/directory listings roughly corresponded to the number of submitted jobs (21k jobs ~ 20k refreshes), and we did not see any network issues.
I can try to omit the sync call, though, and only do the directory listing.
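To make the trade-off concrete, the two variants would look roughly like this (simplified sketch; the external sync(1) invocation below is only a placeholder assumption, the patch may issue the sync differently):

```groovy
import java.nio.file.Files
import java.nio.file.Path

/*
 * Sketch of the two refresh strategies being discussed. The sync is shown
 * here as an external sync(1) invocation purely for illustration; the
 * actual mechanism in the patch may differ.
 */
class RefreshStrategies {

    /** Variant with an explicit sync before listing the work directory. */
    static void refreshWithSync(Path workDir) {
        ['sync'].execute().waitFor()   // assumed placeholder for the sync call
        listDirectory(workDir)
    }

    /** Cheaper variant: rely on the readdir alone to revalidate metadata. */
    static void refreshByListingOnly(Path workDir) {
        listDirectory(workDir)
    }

    private static void listDirectory(Path dir) {
        if( !Files.isDirectory(dir) )
            return
        Files.newDirectoryStream(dir).withCloseable { entries ->
            entries.each { }   // touching each entry is sufficient
        }
    }
}
```

The listing-only variant keeps the per-job overhead to a single readdir on the task work directory instead of a filesystem-wide flush.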