webknossos-libs
Expose squeue polling interval as SlurmExecutor parameter and allow setting via environment variable
Hi,
We regularly run into problems with our SLURM cluster while running (very) large downsampling jobs on webknossos datasets, either via the CLI or via chunk-wise processing of large datasets with the cluster-tools Python API. In particular, we get the error message:
error: slurm_receive_msgs: [[$hostname]:$port] failed: Socket timed out on send/recv operation
Our cluster team traced the error down to the SLURM controller being overwhelmed by the number of squeue requests. Technically we could downscale the number of concurrent downsampling jobs, but that would negatively impact the overall cluster utilization as well as throughput.
As an alternative, we searched for squeue commands in the cluster-tools API. We noticed the line `self.executor.get_pending_tasks()` in `file_wait_thread.py`. It seems like you already implemented a polling throttle there via the `interval` parameter, but never exposed that parameter to `ClusterExecutor` or `SlurmExecutor` to reduce the number of squeue calls.
Therefore, I would like to propose a change that lets `SlurmExecutor` users set the polling interval (in seconds) programmatically in their Python code or, alternatively, via an environment variable.
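To make the proposal concrete, usage could look roughly like the sketch below. Note that `job_check_interval` is a placeholder keyword that does not exist yet, and 30 seconds is just an example value; only `cluster_tools.get_executor` and the executor's standard `submit` interface are existing API:

```python
import os

import cluster_tools


def square(x):
    return x * x


# Option A (proposed, programmatic): pass the squeue polling interval when
# creating the executor. `job_check_interval` is a placeholder name for the
# exposed FileWaitThread interval in seconds; the final name is up to this PR.
with cluster_tools.get_executor("slurm", job_check_interval=30) as executor:
    futures = [executor.submit(square, i) for i in range(10)]
    results = [f.result() for f in futures]

# Option B (proposed, CLI-only setting): export the environment variable
# before launching the webknossos CLI, so no Python changes are needed.
os.environ["SLURM_QUEUE_CHECK_INTERVAL"] = "30"
```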
I am happy to make any additional changes to this pull request and add documentation if necessary.
Best wishes, Eric
Issues:
- Expose the `FileWaitThread`'s `interval` parameter to `SlurmExecutor`.
- Set the global variable `SLURM_QUEUE_CHECK_INTERVAL` via an environment variable to provide the same functionality in a CLI-only setting (see the sketch after this list).
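On the library side, the environment variable could be honored with something like the following minimal sketch (not the actual diff; the helper name and the 0.5 s fallback default are assumptions):

```python
import os


def resolve_check_interval(explicit_interval=None, default=0.5):
    """Hypothetical helper: pick the squeue polling interval in seconds.

    Preference order: explicitly passed value, then the
    SLURM_QUEUE_CHECK_INTERVAL environment variable, then the library default.
    """
    if explicit_interval is not None:
        return float(explicit_interval)
    env_value = os.environ.get("SLURM_QUEUE_CHECK_INTERVAL")
    if env_value is not None:
        return float(env_value)
    return default
```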
Todos:
Make sure to delete unnecessary points or to check all before merging:
- [ ] Updated Changelog
- [ ] Updated Documentation
- [ ] Added / Updated Tests
Hi @erjel,
thank you for your contribution! Before talking about your proposed solution, I would like to understand the problem a bit better.
We regularly run into problems with our SLURM cluster while running (very) large downsampling jobs on webknossos datasets, either via the CLI or via chunk-wise processing of large datasets with the cluster-tools Python API.
How many datasets do you downsample in parallel? There should only be one SlurmExecutor instance per dataset being downsampled and, therefore, only one polling party per dataset.
Technically we could downscale the number of concurrent downsampling jobs, but [...]
By "number of concurrent downsampling jobs" you mean number of datasets being conurrently downsampled, right?
Our cluster team traced the error down to the SLURM controller being overwhelmed by the number of squeue requests.
How many squeue requests are we talking about and what interval do you want to configure to mitigate the issue?
Thank you!