webknossos-libs
Expose squeue polling interval as SlurmExecutor parameter and allow setting via environment variable
Hi,
We regularly run into problems with our SLURM cluster while running (very) large downsampling jobs on webknossos datasets, either via the CLI or via chunk-wise processing of large datasets with the cluster-tools Python API. In particular, we get the error message:
error: slurm_receive_msgs: [[$hostname]:$port] failed: Socket timed out on send/recv operation
Our cluster team traced the error down to the SLURM controller being overwhelmed by the number of squeue requests. Technically we could downscale the number of concurrent downsampling jobs, but that would negatively impact the overall cluster utilization as well as throughput.
As an alternative, we searched for squeue commands in the cluster-tools API. We noticed the line `self.executor.get_pending_tasks()` in `file_wait_thread.py`. It seems like you already implemented a polling throttle there via the `interval` parameter, but never exposed that parameter to `ClusterExecutor` or `SlurmExecutor` to reduce the number of squeue calls.
Therefore, I would like to propose a change that lets `SlurmExecutor` users set the polling interval (in seconds) programmatically in their Python code or, alternatively, via an environment variable.
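To make the proposal concrete, usage could look roughly like the sketch below. Note that `job_check_interval` is a placeholder keyword that does not exist yet, and 30 seconds is just an example value; only `cluster_tools.get_executor` and the executor's standard `submit` interface are existing API:

```python
import os

import cluster_tools


def square(x):
    return x * x


# Option A (proposed, programmatic): pass the squeue polling interval when
# creating the executor. `job_check_interval` is a placeholder name for the
# exposed FileWaitThread interval in seconds; the final name is up to this PR.
with cluster_tools.get_executor("slurm", job_check_interval=30) as executor:
    futures = [executor.submit(square, i) for i in range(10)]
    results = [f.result() for f in futures]

# Option B (proposed, CLI-only setting): export the environment variable
# before launching the webknossos CLI, so no Python changes are needed.
os.environ["SLURM_QUEUE_CHECK_INTERVAL"] = "30"
```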
I am happy to make any additional changes to this pull request and add documentation if necessary.
Best wishes, Eric
Issues:
- Expose the `FileWaitThread`'s `interval` parameter to `SlurmExecutor`.
- Set the global variable `SLURM_QUEUE_CHECK_INTERVAL` via an environment variable to provide the same functionality in a CLI-only setting (see the sketch after this list).
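On the library side, the environment variable could be honored with something like the following minimal sketch (not the actual diff; the helper name and the 0.5 s fallback default are assumptions):

```python
import os


def resolve_check_interval(explicit_interval=None, default=0.5):
    """Hypothetical helper: pick the squeue polling interval in seconds.

    Preference order: explicitly passed value, then the
    SLURM_QUEUE_CHECK_INTERVAL environment variable, then the library default.
    """
    if explicit_interval is not None:
        return float(explicit_interval)
    env_value = os.environ.get("SLURM_QUEUE_CHECK_INTERVAL")
    if env_value is not None:
        return float(env_value)
    return default
```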
Todos:
Make sure to delete unnecessary points or to check all before merging:
- [ ] Updated Changelog
- [ ] Updated Documentation
- [ ] Added / Updated Tests
Hi @erjel,
thank you for your contribution! Before talking about your proposed solution, I would like to understand the problem a bit better.
We regularly run into problems with our SLURM cluster while running (very) large downsampling jobs on webknossos datasets, either via the CLI or via chunk-wise processing of large datasets with the cluster-tools Python API.
How many datasets do you downsample in parallel? There should only be one SlurmExecutor instance per dataset being downsampled and, therefore, only one polling party per dataset.
Technically we could downscale the number of concurrent downsampling jobs, but [...]
By "number of concurrent downsampling jobs" you mean number of datasets being conurrently downsampled, right?
Our cluster team traced the error down to the SLURM controller being overwhelmed by the number of squeue requests.
How many squeue requests are we talking about and what interval do you want to configure to mitigate the issue?
Thank you!