dask-jobqueue
mem error
Hello everyone,
I am trying to run Dask on our cluster, but the following code produces an error that I don't really know how to solve:
```python
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=24,
    job_extra_directives=['--mem-per-cpu=500MB'],  # Request 500MB per CPU
    queue="regular",
    account="user",
    memory="12 GB",
)
cluster.scale(10)
```
```
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x2b83c3fcf7a0>>, <Task finished name='Task-1267' coro=<SpecCluster._correct_state_internal() done, defined at /cluster/home/user/miniconda/lib/python3.12/site-packages/distributed/deploy/spec.py:346> exception=RuntimeError('Command exited with non-zero exit code.\nExit code: 1\nCommand:\nsbatch /tmp/tmp28aolw5t.sh\nstdout:\n\nstderr:\nsbatch: error: lua: Requesting memory by node is not supported. Use --mem-per-cpu.\nsbatch: error: cli_filter plugin terminated with error\n\n')>)
Traceback (most recent call last):
  File "/cluster/home/user/miniconda/lib/python3.12/site-packages/tornado/ioloop.py", line 738, in _run_callback
    ret = callback()
          ^^^^^^^^^^
  File "/cluster/home/user/miniconda/lib/python3.12/site-packages/tornado/ioloop.py", line 762, in _discard_future_result
    future.result()
  File "/cluster/home/user/miniconda/lib/python3.12/site-packages/distributed/deploy/spec.py", line 390, in _correct_state_internal
    await asyncio.gather(*worker_futs)
  File "/cluster/home/user/miniconda/lib/python3.12/asyncio/tasks.py", line 684, in _wrap_awaitable
    return await awaitable
           ^^^^^^^^^^^^^^^
  File "/cluster/home/user/miniconda/lib/python3.12/site-packages/distributed/deploy/spec.py", line 74, in _
    await self.start()
  File "/cluster/home/user/miniconda/lib/python3.12/site-packages/dask_jobqueue/core.py", line 426, in start
    out = await self._submit_job(fn)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/user/miniconda/lib/python3.12/site-packages/dask_jobqueue/core.py", line 409, in _submit_job
    return await self._call(shlex.split(self.submit_command) + [script_filename])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/user/miniconda/lib/python3.12/site-packages/dask_jobqueue/core.py", line 505, in _call
    raise RuntimeError(
RuntimeError: Command exited with non-zero exit code.
Exit code: 1
Command:
sbatch /tmp/tmp28aolw5t.sh
stdout:

stderr:
sbatch: error: lua: Requesting memory by node is not supported. Use --mem-per-cpu.
sbatch: error: cli_filter plugin terminated with error
```
Hi @MaAl13,
Your cluster is configured to refuse submission scripts that use the `--mem=` sbatch option. You have already added `--mem-per-cpu` through `job_extra_directives`; you should also tell dask-jobqueue to skip the `--mem` directive it generates by default, see the `job_directives_skip` kwarg.
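For reference, a minimal sketch of what that could look like, untested and reusing the queue/account placeholders from your snippet:

```python
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=24,
    memory="12 GB",                                # still used by Dask to size the workers
    job_extra_directives=["--mem-per-cpu=500MB"],  # the memory request your SLURM site accepts
    job_directives_skip=["--mem"],                 # drop the generated "#SBATCH --mem=..." line
    queue="regular",
    account="user",
)
cluster.scale(10)
```

You can inspect the script dask-jobqueue will submit with `print(cluster.job_script())` before calling `scale`, to confirm the `#SBATCH --mem` line is gone.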