
mem error

Open · MaAl13 opened this issue 1 year ago • 1 comment

Hello everyone,

I am trying to run Dask on our cluster; however, the following code produces an error I don't really know how to solve:

from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(
    cores=24,
    job_extra_directives=["--mem-per-cpu=500MB"],  # Request 500MB per CPU
    queue="regular",
    account="user",
    memory="12 GB",
)

cluster.scale(10)
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x2b83c3fcf7a0>>, <Task finished name='Task-1267' coro=<SpecCluster._correct_state_internal() done, defined at /cluster/home/user/miniconda/lib/python3.12/site-packages/distributed/deploy/spec.py:346> exception=RuntimeError('Command exited with non-zero exit code. ...')>)
Traceback (most recent call last):
  File "/cluster/home/user/miniconda/lib/python3.12/site-packages/tornado/ioloop.py", line 738, in _run_callback
    ret = callback()
          ^^^^^^^^^^
  File "/cluster/home/user/miniconda/lib/python3.12/site-packages/tornado/ioloop.py", line 762, in _discard_future_result
    future.result()
  File "/cluster/home/user/miniconda/lib/python3.12/site-packages/distributed/deploy/spec.py", line 390, in _correct_state_internal
    await asyncio.gather(*worker_futs)
  File "/cluster/home/user/miniconda/lib/python3.12/asyncio/tasks.py", line 684, in _wrap_awaitable
    return await awaitable
           ^^^^^^^^^^^^^^^
  File "/cluster/home/user/miniconda/lib/python3.12/site-packages/distributed/deploy/spec.py", line 74, in _
    await self.start()
  File "/cluster/home/user/miniconda/lib/python3.12/site-packages/dask_jobqueue/core.py", line 426, in start
    out = await self._submit_job(fn)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/user/miniconda/lib/python3.12/site-packages/dask_jobqueue/core.py", line 409, in _submit_job
    return await self._call(shlex.split(self.submit_command) + [script_filename])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/user/miniconda/lib/python3.12/site-packages/dask_jobqueue/core.py", line 505, in _call
    raise RuntimeError(
RuntimeError: Command exited with non-zero exit code.
Exit code: 1
Command:
sbatch /tmp/tmp28aolw5t.sh
stdout:

stderr:
sbatch: error: lua: Requesting memory by node is not supported. Use --mem-per-cpu.
sbatch: error: cli_filter plugin terminated with error

MaAl13 · Feb 29, 2024

Hi @MaAl13,

Your cluster is configured to reject submission scripts that use the --mem= sbatch option. You have already added --mem-per-cpu; you should also skip the --mem directive that dask-jobqueue generates by default, see the job_directives_skip kwarg.
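For example, here is a minimal sketch of the adjusted constructor (the queue and account values are placeholders taken from your snippet; "--mem=" is used as the skip string so it matches the generated --mem directive but not your --mem-per-cpu extra directive):

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=24,
    job_extra_directives=["--mem-per-cpu=500MB"],  # request memory per CPU, as the cli_filter requires
    job_directives_skip=["--mem="],                # drop the generated "#SBATCH --mem=..." line
    queue="regular",
    account="user",
    memory="12 GB",  # still needed so Dask knows each worker's memory limit
)

cluster.scale(10)

You can also call print(cluster.job_script()) before scaling to confirm that the #SBATCH --mem= line is gone from the generated submission script.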

guillaumeeb · Mar 6, 2024