
Unable to load scheduler dashboard in SLURMRunner, but can in cluster

Open • gilmorethomas opened this issue 8 months ago • 1 comment

Describe the issue: Thanks for your time in advance. I have created a simple "hello world" example using both SLURMRunner and SLURMCluster in my environment. I prefer the SLURMRunner interface, since with SLURMCluster I effectively need to write wrappers around my jobs.

I submit my SLURMCluster script via sbatch (my login node cannot run the scheduler) to a worker node (node-01), which then launches additional jobs on my worker nodes (node[01-06]). In this setup I can open the scheduler dashboard, although job allocation behaves slightly oddly (not the point of this post; I need to look into that more).

When I use a SLURMRunner instead (same as this example: https://jobqueue.dask.org/en/stable/runners-overview.html), my jobs are allocated and run, but I cannot load the scheduler dashboard. I get a 404 Page Not Found both at the link reported by client.dashboard_link and at the address recorded in the scheduler.json file. This differs from what I see after the runner has shut down, where I get Connection Refused instead, so the scheduler's web server does appear to be up while the runner is active. Is this expected?
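To make the 404 versus Connection Refused distinction reproducible, here is a minimal sketch of how the two cases can be told apart from a node that can reach the scheduler. It uses only the Python standard library; `probe` is a hypothetical helper name, not part of Dask or dask-jobqueue, and proxies are deliberately bypassed on the assumption that the scheduler is reached directly.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return a short label describing how `url` responds (hypothetical helper)."""
    # Assumption: the scheduler is reached directly, so ignore any
    # http_proxy/https_proxy settings in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return f"http {resp.status}"
    except urllib.error.HTTPError as exc:
        # A server answered, but with an error status such as 404.
        return f"http {exc.code}"
    except urllib.error.URLError as exc:
        # No HTTP response at all, e.g. connection refused or DNS failure.
        return f"unreachable: {exc.reason}"
    except OSError as exc:
        # Lower-level failures such as a socket timeout.
        return f"unreachable: {exc}"
```

Calling `probe(client.dashboard_link)` while the runner is active, and again after it has shut down, should report something like `http 404` in the first case and `unreachable: ...` in the second. Probing other paths on the same host and port (for example `/health`, assuming the installed version of distributed serves it) may help show whether only the dashboard route is missing.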

Minimal Complete Verifiable Example: Using the SLURMRunner in my multi-node environment

# Put your MCVE code here

Anything else we need to know?:

Environment:

  • Dask version: 2023.6.0
  • Python version: 3.11.4
  • Operating System: CentOS 7
  • Install method: pip

gilmorethomas • Feb 14 '25 22:02