SlurmError: sbatch slurm.job
Hi :wave:
I'm currently trying to run HPC-rocket to submit a job from my local machine (before integrating it in a GitLab CI/CD pipeline).
I created a simple slurm.job that only prints the hostname, to check whether the job runs properly.
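For reference, a minimal sketch of such a job script (the job name and log path are assumptions, chosen to match the `slurm-hpc-rocket.log` entry in the `clean` section of the config below):

```shell
#!/bin/bash
# Minimal test job: print the compute node's hostname.
# The directives below are illustrative assumptions, not taken from the issue.
#SBATCH --job-name=hpc-rocket-test
#SBATCH --output=slurm-hpc-rocket.log

hostname
```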
Here is my configuration:
```yaml
host: ...
user: ...
private_keyfile: ...

copy:
  - from: slurm.job
    to: slurm.job
    overwrite: true

clean:
  - slurm.job
  - slurm-hpc-rocket.log

sbatch: slurm.job
```
When I run the following command:

```
hpc-rocket launch --watch config.yml
```

I get the following output:

```
ℹ Copying files...
✔ Done
❌ SlurmError: sbatch slurm.job
[== ]
```
Since there are no additional logs, and the job runs fine when I submit it manually on the cluster, do you have any idea what the problem could be here?
Thanks!
Hi, sorry for the late reply, I just got back from a vacation. This error usually happens when HPC-rocket fails to launch the Slurm job entirely. Can you show me the content of the Slurm job's log file?
Sorry for the delay, I was also away for the last two weeks.
I don't have a log from the job.
Running sacct shows that the job was never even submitted.
Could it be due to the fact that Slurm is actually a module on our cluster, meaning it may not be loaded at the start of the session, depending on the type of session HPC-rocket uses?
EDIT:
I just tried changing the command to something else, and it looks like none of the commands I tried completes without raising an error. So, for some reason, the call to `cmd.wait_until_exit()` always returns a non-zero exit code.
Since HPC-rocket manages to copy the files to the remote host, it does not look like a connection issue...
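To illustrate the module hypothesis from above: non-interactive SSH sessions typically run a non-login shell, so PATH entries added by module loads in login startup files may be missing. A quick local sketch of the difference (the remote check in the comment uses a placeholder hostname):

```shell
# Compare PATH in a non-login shell (similar to what a non-interactive
# SSH exec channel sees) versus a login shell (where modules may be loaded).
noninteractive_path=$(bash -c 'echo $PATH')
login_path=$(bash -lc 'echo $PATH')
echo "non-login PATH: $noninteractive_path"
echo "login PATH:     $login_path"

# On the cluster, the analogous check would be (user@host is a placeholder):
#   ssh user@host 'command -v sbatch'
# If that prints nothing while sbatch works in an interactive session,
# the module-loading theory would explain the failure.
```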