Intermittent - job state not updated

Open ymoisan opened this issue 3 years ago • 4 comments

I have observed this on a few occasions over the last few days, especially for long jobs submitted as a batch. The state sometimes remains PENDING and does not update to RUNNING.

The initial job state change updates fine:

Out[1]:
[SlurmJob<job_id=67172758_0, task_id=0, state="UNKNOWN">,
 SlurmJob<job_id=67172758_1, task_id=0, state="UNKNOWN">,
 SlurmJob<job_id=67172758_2, task_id=0, state="UNKNOWN">]

Out[2]:
[SlurmJob<job_id=67172758_0, task_id=0, state="RUNNING">,
 SlurmJob<job_id=67172758_1, task_id=0, state="PENDING">,
 SlurmJob<job_id=67172758_2, task_id=0, state="PENDING">]

Then

Out[8]:
[SlurmJob<job_id=67172758_0, task_id=0, state="COMPLETED">,
 SlurmJob<job_id=67172758_1, task_id=0, state="PENDING">,
 SlurmJob<job_id=67172758_2, task_id=0, state="PENDING">]

Job 67172758_1 stays pending even though I know that it is running.

Luckily, that job ran into an error and its status was updated, which triggered the next job's status to be updated too:

[SlurmJob<job_id=67172758_0, task_id=0, state="COMPLETED">,
 SlurmJob<job_id=67172758_1, task_id=0, state="OUT_OF_MEMORY">,
 SlurmJob<job_id=67172758_2, task_id=0, state="RUNNING">]

I have not validated that all statuses get updated at the end of the batch, but while the batch job is running some state changes are not reflected.

ymoisan avatar Apr 11 '22 18:04 ymoisan

Hi, thanks for reporting. Can you provide more context?

Do you have errors in the console mentioning sacct? What happens if you call sacct -j $JOBID directly from a terminal? What happens if you call job.get_info(mode="force")?
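
For the sacct check, a minimal Python sketch that runs the same query and returns one state per job, so it can be compared with what submitit prints. The helper names parse_sacct_states and query_sacct_states are hypothetical and not part of submitit. The sketch assumes sacct is on the PATH and uses its --format, --parsable2 and --noheader options.

```python
import subprocess
from typing import Dict


def parse_sacct_states(output: str) -> Dict[str, str]:
    """Map each job id to its state from `sacct --parsable2 --noheader` output."""
    states: Dict[str, str] = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        job_id, _, state = line.partition("|")
        # Skip job steps such as "67172758_0.batch"; keep only the job itself.
        if "." in job_id:
            continue
        # sacct may report e.g. "CANCELLED by 1234"; keep the first word only.
        words = state.split()
        states[job_id.strip()] = words[0] if words else "UNKNOWN"
    return states


def query_sacct_states(job_id: str) -> Dict[str, str]:
    """Run sacct for one job (or job array) and return {job_id: state}."""
    result = subprocess.run(
        ["sacct", "-j", job_id, "--format=JobID,State", "--parsable2", "--noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sacct_states(result.stdout)
```

Calling query_sacct_states("67172758") on the cluster should list the array tasks with the states sacct currently reports. Tasks that are still pending may be grouped by sacct under a single bracketed id.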

gwenzek avatar Apr 12 '22 08:04 gwenzek

Q1 No

Q2 Below is a series of screenshots with submitit jobs on the left and sacct on the right. The top image shows the beginning of the batch of 3 sub-jobs. I have consistently noticed that the switch from PENDING to RUNNING for the second job doesn't get triggered. I specifically refresh my submitit job list and nothing changes, even after sacct shows otherwise. The bottom image shows that the third job's status got updated to RUNNING, but only after an error happened in the second job. It's as though "COMPLETED" does not trigger the next job's status update. I could try to get my first job to fail and see whether the second job's status gets updated, but that's a pain. I'm also working remotely, so it's complicated to get a debugger working.

Q3 The version of the code I'm using already has return self.watcher.get_info(self.job_id, mode="force"). I also changed get_state in slurm.py to "force" instead of "standard".

I'm in SLURM 21.08.

HTH

[Image: submitit-1679 4 (screenshots of the submitit job list next to sacct output)]

ymoisan avatar Apr 12 '22 18:04 ymoisan

So screenshots 1 and 3 show up-to-date information, while screenshot 2 seems to show information that is a bit stale: job 0 is correctly marked as completed, job 2 is correctly marked as pending, and job 1 is wrongly marked as pending while it is running.

Note in particular that printing a job object doesn't force a refresh of the information. Only job.get_info does that. Could you share the code you use to refresh the state of the jobs?
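
A minimal sketch of the difference, assuming jobs is the list of submitit job objects printed above. forced_states is a hypothetical helper, not part of submitit. It relies only on job.job_id and job.get_info(mode="force"), and assumes the returned dict carries sacct's State field.

```python
from typing import Any, Dict, Iterable


def forced_states(jobs: Iterable[Any]) -> Dict[str, str]:
    """Query the scheduler for each job and return {job_id: state}."""
    states: Dict[str, str] = {}
    for job in jobs:
        # get_info(mode="force") requests fresh info instead of reusing a cached copy.
        info = job.get_info(mode="force")
        # Assumption: the info dict exposes the sacct "State" column.
        states[job.job_id] = info.get("State", "UNKNOWN")
    return states
```

Printing forced_states(jobs) instead of jobs should then show the refreshed states rather than whatever was cached when the objects were last updated.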

gwenzek avatar May 03 '22 09:05 gwenzek

I was just printing the job objects repeatedly. Checking the state property of each job does the trick. Still, it's funny that job 2 doesn't get updated as long as job 3 isn't launched. I guess you can close this issue.

Thank you.

ymoisan avatar May 11 '22 17:05 ymoisan