submitit
Too many sacct requests for batched tasks
I need to submit thousands of tasks. Because of the maximum job array size, the tasks are divided into groups, with one job array per group:
submitted_jobs = []
for group_idx, group_jobs_to_run in enumerate(groups):
    with executor.batch():  # one job array per group
        for idx in group_jobs_to_run:  # idx is the user-defined index, not the Slurm job ID
            task_args, task_kwargs = get_task_args(idx)
            job = executor.submit(slurm_tasks, *task_args, **task_kwargs)
            submitted_jobs.append(job)
# wait for results
_ = [job.wait() for job in submitted_jobs]
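For context, here is a minimal sketch of how the groups could be built. The snippet above does not show this step, so the helper name `split_into_groups` and the parameter `max_array_size` are hypothetical; the limit itself would come from the cluster's configuration (for example Slurm's `MaxArraySize` setting).

```python
from typing import List, Sequence


def split_into_groups(task_ids: Sequence[int], max_array_size: int) -> List[List[int]]:
    """Split user-defined task indices into consecutive chunks.

    Hypothetical helper: each chunk holds at most max_array_size indices,
    so each chunk can be submitted as a single job array.
    """
    if max_array_size < 1:
        raise ValueError("max_array_size must be at least 1")
    return [
        list(task_ids[start:start + max_array_size])
        for start in range(0, len(task_ids), max_array_size)
    ]
```

The result can be passed as `groups` to the submission loop; the last group is simply shorter when the number of tasks is not a multiple of the limit.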
I use job.wait() to wait for all tasks to complete. However, this usually triggers the per-user RPC limit on my Slurm cluster and sometimes even stalls the whole cluster, and I get this warning:
sacct: Failed to load running jobs from slurmctld
submitit WARNING (2024-01-10 22:07:41,285) - Call #6 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2024-01-10 22:07:41,285) - Call #6 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
sacct: Failed to load running jobs from slurmctld
submitit WARNING (2024-01-10 22:08:51,594) - Call #7 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2024-01-10 22:08:51,594) - Call #7 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
sacct: Failed to load running jobs from slurmctld
submitit WARNING (2024-01-10 22:15:56,718) - Call #9 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2024-01-10 22:15:56,718) - Call #9 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
sacct: Failed to load running jobs from slurmctld
submitit WARNING (2024-01-10 22:25:23,108) - Call #10 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2024-01-10 22:25:23,108) - Call #10 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
sacct: Failed to load running jobs from slurmctld
submitit WARNING (2024-01-10 23:15:34,829) - Call #15 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2024-01-10 23:15:34,829) - Call #15 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
sacct: Failed to load running jobs from slurmctld
submitit WARNING (2024-01-10 23:25:36,817) - Call #16 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2024-01-10 23:25:36,817) - Call #16 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '2925170', '-j', '2925216', '-j', '2925161', '-j', '2925152', '-j', '2925198', '-j', '2925207', '-j', '2925189', '-j', '2925180']' returned non-zero exit status 1., status may be inaccurate.
It seems that submitit issues too many duplicate requests at the same time, which exceeds the per-user RPC limit on my cluster. The job.wait() method is expected to block, so it should not be requesting the tasks' states in parallel, and I'm not sure what mechanism in submitit causes the duplicated Slurm calls.
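One possible workaround is to replace the per-job wait with a single slow polling loop. The sketch below is not submitit's own mechanism: the names `wait_all` and `poll_interval_s` are hypothetical, and it assumes only that each job object has a `done()` method, as submitit jobs do. Whether this actually reduces the number of sacct calls depends on how submitit schedules its status checks internally, which I have not verified.

```python
import time
from typing import Callable, Iterable, Protocol


class SupportsDone(Protocol):
    """Anything with a done() method, e.g. a submitit job."""

    def done(self) -> bool: ...


def wait_all(
    jobs: Iterable[SupportsDone],
    poll_interval_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until every job reports done; return the number of polling rounds.

    Each round checks the remaining jobs once, then sleeps for
    poll_interval_s seconds if any are still pending.
    """
    pending = list(jobs)
    rounds = 0
    while pending:
        rounds += 1
        # Keep only the jobs that have not finished yet.
        pending = [job for job in pending if not job.done()]
        if pending:
            sleep(poll_interval_s)
    return rounds
```

It would be called as `wait_all(submitted_jobs, poll_interval_s=120)` in place of the list comprehension over `job.wait()`. The `sleep` argument exists only so the loop can be exercised without real delays.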