pyslurm
Make slurmdb_jobs().get take a list of usernames as an argument
Details
- Slurm Version: 19.05.4
- Python Version: 2.7.12
- Cython Version: 0.29.14
- PySlurm Branch: 19-05-0
- Linux Distribution: centos-7
Issue
The get method of slurmdb_jobs takes a very long time to run. The following record was generated by cProfile; the call took 691 seconds to complete.
1 691.204 691.204 691.204 691.204 {method 'get' of 'pyslurm.pyslurm.slurmdb_jobs' objects}
The equivalent sacct command, by contrast, took under 0.3 seconds:
time sacct -S 2019-09-01
real 0m0.282s
user 0m0.054s
sys 0m0.011s
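For anyone who wants to reproduce a measurement like the cProfile record above, here is a minimal sketch. It assumes Python 3 (the report itself used Python 2.7.12), and `slow_stand_in` is a hypothetical placeholder for the real `slurmdb_jobs().get(...)` call, so the snippet runs without a Slurm installation.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chain is listed first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


def slow_stand_in(n):
    # Hypothetical placeholder for pyslurm.slurmdb_jobs().get(starttime=...).
    return sum(i * i for i in range(n))


result, report = profile_call(slow_stand_in, 1000)
print(report)
```

With pyslurm installed, the placeholder would be swapped for the bound method, for example `profile_call(jobs.get, starttime='2019-09-01')`, where `jobs` is a `slurmdb_jobs` instance.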
How many jobs do you have?
You can also try using line_profiler so we can narrow down which lines are the slowest
sacct returns about 1800 jobs. I think the get method of slurmdb_jobs returns all jobs from all users, which might be the reason it is so slow.
$ sacct -S 2019-09-01 | wc -l
1812
1812 is not that much. I'm curious why. If you can, I'd try the line_profiler suggested above if you don't mind.
OK, slurmdb_jobs returns about 700k jobs:
>>> h = hist.get(starttime='2019-09-01')
>>> len(h)
701175
Here is the output from line_profiler:
Timer unit: 1e-06 s
Total time: 361.231 s
File: pyslurm/pyslurm.pyx
Function: get at line 5316
Line # Hits Time Per Hit % Time Line Contents
==============================================================
5316 def get(self, jobids=[], starttime=0, endtime=0):
5317 u"""Get Slurmdb information about some jobs.
5318
5319 Input formats for start and end times:
5320 * today or tomorrow
5321 * midnight, noon, teatime (4PM)
5322 * HH:MM [AM|PM]
5323 * MMDDYY or MM/DD/YY or MM.DD.YY
5324 * YYYY-MM-DD[THH[:MM[:SS]]]
5325 * now + count [minutes | hours | days | weeks]
5326 *
5327 Invalid time input results in message to stderr and return value of
5328 zero.
5329
5330 :param jobids: Ids of the jobs to search. Defaults to all jobs.
5331 :param starttime: Select jobs eligible after this timestamp
5332 :param endtime: Select jobs eligible before this timestamp
5333 :returns: Dictionary whose key is the JOBS ID
5334 :rtype: `dict`
5335 """
5336 cdef:
5337 1 4.0 4.0 0.0 int i = 0
5338 1 0.0 0.0 0.0 int listNum = 0
5339 1 0.0 0.0 0.0 int apiError = 0
5340 1 0.0 0.0 0.0 dict J_dict = {}
5341 slurm.List JOBSList
5342 1 0.0 0.0 0.0 slurm.ListIterator iters = NULL
5343
5344 1 1.0 1.0 0.0 if jobids:
5345 self.job_cond.step_list = slurm.slurm_list_create(NULL)
5346 for _jobid in jobids:
5347 if isinstance(_jobid, int) or isinstance(_jobid, long):
5348 _jobid = str(_jobid).encode("UTF-8")
5349 else:
5350 _jobid = _jobid.encode("UTF-8")
5351 slurm.slurm_addto_step_list(self.job_cond.step_list, _jobid)
5352
5353 1 1.0 1.0 0.0 if starttime:
5354 1 38.0 38.0 0.0 self.job_cond.usage_start = slurm.slurm_parse_time(starttime, 1)
5355 1 3.0 3.0 0.0 errno = slurm.slurm_get_errno()
5356 1 1.0 1.0 0.0 if errno == slurm.ESLURM_INVALID_TIME_VALUE:
5357 raise ValueError(slurm.slurm_strerror(errno), errno)
5358
5359 1 1.0 1.0 0.0 if endtime:
5360 self.job_cond.usage_end = slurm.slurm_parse_time(endtime, 1)
5361 errno = slurm.slurm_get_errno()
5362 if errno == slurm.ESLURM_INVALID_TIME_VALUE:
5363 raise ValueError(slurm.slurm_strerror(errno), errno)
5364
5365 1 339186305.0 339186305.0 93.9 JOBSList = slurm.slurmdb_jobs_get(self.db_conn, self.job_cond)
5366
5367 1 5.0 5.0 0.0 if JOBSList is NULL:
5368 apiError = slurm.slurm_get_errno()
5369 raise ValueError(slurm.slurm_strerror(apiError), apiError)
5370
5371 1 2.0 2.0 0.0 listNum = slurm.slurm_list_count(JOBSList)
5372 1 3.0 3.0 0.0 iters = slurm.slurm_list_iterator_create(JOBSList)
5373
5374 1 1.0 1.0 0.0 for i in range(listNum):
5375 701797 289661.0 0.4 0.1 job = <slurm.slurmdb_job_rec_t *>slurm.slurm_list_next(iters)
5376
5377 701797 363685.0 0.5 0.1 JOBS_info = {}
5378 701797 222075.0 0.3 0.1 if job is not NULL:
5379 701797 298015.0 0.4 0.1 jobid = job.jobid
5380 701797 754165.0 1.1 0.2 JOBS_info[u'account'] = slurm.stringOrNone(job.account, '')
5381 701797 525496.0 0.7 0.1 JOBS_info[u'allocated_gres'] = slurm.stringOrNone(job.alloc_gres, '')
5382 701797 240712.0 0.3 0.1 JOBS_info[u'allocated_nodes'] = job.alloc_nodes
5383 701797 240928.0 0.3 0.1 JOBS_info[u'array_job_id'] = job.array_job_id
5384 701797 243614.0 0.3 0.1 JOBS_info[u'array_max_tasks'] = job.array_max_tasks
5385 701797 330985.0 0.5 0.1 JOBS_info[u'array_task_id'] = job.array_task_id
5386 701797 438708.0 0.6 0.1 JOBS_info[u'array_task_str'] = slurm.stringOrNone(job.array_task_str, '')
5387 701797 241542.0 0.3 0.1 JOBS_info[u'associd'] = job.associd
5388 701797 434178.0 0.6 0.1 JOBS_info[u'blockid'] = slurm.stringOrNone(job.blockid, '')
5389 701797 594449.0 0.8 0.2 JOBS_info[u'cluster'] = slurm.stringOrNone(job.cluster, '')
5390 701797 239231.0 0.3 0.1 JOBS_info[u'derived_ec'] = job.derived_ec
5391 701797 428321.0 0.6 0.1 JOBS_info[u'derived_es'] = slurm.stringOrNone(job.derived_es, '')
5392 701797 245760.0 0.4 0.1 JOBS_info[u'elapsed'] = job.elapsed
5393 701797 245991.0 0.4 0.1 JOBS_info[u'eligible'] = job.eligible
5394 701797 255902.0 0.4 0.1 JOBS_info[u'end'] = job.end
5395 701797 262542.0 0.4 0.1 JOBS_info[u'exit_code'] = job.exitcode
5396 701797 237194.0 0.3 0.1 JOBS_info[u'gid'] = job.gid
5397 701797 255011.0 0.4 0.1 JOBS_info[u'jobid'] = job.jobid
5398 701797 651020.0 0.9 0.2 JOBS_info[u'jobname'] = slurm.stringOrNone(job.jobname, '')
5399 701797 247941.0 0.4 0.1 JOBS_info[u'lft'] = job.lft
5400 701797 598449.0 0.9 0.2 JOBS_info[u'partition'] = slurm.stringOrNone(job.partition, '')
5401 701797 1851374.0 2.6 0.5 JOBS_info[u'nodes'] = slurm.stringOrNone(job.nodes, '')
5402 701797 278869.0 0.4 0.1 JOBS_info[u'priority'] = job.priority
5403 701797 256016.0 0.4 0.1 JOBS_info[u'qosid'] = job.qosid
5404 701797 259452.0 0.4 0.1 JOBS_info[u'req_cpus'] = job.req_cpus
5405 701797 546668.0 0.8 0.2 JOBS_info[u'req_gres'] = slurm.stringOrNone(job.req_gres, '')
5406 701797 242492.0 0.3 0.1 JOBS_info[u'req_mem'] = job.req_mem
5407 701797 256325.0 0.4 0.1 JOBS_info[u'requid'] = job.requid
5408 701797 249375.0 0.4 0.1 JOBS_info[u'resvid'] = job.resvid
5409 701797 440921.0 0.6 0.1 JOBS_info[u'resv_name'] = slurm.stringOrNone(job.resv_name,'')
5410 701797 249159.0 0.4 0.1 JOBS_info[u'show_full'] = job.show_full
5411 701797 239328.0 0.3 0.1 JOBS_info[u'start'] = job.start
5412 701797 257794.0 0.4 0.1 JOBS_info[u'state'] = job.state
5413 701797 283257.0 0.4 0.1 JOBS_info[u'state_str'] = slurm.slurm_job_state_string(job.state)
5414
5415 701797 255687.0 0.4 0.1 JOBS_info[u'stat_actual_cpufreq'] = job.stats.act_cpufreq
5416
5417 701797 238855.0 0.3 0.1 JOBS_info[u'steps'] = "Not filled, string should be handled"
5418 701797 268942.0 0.4 0.1 JOBS_info[u'submit'] = job.submit
5419 701797 250232.0 0.4 0.1 JOBS_info[u'suspended'] = job.suspended
5420 701797 251090.0 0.4 0.1 JOBS_info[u'sys_cpu_sec'] = job.sys_cpu_sec
5421 701797 238501.0 0.3 0.1 JOBS_info[u'sys_cpu_usec'] = job.sys_cpu_usec
5422 701797 255405.0 0.4 0.1 JOBS_info[u'timelimit'] = job.timelimit
5423 701797 243494.0 0.3 0.1 JOBS_info[u'tot_cpu_sec'] = job.tot_cpu_sec
5424 701797 242152.0 0.3 0.1 JOBS_info[u'tot_cpu_usec'] = job.tot_cpu_usec
5425 701797 244246.0 0.3 0.1 JOBS_info[u'track_steps'] = job.track_steps
5426 701797 954238.0 1.4 0.3 JOBS_info[u'tres_alloc_str'] = slurm.stringOrNone(job.tres_alloc_str,'')
5427 701797 609457.0 0.9 0.2 JOBS_info[u'tres_req_str'] = slurm.stringOrNone(job.tres_req_str,'')
5428 701797 250783.0 0.4 0.1 JOBS_info[u'uid'] = job.uid
5429 701797 435276.0 0.6 0.1 JOBS_info[u'used_gres'] = slurm.stringOrNone(job.used_gres, '')
5430 701797 570812.0 0.8 0.2 JOBS_info[u'user'] = slurm.stringOrNone(job.user,'')
5431 701797 249176.0 0.4 0.1 JOBS_info[u'user_cpu_sec'] = job.user_cpu_sec
5432 701797 248790.0 0.4 0.1 JOBS_info[u'user_cpu_sec'] = job.user_cpu_usec
5433 701797 523683.0 0.7 0.1 JOBS_info[u'wckey'] = slurm.stringOrNone(job.wckey, '')
5434 701797 241348.0 0.3 0.1 JOBS_info[u'wckeyid'] = job.wckeyid
5435 701797 264793.0 0.4 0.1 J_dict[jobid] = JOBS_info
5436
5437 1 4.0 4.0 0.0 slurm.slurm_list_iterator_destroy(iters)
5438 1 910580.0 910580.0 0.3 slurm.slurm_list_destroy(JOBSList)
5439 1 2.0 2.0 0.0 return J_dict
Thank you for doing this!
It looks like the DB call here is the culprit:
5365 1 339186305.0 339186305.0 93.9 JOBSList = slurm.slurmdb_jobs_get(self.db_conn, self.job_cond)
How long does the exact same query of 700K+ job records take using sacct?
The command that generates the same records should be:
time sacct -a -S 2019-09-01
Sorry, I used the wrong command in my previous comments. I should have specified the -a option, which displays jobs from all users.
The command takes:
real 8m51.382s
user 0m50.494s
sys 0m18.220s
So sacct does not really provide a speedup compared to the DB call.
I think it would be great if slurmdb_jobs().get could take a list of usernames as an argument (like the -u option in sacct), since users will most likely focus on their own job history.
Thank you for helping
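Until a username filter exists in the query itself, the result can only be narrowed after the fact. The sketch below filters the returned dictionary on the `user` key shown in the profiler listing. Note that this does not avoid the slow `slurmdb_jobs_get` call, which is where nearly all of the time goes; it only illustrates the shape of the requested feature. The function name and the sample records are made up for illustration.

```python
def filter_jobs_by_user(jobs, usernames):
    """Keep only the job records whose 'user' field is in usernames."""
    wanted = set(usernames)
    return {
        jobid: info
        for jobid, info in jobs.items()
        if info.get("user") in wanted
    }


# Hand-made records that mimic the dictionary shape built by get().
sample_jobs = {
    101: {"jobid": 101, "user": "alice", "state_str": "COMPLETED"},
    102: {"jobid": 102, "user": "bob", "state_str": "FAILED"},
    103: {"jobid": 103, "user": "alice", "state_str": "RUNNING"},
}

mine = filter_jobs_by_user(sample_jobs, ["alice"])
print(sorted(mine))  # job ids that belong to alice
```

A real speedup needs the filter applied on the database side, which is what the -u option of sacct does.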
@giovtorres Any movement on this? Is it still required? It does make sense, but I would need to review the sacct code.
Providing a list of users is now supported via the new API for querying database jobs. The new classes related to this are pyslurm.db.Job, pyslurm.db.Jobs, and pyslurm.db.JobSearchFilter.
Check out the example provided here
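As a rough sketch of how the classes named above might be combined: the keyword names `users`, `start_time`, and `end_time`, and the `Jobs.load` call, are assumptions about the new API rather than something confirmed in this thread, and the class name may differ between pyslurm releases, so treat the linked example as authoritative. The helper names are hypothetical.

```python
def build_filter_kwargs(usernames, start_time=None, end_time=None):
    """Collect the filter options, leaving out any that were not given."""
    kwargs = {"users": list(usernames)}
    if start_time is not None:
        kwargs["start_time"] = start_time
    if end_time is not None:
        kwargs["end_time"] = end_time
    return kwargs


def load_jobs_for_users(usernames, start_time=None, end_time=None):
    """Query the accounting database for the given users' jobs."""
    # Imported lazily so this module loads on machines without pyslurm.
    import pyslurm

    kwargs = build_filter_kwargs(usernames, start_time, end_time)
    # Assumption: JobSearchFilter accepts these keyword names and
    # Jobs.load takes the filter object as its argument.
    db_filter = pyslurm.db.JobSearchFilter(**kwargs)
    return pyslurm.db.Jobs.load(db_filter)


print(build_filter_kwargs(["alice"], start_time="2019-09-01"))
```

On a cluster with slurmdbd reachable, `load_jobs_for_users(["alice"], start_time="2019-09-01")` would be the rough counterpart of `sacct -u alice -S 2019-09-01`.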