Make slurmdb_jobs().get take a list of usernames as an argument

Open 0luhancheng0 opened this issue 5 years ago • 9 comments

Details

  • Slurm Version:19.05.4
  • Python Version:2.7.12
  • Cython Version:0.29.14
  • PySlurm Branch:19-05-0
  • Linux Distribution:centos-7

Issue

The get method of slurmdb_jobs takes a very long time to run. The following record was generated by cProfile; the call took 691 seconds to complete.

        1  691.204  691.204  691.204  691.204 {method 'get' of 'pyslurm.pyslurm.slurmdb_jobs' objects}

The equivalent sacct command, by contrast, took only about 0.3 seconds:

time sacct -S 2019-09-01
real	0m0.282s
user	0m0.054s
sys	0m0.011s
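
For anyone wanting to reproduce a record like the cProfile line above, a minimal sketch follows. It assumes Python 3 and uses a hypothetical stand-in function, slow_get, so it runs without a Slurm installation; the real measurement would wrap the slurmdb_jobs().get call instead.

```python
import cProfile
import io
import pstats
import time


def slow_get():
    """Hypothetical stand-in for pyslurm.slurmdb_jobs().get(starttime=...)."""
    time.sleep(0.05)
    return {}


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the slowest call chain is listed first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


result, report = profile_call(slow_get)
print(report)
```

The report uses the same ncalls / tottime / percall / cumtime columns as the record quoted above.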

0luhancheng0 avatar Dec 05 '19 03:12 0luhancheng0

How many jobs do you have?

giovtorres avatar Dec 05 '19 03:12 giovtorres

You can also try using line_profiler so we can narrow down which lines are the slowest

giovtorres avatar Dec 05 '19 03:12 giovtorres

sacct returns about 1800 jobs. I think the get method in slurmdb_jobs returns all jobs from all users, which might be why it is so slow.

$ sacct -S 2019-09-01 | wc -l
1812

0luhancheng0 avatar Dec 05 '19 03:12 0luhancheng0

1812 is not that many, so I'm curious why it is slow. If you don't mind, please try the line_profiler suggestion above.

giovtorres avatar Dec 05 '19 03:12 giovtorres

OK, slurmdb_jobs returns about 700k jobs:

>>> h = hist.get(starttime='2019-09-01')
>>> len(h)
701175

0luhancheng0 avatar Dec 05 '19 03:12 0luhancheng0

Here is the output from line_profiler:

Timer unit: 1e-06 s

Total time: 361.231 s
File: pyslurm/pyslurm.pyx
Function: get at line 5316

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  5316                                               def get(self, jobids=[], starttime=0, endtime=0):
  5317                                                   u"""Get Slurmdb information about some jobs.
  5318                                           
  5319                                                   Input formats for start and end times:
  5320                                                   *   today or tomorrow
  5321                                                   *   midnight, noon, teatime (4PM)
  5322                                                   *   HH:MM [AM|PM]
  5323                                                   *   MMDDYY or MM/DD/YY or MM.DD.YY
  5324                                                   *   YYYY-MM-DD[THH[:MM[:SS]]]
  5325                                                   *   now + count [minutes | hours | days | weeks]
  5326                                                   *
  5327                                                   Invalid time input results in message to stderr and return value of
  5328                                                   zero.
  5329                                           
  5330                                                   :param jobids: Ids of the jobs to search. Defaults to all jobs.
  5331                                                   :param starttime: Select jobs eligible after this timestamp
  5332                                                   :param endtime: Select jobs eligible before this timestamp
  5333                                                   :returns: Dictionary whose key is the JOBS ID
  5334                                                   :rtype: `dict`
  5335                                                   """
  5336                                                   cdef:
  5337         1          4.0      4.0      0.0              int i = 0
  5338         1          0.0      0.0      0.0              int listNum = 0
  5339         1          0.0      0.0      0.0              int apiError = 0
  5340         1          0.0      0.0      0.0              dict J_dict = {}
  5341                                                       slurm.List JOBSList
  5342         1          0.0      0.0      0.0              slurm.ListIterator iters = NULL
  5343                                           
  5344         1          1.0      1.0      0.0          if jobids:
  5345                                                       self.job_cond.step_list = slurm.slurm_list_create(NULL)
  5346                                                       for _jobid in jobids:
  5347                                                           if isinstance(_jobid, int) or isinstance(_jobid, long):
  5348                                                               _jobid = str(_jobid).encode("UTF-8")
  5349                                                           else:
  5350                                                               _jobid = _jobid.encode("UTF-8")
  5351                                                           slurm.slurm_addto_step_list(self.job_cond.step_list, _jobid)
  5352                                           
  5353         1          1.0      1.0      0.0          if starttime:
  5354         1         38.0     38.0      0.0              self.job_cond.usage_start = slurm.slurm_parse_time(starttime, 1)
  5355         1          3.0      3.0      0.0              errno = slurm.slurm_get_errno()
  5356         1          1.0      1.0      0.0              if errno == slurm.ESLURM_INVALID_TIME_VALUE:
  5357                                                           raise ValueError(slurm.slurm_strerror(errno), errno)
  5358                                           
  5359         1          1.0      1.0      0.0          if endtime:
  5360                                                       self.job_cond.usage_end = slurm.slurm_parse_time(endtime, 1)
  5361                                                       errno = slurm.slurm_get_errno()
  5362                                                       if errno == slurm.ESLURM_INVALID_TIME_VALUE:
  5363                                                           raise ValueError(slurm.slurm_strerror(errno), errno)
  5364                                           
  5365         1  339186305.0 339186305.0     93.9          JOBSList = slurm.slurmdb_jobs_get(self.db_conn, self.job_cond)
  5366                                           
  5367         1          5.0      5.0      0.0          if JOBSList is NULL:
  5368                                                       apiError = slurm.slurm_get_errno()
  5369                                                       raise ValueError(slurm.slurm_strerror(apiError), apiError)
  5370                                           
  5371         1          2.0      2.0      0.0          listNum = slurm.slurm_list_count(JOBSList)
  5372         1          3.0      3.0      0.0          iters = slurm.slurm_list_iterator_create(JOBSList)
  5373                                           
  5374         1          1.0      1.0      0.0          for i in range(listNum):
  5375    701797     289661.0      0.4      0.1              job = <slurm.slurmdb_job_rec_t *>slurm.slurm_list_next(iters)
  5376                                           
  5377    701797     363685.0      0.5      0.1              JOBS_info = {}
  5378    701797     222075.0      0.3      0.1              if job is not NULL:
  5379    701797     298015.0      0.4      0.1                  jobid = job.jobid
  5380    701797     754165.0      1.1      0.2                  JOBS_info[u'account'] = slurm.stringOrNone(job.account, '')
  5381    701797     525496.0      0.7      0.1                  JOBS_info[u'allocated_gres'] = slurm.stringOrNone(job.alloc_gres, '')
  5382    701797     240712.0      0.3      0.1                  JOBS_info[u'allocated_nodes'] = job.alloc_nodes
  5383    701797     240928.0      0.3      0.1                  JOBS_info[u'array_job_id'] = job.array_job_id
  5384    701797     243614.0      0.3      0.1                  JOBS_info[u'array_max_tasks'] = job.array_max_tasks
  5385    701797     330985.0      0.5      0.1                  JOBS_info[u'array_task_id'] = job.array_task_id
  5386    701797     438708.0      0.6      0.1                  JOBS_info[u'array_task_str'] = slurm.stringOrNone(job.array_task_str, '')
  5387    701797     241542.0      0.3      0.1                  JOBS_info[u'associd'] = job.associd
  5388    701797     434178.0      0.6      0.1                  JOBS_info[u'blockid'] = slurm.stringOrNone(job.blockid, '')
  5389    701797     594449.0      0.8      0.2                  JOBS_info[u'cluster'] = slurm.stringOrNone(job.cluster, '')
  5390    701797     239231.0      0.3      0.1                  JOBS_info[u'derived_ec'] = job.derived_ec
  5391    701797     428321.0      0.6      0.1                  JOBS_info[u'derived_es'] = slurm.stringOrNone(job.derived_es, '')
  5392    701797     245760.0      0.4      0.1                  JOBS_info[u'elapsed'] = job.elapsed
  5393    701797     245991.0      0.4      0.1                  JOBS_info[u'eligible'] = job.eligible
  5394    701797     255902.0      0.4      0.1                  JOBS_info[u'end'] = job.end
  5395    701797     262542.0      0.4      0.1                  JOBS_info[u'exit_code'] = job.exitcode
  5396    701797     237194.0      0.3      0.1                  JOBS_info[u'gid'] = job.gid
  5397    701797     255011.0      0.4      0.1                  JOBS_info[u'jobid'] = job.jobid
  5398    701797     651020.0      0.9      0.2                  JOBS_info[u'jobname'] = slurm.stringOrNone(job.jobname, '')
  5399    701797     247941.0      0.4      0.1                  JOBS_info[u'lft'] = job.lft
  5400    701797     598449.0      0.9      0.2                  JOBS_info[u'partition'] = slurm.stringOrNone(job.partition, '')
  5401    701797    1851374.0      2.6      0.5                  JOBS_info[u'nodes'] = slurm.stringOrNone(job.nodes, '')
  5402    701797     278869.0      0.4      0.1                  JOBS_info[u'priority'] = job.priority
  5403    701797     256016.0      0.4      0.1                  JOBS_info[u'qosid'] = job.qosid
  5404    701797     259452.0      0.4      0.1                  JOBS_info[u'req_cpus'] = job.req_cpus
  5405    701797     546668.0      0.8      0.2                  JOBS_info[u'req_gres'] = slurm.stringOrNone(job.req_gres, '')
  5406    701797     242492.0      0.3      0.1                  JOBS_info[u'req_mem'] = job.req_mem
  5407    701797     256325.0      0.4      0.1                  JOBS_info[u'requid'] = job.requid
  5408    701797     249375.0      0.4      0.1                  JOBS_info[u'resvid'] = job.resvid
  5409    701797     440921.0      0.6      0.1                  JOBS_info[u'resv_name'] = slurm.stringOrNone(job.resv_name,'')
  5410    701797     249159.0      0.4      0.1                  JOBS_info[u'show_full'] = job.show_full
  5411    701797     239328.0      0.3      0.1                  JOBS_info[u'start'] = job.start
  5412    701797     257794.0      0.4      0.1                  JOBS_info[u'state'] = job.state
  5413    701797     283257.0      0.4      0.1                  JOBS_info[u'state_str'] = slurm.slurm_job_state_string(job.state)
  5414                                           
  5415    701797     255687.0      0.4      0.1                  JOBS_info[u'stat_actual_cpufreq'] = job.stats.act_cpufreq
  5416                                           
  5417    701797     238855.0      0.3      0.1                  JOBS_info[u'steps'] = "Not filled, string should be handled"
  5418    701797     268942.0      0.4      0.1                  JOBS_info[u'submit'] = job.submit
  5419    701797     250232.0      0.4      0.1                  JOBS_info[u'suspended'] = job.suspended
  5420    701797     251090.0      0.4      0.1                  JOBS_info[u'sys_cpu_sec'] = job.sys_cpu_sec
  5421    701797     238501.0      0.3      0.1                  JOBS_info[u'sys_cpu_usec'] = job.sys_cpu_usec
  5422    701797     255405.0      0.4      0.1                  JOBS_info[u'timelimit'] = job.timelimit
  5423    701797     243494.0      0.3      0.1                  JOBS_info[u'tot_cpu_sec'] = job.tot_cpu_sec
  5424    701797     242152.0      0.3      0.1                  JOBS_info[u'tot_cpu_usec'] = job.tot_cpu_usec
  5425    701797     244246.0      0.3      0.1                  JOBS_info[u'track_steps'] = job.track_steps
  5426    701797     954238.0      1.4      0.3                  JOBS_info[u'tres_alloc_str'] = slurm.stringOrNone(job.tres_alloc_str,'')
  5427    701797     609457.0      0.9      0.2                  JOBS_info[u'tres_req_str'] = slurm.stringOrNone(job.tres_req_str,'')
  5428    701797     250783.0      0.4      0.1                  JOBS_info[u'uid'] = job.uid
  5429    701797     435276.0      0.6      0.1                  JOBS_info[u'used_gres'] = slurm.stringOrNone(job.used_gres, '')
  5430    701797     570812.0      0.8      0.2                  JOBS_info[u'user'] = slurm.stringOrNone(job.user,'')
  5431    701797     249176.0      0.4      0.1                  JOBS_info[u'user_cpu_sec'] = job.user_cpu_sec
  5432    701797     248790.0      0.4      0.1                  JOBS_info[u'user_cpu_sec'] = job.user_cpu_usec
  5433    701797     523683.0      0.7      0.1                  JOBS_info[u'wckey'] = slurm.stringOrNone(job.wckey, '')
  5434    701797     241348.0      0.3      0.1                  JOBS_info[u'wckeyid'] = job.wckeyid
  5435    701797     264793.0      0.4      0.1                  J_dict[jobid] = JOBS_info
  5436                                           
  5437         1          4.0      4.0      0.0          slurm.slurm_list_iterator_destroy(iters)
  5438         1     910580.0 910580.0      0.3          slurm.slurm_list_destroy(JOBSList)
  5439         1          2.0      2.0      0.0          return J_dict


0luhancheng0 avatar Dec 05 '19 08:12 0luhancheng0

Thank you for doing this!

It looks like the DB call here is the culprit:

  5365         1  339186305.0 339186305.0     93.9          JOBSList = slurm.slurmdb_jobs_get(self.db_conn, self.job_cond)

The exact same query of 700K+ job records using sacct takes how long?
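
To put numbers on that, here is a small sketch that only redoes the arithmetic from the profiler output above (timer unit 1e-06 s). The variable names are illustrative; the figures are copied from the report.

```python
# Figures copied from the line_profiler output above (timer unit is 1e-06 s).
total_s = 361.231                 # "Total time" reported for get()
db_call_s = 339186305.0 * 1e-6    # line 5365: slurm.slurmdb_jobs_get(...)
records = 701797                  # hits on each line of the per-job loop

db_share = 100.0 * db_call_s / total_s
rest_s = total_s - db_call_s      # everything except the database call
rest_per_record_us = rest_s / records * 1e6

print(f"database call: {db_call_s:.1f} s ({db_share:.1f}% of total)")
print(f"everything else: {rest_s:.1f} s, about {rest_per_record_us:.0f} us per record")
```

In other words, building the Python dictionaries for all records accounts for only a small fraction of the run; nearly all of the time is spent waiting on the database call.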

giovtorres avatar Dec 05 '19 12:12 giovtorres

The command that generates the same records should be

time sacct -a -S 2019-09-01

Sorry, I used the wrong command in my previous comments. I should have specified the -a option, which displays jobs from all users.

The command takes

real	8m51.382s
user	0m50.494s
sys	0m18.220s

So sacct does not really provide a speedup compared to the DB call.

I think it would be great if slurmdb_jobs().get could take a list of usernames as an argument (like the -u option in sacct), since users are most likely to focus on their own job history.
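
Until something like that exists, the only option with the current return value seems to be filtering on the client side. A minimal sketch, assuming the dictionary shape shown in the profiled source (keyed by job id, each value carrying a 'user' field); the helper name and the sample records are made up for illustration:

```python
def filter_jobs_by_user(jobs, usernames):
    """Return only the entries of `jobs` whose 'user' field is in `usernames`.

    `jobs` is assumed to have the shape returned by slurmdb_jobs().get():
    a dict keyed by job id whose values are per-job dicts with a 'user' key.
    """
    wanted = set(usernames)
    return {jobid: info for jobid, info in jobs.items()
            if info.get("user") in wanted}


# Made-up sample data in the same shape; not real accounting records.
sample = {
    101: {"user": "alice", "state_str": "COMPLETED"},
    102: {"user": "bob", "state_str": "FAILED"},
    103: {"user": "alice", "state_str": "RUNNING"},
}

mine = filter_jobs_by_user(sample, ["alice"])
print(sorted(mine))
```

This only trims the result after the fact. Since almost all of the time goes into the database call itself, it would not make get() any faster, which is why the filter needs to be part of the query.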

Thank you for helping

0luhancheng0 avatar Dec 05 '19 13:12 0luhancheng0

@giovtorres Any movement on this? Is it still required? It does make sense, but I would need to review the sacct code.

gingergeeks avatar Nov 21 '21 19:11 gingergeeks

Providing a list of users is now supported via the new API for querying database Jobs. The new classes related to this are pyslurm.db.Job, pyslurm.db.Jobs and pyslurm.db.JobSearchFilter.

Check out the example provided here
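
A rough sketch of how such a query might look with those classes is below. Only the class names come from this thread; the `users` and `start_time` keywords and the `Jobs.load()` call are assumptions about the new API and should be checked against the documentation for the installed PySlurm version (later releases may expose the filter class under a different name). The helper name is hypothetical.

```python
def load_jobs_for_users(usernames, start_time=None):
    """Query accounting records for the given users only.

    Assumptions, not confirmed by this thread: JobSearchFilter accepts
    `users` and `start_time` keywords, and Jobs.load() takes the filter.
    """
    import pyslurm  # imported lazily so the sketch can be read without Slurm

    filter_kwargs = {"users": list(usernames)}
    if start_time is not None:
        filter_kwargs["start_time"] = start_time

    db_filter = pyslurm.db.JobSearchFilter(**filter_kwargs)
    return pyslurm.db.Jobs.load(db_filter)
```

On a machine with PySlurm and a reachable accounting database, a call such as load_jobs_for_users(["alice"], start_time="2019-09-01") would then be expected to fetch only that user's jobs, with the filtering done by the database rather than in Python. The username here is a placeholder.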

tazend avatar May 05 '23 21:05 tazend