flux-core
filter 'flux jobs' output by Node ID
Related to #4185, it would be useful to be able to filter the output of flux jobs by node ID, i.e., to easily answer questions like: who has recently run on node 'fooX'? Or how much time is left on the job(s) running on 'fooX'?
This is to help us at ops find trends. E.g., 20+ nodes suddenly crashed or OOM'd, and they were all running a single job, or individual jobs by the same user. Being able to tell when a job ends will also help us determine when physical work can be initiated on a node so that we can plan accordingly, especially for time-sensitive work.
Being able to query all jobs that ran on a node from a specific start date until today would also help us pick up on trends for issues on specific nodes, or, in case some nodes fell out over the weekend, dig deeper the following week to identify underlying issues.
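Querying "from a specific date till today" needs a since timestamp; in the scripts below, `flux.util.parse_datetime` handles offsets such as `-7d`. As a standalone illustration only (not the Flux implementation), a minimal parser for that style of relative offset, assuming suffixes `m`/`h`/`d` for minutes, hours, and days:

```python
from datetime import datetime, timedelta

# Assumed suffixes mapped to timedelta keyword arguments.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}

def offset_to_timestamp(spec, now=None):
    """Convert a relative offset like '-7d' into a UNIX timestamp."""
    now = now or datetime.now()
    sign = -1 if spec.startswith("-") else 1
    digits = spec[1:-1] if spec[0] in "+-" else spec[:-1]
    delta = timedelta(**{_UNITS[spec[-1]]: sign * int(digits)})
    return (now + delta).timestamp()
```

With a timestamp in hand, any job whose end time is at or after it falls in the window.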
There is a huge opportunity for enhanced tools here, since the full "resource set" for every job that ran will be available in the KVS or the job archive database. If there is equivalent functionality you use now in an existing resource manager, it would be very helpful to delineate some specific scenarios in this issue. Thanks!
Here's something I just hacked together that could be the start of a solution here:
```python
#!/usr/bin/python3
import sys
from datetime import datetime

import flux
from flux.job import job_list_inactive
from flux.hostlist import Hostlist
from flux.util import parse_datetime

host = sys.argv[1]
try:
    since = parse_datetime(sys.argv[2]).timestamp()
except (IndexError, ValueError):
    # No date given or unparseable: list all inactive jobs
    since = 0.0

date = datetime.fromtimestamp(since)
print(f"Looking for jobs on {host} since {date:%D %T}")

jobs = job_list_inactive(
    flux.Flux(),
    since=since,
    attrs=["ranks", "nodelist"],
    max_entries=0,
).get_jobinfos()

for job in jobs:
    if host in Hostlist(job.nodelist):
        print(job.id, job.nodelist)
```
e.g. get all the jobs that ran on fluke100 in the last 7 days:
```
grondo@fluke108:~$ flux python jobs-by-node.py fluke100 -7d
Looking for jobs on fluke100 since 03/06/22 15:21:51
ƒPUpdfzsBNw fluke[6-16,18-23,25-56,58,60,62-65,67-78,80-91,93-103]
ƒPUJoPtLuyR fluke[60,62-65,67-78,80-91,93-103]
ƒPUJ2cxCxBy fluke[60,62-65,67-78,80-91,93-103]
ƒPUHHiqGH4T fluke[60,62-65,67-78,80-91,93-103]
ƒPKG3VMJens fluke[10-16,18-23,25-50,100]
ƒP3mnZZMoPm fluke100
ƒNoxd1vTd1H fluke[53-56,58,60,62-65,67-78,80-91,93-103]
ƒNowVsxonbR fluke[53-56,58,60,62-65,67-78,80-91,93-103]
ƒNovr2LdRBm fluke[53-56,58,60,62-65,67-78,80-91,93-103]
ƒNovJUGKxuV fluke[53-56,58,60,62-65,67-78,80-91,93-103]
ƒNjEyZYkycB fluke[6-16,18-23,25-56,58,60,62-65,67-78,80-91,93-103]
ƒNjEJPn5VhH fluke[94-103]
```
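The nodelists above use bracket-range notation, which `flux.hostlist.Hostlist` expands in the script's `host in Hostlist(job.nodelist)` check. For readers without Flux handy, here is a deliberately simplified standalone expander (comma-separated entries, at most one bracket group per entry, no zero-padding or nesting) that shows what that membership test amounts to:

```python
import re

def expand_hostlist(spec):
    """Expand e.g. 'fluke[6-8,10],fluke100' into individual hostnames.

    Simplified sketch: each comma-separated entry may carry one [a-b,c]
    bracket group; entries without brackets pass through unchanged.
    """
    hosts = []
    # Split on commas that are not inside a bracket group.
    for entry in re.split(r",(?![^\[]*\])", spec):
        m = re.fullmatch(r"([^\[]+)\[([^\]]+)\]", entry)
        if not m:
            hosts.append(entry)
            continue
        prefix, ranges = m.groups()
        for part in ranges.split(","):
            lo, _, hi = part.partition("-")
            for i in range(int(lo), int(hi or lo) + 1):
                hosts.append(f"{prefix}{i}")
    return hosts
```

A membership check is then just `"fluke100" in expand_hostlist(nodelist)`; the real `Hostlist` class handles the full notation and does this more efficiently.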
This script, and a similar one to return the amount of remaining time for active jobs by node, could possibly be used as temporary solutions until we have a more powerful query interface for jobs.
Just added support for querying active jobs and their remaining time (I can rework this script to provide a better interface a bit later).
```python
#!/usr/bin/python3
import sys
from datetime import datetime, timedelta

import flux
from flux.job import job_list_inactive, JobList
from flux.hostlist import Hostlist
from flux.util import parse_datetime

host = sys.argv[1]
try:
    since = parse_datetime(sys.argv[2]).timestamp()
except (IndexError, ValueError):
    # No date given or unparseable: list all inactive jobs
    since = 0.0

date = datetime.fromtimestamp(since)
print(f"Looking for jobs on {host} since {date:%D %T}")

handle = flux.Flux()
inactive = job_list_inactive(
    handle,
    since=since,
    attrs=["ranks", "nodelist"],
    max_entries=0,
)
active = JobList(
    handle,
    attrs=["ranks", "nodelist", "expiration", "state", "result"],
    max_entries=0,
    filters=["running"],
)

for job in inactive.get_jobinfos():
    if host in Hostlist(job.nodelist):
        print(job.id, job.nodelist)

for job in active.jobs():
    if host in Hostlist(job.nodelist):
        dt = timedelta(seconds=int(job.t_remaining))
        print(f"active job {job.id} on {job.nodelist} finishes in {dt}")
```
```
$ flux python jobs-by-node.py fluke100 -1d
Looking for jobs on fluke100 since 03/12/22 16:16:30
ƒPUpdfzsBNw fluke[6-16,18-23,25-56,58,60,62-65,67-78,80-91,93-103]
active job ƒPjGRCGKr9m on fluke[100-103] finishes in 0:04:36
active job ƒPjEddGoq1Z on fluke[100-103] finishes in 0:00:41
```
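The "finishes in" strings above come from the script's `timedelta(seconds=int(job.t_remaining))`, i.e., time until the job's expiration. A standalone sketch of that computation, assuming an expiration timestamp and the current time are available:

```python
from datetime import timedelta

def remaining(expiration, now):
    """Format seconds-until-expiration as H:MM:SS, clamped at zero."""
    secs = max(0, int(expiration - now))
    return str(timedelta(seconds=secs))
```

Clamping at zero avoids printing a negative interval for a job that has exceeded its time limit but not yet been cleaned up.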
Thanks for these examples @grondo! I just added support for hostlist filtering to the queue wrapper script, e.g.:
```
[day36@fluke108:~]$ module use /usr/global/tools/flux_wrappers/modulefiles/
[day36@fluke108:~]$ module load flux_wrappers
[day36@fluke108:~]$ squeue -t all -w fluke[8-9]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
ƒRxNhjEFw3u default sleep day36 CD 00:00 2 fluke[7-8]
ƒRdBf1NzaWK default clomp_mp moussa1 CD 00:00 81 fluke[7-16,18-23,25-56,58,60,62-65,67-78,81,83-91,93-97]
ƒRdBf1NzaWK default clomp_mp moussa1 CD 00:00 81 fluke[7-16,18-23,25-56,58,60,62-65,67-78,81,83-91,93-97]
ƒRdBevYnvF9 default clomp_mp moussa1 CD 00:00 64 fluke[7-16,18-23,25-56,58,60,62-65,67-76]
ƒRdBevYnvF9 default clomp_mp moussa1 CD 00:00 64 fluke[7-16,18-23,25-56,58,60,62-65,67-76]
ƒRdBeqfdHRH default clomp_mp moussa1 CD 00:00 49 fluke[7-16,18-23,25-56,58]
ƒRdBeqfdHRH default clomp_mp moussa1 CD 00:00 49 fluke[7-16,18-23,25-56,58]
ƒRdBekjVg2j default clomp_mp moussa1 CD 00:00 36 fluke[7-16,18-23,25-44]
ƒRdBekjVg2j default clomp_mp moussa1 CD 00:00 36 fluke[7-16,18-23,25-44]
ƒRdBdw62kjy default amg_setu moussa1 CD 00:00 1 fluke9
ƒRdBdPNC823 default amg_138. moussa1 CD 00:00 1 fluke8
...
```
(Leaving off the `-t all` shows just the running and pending jobs.)
Linking to #3066.
This issue makes me wonder if, long term, some general advanced querying mechanism may be in order.
Linking to #4914, as that is the more likely "real" solution to this.
Any strong opinions on what a filter for "node ID" should be called? I originally went with `hosts`, but now I don't like it. The machines we ran on are currently called the "nodelist", so `--nodes` might make more sense. But `--nodelist` might also be better. Ehhh.
> I originally went with `hosts`, but now I don't like it.
It would be interesting to hear why you don't like it. Though I'd be fine with `--nodelist`.
> so `--nodes` might make more sense. But `--nodelist` might also be better.
Is the addition of a new option going to make it more difficult to support the advanced filter option that is supposed to be the real solution here? E.g., what would you do when the `--nodelist` option is mixed with a `--filter` option? (Or are we proposing a different command to use the RFC 35 query syntax, e.g., as proposed for flux-pgrep in #4915?)
> It would be interesting to hear why you don't like it.
It's mostly my anal-retentive side, because we call the list of nodes "nodelist" vs. "hostlist" ... :shrug: So I sort of like the idea of the option being the name of the constraint operator as well, since it is for everything else.
> Is the addition of a new option going to make it more difficult to support the advanced filter option that is supposed to be the real solution here?
Honestly, I went back and forth on whether the advanced filter/query option (#5367) should go before this one ... so that's a TBD. This one could still come after that one.
> This one could still come after that one.
Well, my worry is that a `--nodelist` and a `--filter` option could be redundant and require special handling if both are specified. If `--filter` is supported, then a separate `--nodelist` option isn't even needed, and equivalent matching could be supported through the filter syntax, etc. It will also be very useful to be able to query jobs that ran on a given node at a certain time or in a given window.
I could help work on the syntax parser by adapting the experimental pgrep one.
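To give a flavor of how a single filter expression could subsume a dedicated `--nodelist` option, here is a toy evaluator over `key:value` terms joined by `and`. The term names and grammar are illustrative only, not the RFC 35 syntax being discussed:

```python
def match(job, expr):
    """Return True if job (a dict) satisfies every 'key:value' term.

    Toy sketch: terms are separated by ' and '; a term matches when the
    wanted value equals the job field, or is a member of it when the
    field is a list (e.g. a pre-expanded nodelist).
    """
    for term in expr.split(" and "):
        key, _, want = term.partition(":")
        field = job.get(key)
        if isinstance(field, list):
            if want not in field:
                return False
        elif field != want:
            return False
    return True
```

In this model a node constraint is just another term alongside user, state, and time predicates, which is why a separate option becomes redundant.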
> If `--filter` is supported, then a separate `--nodelist` option isn't even needed, and equivalent matching could be supported through the filter syntax, etc. It will also be very useful to be able to query jobs that ran on a given node at a certain time or in a given window.
Yeah, my thoughts as well. I honestly was waffling on which to start first, and chose this one mostly because I didn't quite have an idea of how to pass a query from the user into the `JobList()` object. But we can discuss that design in the other issue.