filter 'flux jobs' output by Node ID

Open ryanday36 opened this issue 2 years ago • 13 comments

Related to #4185, it would be useful to be able to filter the output of flux jobs by node ID. I.e. to be able to easily answer questions like who has recently run on node 'fooX'? or how much time is left on the job(s) running on 'fooX'?

ryanday36 avatar Mar 03 '22 21:03 ryanday36

This would help us in ops find trends, e.g. 20+ nodes suddenly crashed or OOM'd and they were all running a single job, or individual jobs by the same user. Being able to tell when a job ends would also help us determine when physical work can begin on a node so that we can plan accordingly, especially for time-sensitive work.
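
To make the "find trends" scenario concrete, here is a minimal sketch in plain Python. It assumes the job records have already been fetched and their nodelists expanded into sets of host names. The record layout, user names, and host names below are made up for illustration and do not come from any Flux API.

```python
from collections import defaultdict

# Hypothetical records: (job id, user, set of hosts the job ran on).
jobs = [
    ("job-a", "alice", {"foo1", "foo2", "foo3"}),
    ("job-b", "bob", {"foo4"}),
    ("job-c", "alice", {"foo5", "foo6"}),
]
# Hypothetical set of nodes that crashed or OOM'd.
failed_nodes = {"foo2", "foo3", "foo5", "foo9"}


def users_on_failed_nodes(jobs, failed_nodes):
    """Map each user to the failed nodes that their jobs ran on."""
    hits = defaultdict(set)
    for _jobid, user, hosts in jobs:
        overlap = hosts & failed_nodes
        if overlap:
            hits[user] |= overlap
    return dict(hits)


# List users by how many failed nodes they touched, most first.
report = users_on_failed_nodes(jobs, failed_nodes)
for user, nodes in sorted(report.items(), key=lambda kv: -len(kv[1])):
    print(user, sorted(nodes))
```

A user who shows up against most of the failed nodes is a natural first place to look.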

Raj-Bagri1 avatar Mar 03 '22 23:03 Raj-Bagri1

Being able to query all jobs that ran on a node from a specific date until today would also help us pick up on trends for issues on specific nodes. For example, if some nodes fell out over the weekend, we could dig deeper the following week to identify underlying issues.

Raj-Bagri1 avatar Mar 03 '22 23:03 Raj-Bagri1

There is a huge opportunity for enhanced tools here since the full "resource set" for every job that ran will be available in the KVS or the job archive database. If there is equivalent functionality you use now in an existing resource manager, it would be very helpful to delineate some specific scenarios in this issue. Thanks!

grondo avatar Mar 03 '22 23:03 grondo

Here's something I just hacked together that could be the start of a solution here:

#!/usr/bin/python3

import sys
from datetime import datetime

import flux
from flux.job import job_list_inactive
from flux.hostlist import Hostlist
from flux.util import parse_datetime

host = sys.argv[1]
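try:
    # Optional second argument: a start time such as "-7d"
    since = parse_datetime(sys.argv[2]).timestamp()
except Exception:
    # No (or unparsable) start time given: search all inactive jobs
    since = 0.0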

date = datetime.fromtimestamp(since)
print(f"Looking for jobs on {host} since {date:%D %T}")

jobs = job_list_inactive(
    flux.Flux(),
    since=since,
    attrs=["ranks", "nodelist"],
    max_entries=0,
).get_jobinfos()

for job in jobs:
    if host in Hostlist(job.nodelist):
        print(job.id, job.nodelist)

e.g. get all the jobs that ran on fluke100 in the last 7 days:

grondo@fluke108:~$ flux python jobs-by-node.py fluke100 -7d
Looking for jobs on fluke100 since 03/06/22 15:21:51
ƒPUpdfzsBNw fluke[6-16,18-23,25-56,58,60,62-65,67-78,80-91,93-103]
ƒPUJoPtLuyR fluke[60,62-65,67-78,80-91,93-103]
ƒPUJ2cxCxBy fluke[60,62-65,67-78,80-91,93-103]
ƒPUHHiqGH4T fluke[60,62-65,67-78,80-91,93-103]
ƒPKG3VMJens fluke[10-16,18-23,25-50,100]
ƒP3mnZZMoPm fluke100
ƒNoxd1vTd1H fluke[53-56,58,60,62-65,67-78,80-91,93-103]
ƒNowVsxonbR fluke[53-56,58,60,62-65,67-78,80-91,93-103]
ƒNovr2LdRBm fluke[53-56,58,60,62-65,67-78,80-91,93-103]
ƒNovJUGKxuV fluke[53-56,58,60,62-65,67-78,80-91,93-103]
ƒNjEyZYkycB fluke[6-16,18-23,25-56,58,60,62-65,67-78,80-91,93-103]
ƒNjEJPn5VhH fluke[94-103]

This script, and a similar one to return the amount of remaining time for active jobs by node, could possibly be used as temporary solutions until we have a more powerful query interface for jobs.
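
For readers unfamiliar with the compressed nodelist notation in the output above, the following is a simplified sketch of what a membership test like `host in Hostlist(job.nodelist)` has to do. It is a plain-Python illustration, not the `flux.hostlist` implementation: the helper names `expand_hostlist` and `ran_on` are hypothetical, and it only handles comma-separated entries with at most one trailing bracket group and no zero-padded indices.

```python
import re


def expand_hostlist(hostlist):
    """Expand a compressed hostlist such as 'fluke[6-8,10]' into host names.

    Simplified sketch: each entry is a plain name or 'prefix[ranges]'.
    """
    hosts = []
    # Split into entries, keeping commas inside brackets with their entry.
    for entry in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", hostlist):
        match = re.fullmatch(r"([^\[]+)\[([^\]]*)\]", entry)
        if match is None:
            hosts.append(entry)  # plain name such as 'fluke100'
            continue
        prefix, ranges = match.groups()
        for part in ranges.split(","):
            lo, _, hi = part.partition("-")
            hi = hi or lo  # a single index such as '58'
            hosts.extend(f"{prefix}{i}" for i in range(int(lo), int(hi) + 1))
    return hosts


def ran_on(host, nodelist):
    """Return True if host is a member of the compressed nodelist."""
    return host in expand_hostlist(nodelist)


print(ran_on("fluke100", "fluke[10-16,18-23,25-50,100]"))
```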

grondo avatar Mar 13 '22 23:03 grondo

Just added support for query of active jobs and their remaining time (I can rework this script to provide a better interface a bit later)

#!/usr/bin/python3

import sys
from datetime import datetime, timedelta

import flux
from flux.job import job_list_inactive, JobList
from flux.hostlist import Hostlist
from flux.util import parse_datetime

host = sys.argv[1]
try:
    since = parse_datetime(sys.argv[2]).timestamp()
except Exception:
    # No (or unparsable) start time given: search all inactive jobs
    since = 0.0

date = datetime.fromtimestamp(since)
print(f"Looking for jobs on {host} since {date:%D %T}")

handle = flux.Flux()
inactive = job_list_inactive(
    handle,
    since=since,
    attrs=["ranks", "nodelist"],
    max_entries=0,
)

active = JobList(
    handle,
    attrs=["ranks", "nodelist", "expiration", "state", "result"],
    max_entries=0,
    filters=["running"]
)

for job in inactive.get_jobinfos():
    if host in Hostlist(job.nodelist):
        print(job.id, job.nodelist)

for job in active.jobs():
    if host in Hostlist(job.nodelist):
        dt = timedelta(seconds=int(job.t_remaining))
        print(f"active job {job.id} on {job.nodelist} finishes in {dt}")

$ flux python jobs-by-node.py fluke100 -1d
Looking for jobs on fluke100 since 03/12/22 16:16:30
ƒPUpdfzsBNw fluke[6-16,18-23,25-56,58,60,62-65,67-78,80-91,93-103]
active job ƒPjGRCGKr9m on fluke[100-103] finishes in 0:04:36
active job ƒPjEddGoq1Z on fluke[100-103] finishes in 0:00:41
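
As a small illustration of the "finishes in" column, here is a sketch of how a remaining time can be computed and formatted, assuming it is simply the job's expiration timestamp minus the current time. The helper name `time_remaining` is hypothetical, and how `t_remaining` is actually computed in the Flux Python bindings is not shown in this thread.

```python
from datetime import timedelta


def time_remaining(expiration, now):
    """Return the time left before a job expires, clamped at zero.

    Both arguments are seconds since the epoch.
    """
    seconds = max(0, int(expiration - now))
    return timedelta(seconds=seconds)


# A job expiring 276 seconds from "now" formats like the output above.
print(time_remaining(1276.0, 1000.0))
```

Truncating to whole seconds before building the `timedelta` keeps the output free of microseconds, matching the `int(...)` call in the script.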

grondo avatar Mar 14 '22 00:03 grondo

Thanks for these examples @grondo! I just added support for hostlist filtering to the queue wrapper script, e.g.

[day36@fluke108:~]$ module use /usr/global/tools/flux_wrappers/modulefiles/
[day36@fluke108:~]$ module load flux_wrappers
[day36@fluke108:~]$ squeue -t all -w fluke[8-9]
               JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         ƒRxNhjEFw3u   default    sleep    day36 CD      00:00      2 fluke[7-8]
         ƒRdBf1NzaWK   default clomp_mp  moussa1 CD      00:00     81 fluke[7-16,18-23,25-56,58,60,62-65,67-78,81,83-91,93-97]
         ƒRdBf1NzaWK   default clomp_mp  moussa1 CD      00:00     81 fluke[7-16,18-23,25-56,58,60,62-65,67-78,81,83-91,93-97]
         ƒRdBevYnvF9   default clomp_mp  moussa1 CD      00:00     64 fluke[7-16,18-23,25-56,58,60,62-65,67-76]
         ƒRdBevYnvF9   default clomp_mp  moussa1 CD      00:00     64 fluke[7-16,18-23,25-56,58,60,62-65,67-76]
         ƒRdBeqfdHRH   default clomp_mp  moussa1 CD      00:00     49 fluke[7-16,18-23,25-56,58]
         ƒRdBeqfdHRH   default clomp_mp  moussa1 CD      00:00     49 fluke[7-16,18-23,25-56,58]
         ƒRdBekjVg2j   default clomp_mp  moussa1 CD      00:00     36 fluke[7-16,18-23,25-44]
         ƒRdBekjVg2j   default clomp_mp  moussa1 CD      00:00     36 fluke[7-16,18-23,25-44]
         ƒRdBdw62kjy   default amg_setu  moussa1 CD      00:00      1 fluke9
         ƒRdBdPNC823   default amg_138.  moussa1 CD      00:00      1 fluke8
...

(leaving off the -t all shows just the running and pending jobs)

ryanday36 avatar Mar 25 '22 17:03 ryanday36

linking to #3066

this issue makes me wonder if long term some general advanced querying mechanism may be in order

chu11 avatar Apr 04 '22 19:04 chu11

Linking to #4914, as that is the more likely "real" solution to this

chu11 avatar Apr 03 '23 00:04 chu11

Any strong opinions on what a filter for "node id" should be called? I originally went with hosts, but now I don't like it. Machines we ran on are currently called the "nodelist", so --nodes might make more sense. But --nodelist might also be better. ehhh

chu11 avatar Dec 28 '23 23:12 chu11

> I originally went with hosts, but now I don't like it.

It would be interesting to hear why you don't like it. Though I'd be fine with --nodelist.

> so --nodes might make more sense. But --nodelist might also be better.

Is addition of a new option going to make it more difficult to support the advanced filter option that is supposed to be the real solution here? E.g. what would you do when the --nodelist option is mixed with a --filter option (or are we proposing a different command to use the RFC 35 query syntax? e.g. as proposed for flux-pgrep in #4915)

grondo avatar Dec 29 '23 00:12 grondo

> It would be interesting to hear why you don't like it.

It's mostly my anal retentive side, because we call the list of nodes "nodelist" vs "hostlist" ... :shrug: So I sort of like the idea of the option being the name of the constraint operator as well, since it is for everything else.

> Is addition of a new option going to make it more difficult to support the advanced filter option that is supposed to be the real solution here?

Honestly, I went back and forth on whether the advanced filter/query option (#5367) should go before this one ... so that's a TBD. This one could still come after that one.

chu11 avatar Dec 29 '23 00:12 chu11

> This one could still come after that one.

Well my worry is that a --nodelist and --filter option could be redundant and require special handling if both are specified. If --filter is supported then a separate --nodelist option isn't even needed, and could be supported, etc. It will also be very useful to be able to query jobs that ran on a given node at a certain time or in a given window.

I could help work on the syntax parser by adapting the pgrep experimental one.
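
To illustrate the "given node in a given window" case, here is a minimal sketch of the predicate such a query would need to evaluate. It is plain Python over a hypothetical `JobRecord` type with made-up field names. It says nothing about how a --filter option or the RFC 35 syntax would express this.

```python
from typing import NamedTuple, Optional


class JobRecord(NamedTuple):
    """Hypothetical job record, for illustration only."""
    jobid: str
    hosts: frozenset
    t_start: float            # when the job started running
    t_end: Optional[float]    # None while the job is still running


def ran_on_during(job, host, window_start, window_end):
    """True if job used host at any point in [window_start, window_end)."""
    if host not in job.hosts:
        return False
    end = float("inf") if job.t_end is None else job.t_end
    # Two intervals overlap when each one starts before the other ends.
    return job.t_start < window_end and end > window_start


job = JobRecord("job-a", frozenset({"fluke100", "fluke101"}), 100.0, 200.0)
print(ran_on_during(job, "fluke100", 150.0, 300.0))
```

Treating a still-running job as having an open-ended interval lets the same test cover both active and inactive jobs.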

grondo avatar Dec 29 '23 01:12 grondo

> If --filter is supported then a separate --nodelist option isn't even needed, and could be supported, etc. It will also be very useful to be able to query jobs that ran on a given node at a certain time or in a given window.

Yeah, my thoughts as well. I honestly was waffling on which to start first, and chose this one mostly b/c I didn't quite have an idea on how to pass a query from the user into the JobList() object. But we can discuss that design in the other issue.

chu11 avatar Dec 29 '23 04:12 chu11