
job-info: update watcher lists may not scale well

chu11 opened this issue 8 months ago · 7 comments

In #6709, it's noted that job-info cancellation lists can be slow when there are a lot of watchers. That problem mostly scales poorly with the total number of job submissions (i.e. the total number of clients watching eventlogs and the like).

Update watchers in job-info are different: they really only scale poorly with the total number of currently running jobs.

This area requires a bit more rework if we want to scale it out: the msglist should become a zlistx_t, and we probably want to add "matchtag lookup keys" to some struct, similar to what was done in #6709.
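
As a rough sketch of that shape (all names here are invented for illustration, not flux-core's actual internals): each watcher would keep its zlistx_t handle for O(1) removal, and a hash keyed on sender uuid plus matchtag would let a cancel find its watcher without a list scan.

#define _GNU_SOURCE /* asprintf */
#include <stdio.h>
#include <czmq.h>
#include <flux/core.h>

/* Hypothetical watcher entry -- not flux-core's actual struct. */
struct update_watcher {
    const flux_msg_t *msg; /* original job-info.update-watch request */
    void *handle;          /* zlistx_t handle, for O(1) delete */
};

/* Build a "<sender-uuid>.<matchtag>" hash key so a cancel (which
 * carries the matchtag in its payload) can find its watcher with one
 * zhashx_lookup () instead of walking the msglist. */
static char *watcher_key (const flux_msg_t *msg)
{
    const char *uuid = flux_msg_route_first (msg);
    uint32_t matchtag;
    char *key;

    if (!uuid || flux_msg_get_matchtag (msg, &matchtag) < 0)
        return NULL;
    if (asprintf (&key, "%s.%u", uuid, (unsigned int)matchtag) < 0)
        return NULL;
    return key;
}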

This appears to be a lower-priority scaling issue, so I'm separating it out to potentially be done at a later time if it becomes a problem.

chu11 avatar Mar 17 '25 23:03 chu11

There are two "update watch" loops of concern at scale:

1 - update watch: loop over all namespace-key combos
2 - update watch: loop over all the watchers of a specific namespace-key combo

The first is only an issue with a large number of namespaces, i.e. a large number of jobs. The second is presumably only an issue with very large jobs, i.e. many job shells watching for updates to jobspec or R.

Note that these are somewhat orthogonal to each other:

having many namespace-key combos probably means high-throughput workloads (i.e. tons of small jobs)

having many watchers of a specific namespace-key combo probably means a big job (i.e. a single job running on tons of cores)

so improving the performance of one of these loops for its specific case probably doesn't help with the other.
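
To make the two loops concrete, here's an illustrative sketch (the struct and function names are invented, not flux-core's actual code): loop 1 is a linear search across namespace-key combos, loop 2 fans an update out to every watcher of one combo.

#include <string.h>
#include <czmq.h>
#include <jansson.h>
#include <flux/core.h>

/* Illustrative only -- not flux-core's actual data structures. */
struct combo {
    char *ns_key;      /* e.g. "<namespace>.R" */
    zlist_t *watchers; /* update-watch requests for this key */
};

/* loop 1: over all namespace-key combos -- grows with the number of
 * running jobs, i.e. hurts high-throughput workloads */
static struct combo *combo_lookup (zlist_t *combos, const char *ns_key)
{
    struct combo *c = zlist_first (combos);
    while (c) {
        if (strcmp (c->ns_key, ns_key) == 0)
            return c;
        c = zlist_next (combos);
    }
    return NULL;
}

/* loop 2: over all watchers of one combo -- grows with job size,
 * e.g. many job shells watching R on one big job */
static void combo_broadcast (flux_t *h, struct combo *c, json_t *update)
{
    const flux_msg_t *msg = zlist_first (c->watchers);
    while (msg) {
        /* real code would handle errors and drop dead watchers */
        (void)flux_respond_pack (h, msg, "{s:O}", "update", update);
        msg = zlist_next (c->watchers);
    }
}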

chu11 avatar Mar 18 '25 18:03 chu11

In a bit of a surprise, implementing another uuid+matchtag hash similar to what was done in #6695 and #6709 did not initially appear to help. The realization was that the major user of job-info.update-watch, the job shell, did not actually cancel the watcher.

Adding ...

--- a/src/shell/info.c
+++ b/src/shell/info.c
@@ -332,7 +332,20 @@ void shell_info_destroy (struct shell_info *info)
 {
     if (info) {
         int saved_errno = errno;
-        flux_future_destroy (info->R_watch_future);
+        if (info->R_watch_future) {
+            flux_t *h = flux_future_get_flux (info->R_watch_future);
+            flux_future_t *f;
+            f = flux_rpc_pack (h,
+                               "job-info.update-watch-cancel",
+                               FLUX_NODEID_ANY,
+                               FLUX_RPC_NORESPONSE,
+                               "{s:i}",
+                               "matchtag",
+                               flux_rpc_get_matchtag (info->R_watch_future));
+            if (!f)
+                shell_log_error ("job-info.update-watch-cancel failed");
+            flux_future_destroy (info->R_watch_future);
+        }
         json_decref (info->R);
         jobspec_destroy (info->jobspec);
         rcalc_destroy (info->rcalc);

did seem to improve things ever so slightly.
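
For completeness, a hedged sketch of what the server side of that cancel might look like (the handler name and indexing scheme are assumptions, building on the uuid+matchtag key idea above): unpack the matchtag from the payload, combine it with the sender uuid, and look the watcher up directly rather than scanning the msglist.

#define _GNU_SOURCE /* asprintf */
#include <stdio.h>
#include <errno.h>
#include <czmq.h>
#include <flux/core.h>

struct update_watcher {    /* same shape as the earlier sketch */
    const flux_msg_t *msg;
    void *handle;
};

/* Hypothetical handler for job-info.update-watch-cancel. */
static void update_watch_cancel_cb (flux_t *h,
                                    flux_msg_handler_t *mh,
                                    const flux_msg_t *msg,
                                    void *arg)
{
    zhashx_t *watchers = arg; /* keyed by "<uuid>.<matchtag>" */
    const char *uuid = flux_msg_route_first (msg);
    struct update_watcher *w;
    int matchtag;
    char *key = NULL;

    if (!uuid || flux_msg_unpack (msg, "{s:i}", "matchtag", &matchtag) < 0)
        return;
    if (asprintf (&key, "%s.%d", uuid, matchtag) < 0)
        return;
    if ((w = zhashx_lookup (watchers, key))) {
        /* end the watch with ENODATA; real code would also
         * zlistx_delete () the msglist entry via w->handle */
        (void)flux_respond_error (h, w->msg, ENODATA, NULL);
        zhashx_delete (watchers, key);
    }
    free (key);
}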

chu11 avatar Mar 21 '25 02:03 chu11

While adding a hash to the job-info update-watch code appears to be a small net win, it is small enough to be questionable. Given that the job shell doesn't even call cancel and given the existence of #6721, it's a debatable win. Perhaps in the future, if more update-watcher keys are supported (beyond just R and jobspec), this will change.

For now, I will park the work here for consideration down the road:

https://github.com/chu11/flux-core/tree/issue6990_kvs_watch_part_4

chu11 avatar Mar 21 '25 03:03 chu11

Shouldn't the watch be canceled by the disconnect when the shell exits and closes its handle?

grondo avatar Mar 21 '25 14:03 grondo

> Shouldn't the watch be canceled by the disconnect when the shell exits and closes its handle?

Yes, perhaps I should have said it is not canceled "directly" but rather via the disconnect. Because disconnects don't have matchtags, those watchers can't be cleaned up quickly; the disconnect handler just loops through everything to see what should be cleaned up (see #6721).
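
A sketch of why that's slow (illustrative, reusing the struct combo shape from the earlier sketch; the real cleanup is what #6721 is about): a disconnect carries no matchtag, so there's no key to hash on, and every watcher of every namespace-key combo has to be tested with flux_disconnect_match ().

#include <czmq.h>
#include <flux/core.h>

struct combo {
    char *ns_key;
    zlist_t *watchers; /* stored update-watch requests */
};

/* Hypothetical disconnect path: with no matchtag to key on, both
 * loops run in full for every disconnect. */
static void disconnect_cb (zlist_t *combos, const flux_msg_t *disconnect)
{
    struct combo *c = zlist_first (combos);
    while (c) {                                  /* all combos */
        const flux_msg_t *msg = zlist_first (c->watchers);
        while (msg) {                            /* all watchers */
            if (flux_disconnect_match (disconnect, msg)) {
                /* real code would end the watch and remove the entry;
                 * removing during zlist iteration needs care */
            }
            msg = zlist_next (c->watchers);
        }
        c = zlist_next (combos);
    }
}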

chu11 avatar Mar 21 '25 14:03 chu11

Ah, ok. That makes sense now :-)

I wonder if there's a way to fix the disconnect case. There are probably lots of cases of that lingering around...

grondo avatar Mar 21 '25 16:03 grondo

Yeah, when I discovered it I realized "oh crap, we could improve performance by NOT canceling watchers"... To some loose extent, the work to solve #6695 and #6709 turned two iterations over watchers into one. Since the update watcher for R in the job shell was only doing one iteration to begin with, the addition of the cancel was actually a net loss.

chu11 avatar Mar 21 '25 18:03 chu11