flux-core icon indicating copy to clipboard operation
flux-core copied to clipboard

kvs: uncompleted RPCs should receive ENOSYS response when module is unloaded

Open garlick opened this issue 1 year ago • 5 comments

Problem: we ran flux module reload kvs on a live system and it seemed evident that pending commit requests did not receive an error.

I had thought that rpc_track would ensure that pending RPCs get an ENOSYS when the module handling them is unloaded but that may not be the case. It probably should be so that requests don't hang forever.

garlick avatar May 17 '24 01:05 garlick

Yeah we don't do this for modules. I was confusing this with the generation of <service>.disconnect messages when a module unloads so that any services it is using are notified that it's not there anymore.

I think the rationale was that modules should do their own responding to RPCs they are tracking since they do have the opportunity to do so. Changing the name of this issue to target the KVS module.

garlick avatar May 17 '24 01:05 garlick

began looking at this and I think the commit side of this is more doable b/c we "track" the requests within the transaction manager already.

However, on the lookup/get side, we do not track current requests at all, making this more annoying/difficult.

chu11 avatar Jun 17 '24 19:06 chu11

Oh this might get slightly ugly since like to "replay" lookups that are stalled for some reason. We'd have to ensure a lookup isn't already in the msglist or whatever.

garlick avatar Jun 17 '24 19:06 garlick

Oh this might get slightly ugly since like to "replay" lookups that are stalled for some reason. We'd have to ensure a lookup isn't already in the msglist or whatever.

Doh! And stalls/replays also happen on the commit side, I didn't realize till just now

    if (!(root = getroot (ctx, ns, mh, msg, NULL, &stall))) {                                                                               
        if (stall)                                                                                                                          
            return;                                                                                                                         
        goto error;                                                                                                                         
    }                                                                                                                                       
                                                                                                                                            
    if (!(tr = treq_create_rank (ctx->rank, ctx->seq++, 1, flags))) {                                                                       
        flux_log_error (h, "%s: treq_create_rank", __FUNCTION__);                                                                           
        goto error;                                                                                                                         
    }                                                                                                                                       

Perhaps a running msglist of all uncompleted requests is probably the simplest way to handle this on both sides.

Edit: oh wait, msglist is maybe not good given how long the requests could be with the KVS. This isn't like disconnect where it's a presumed rare activity. We'll adding/removing all the time. hmmmm

chu11 avatar Jun 17 '24 21:06 chu11

hmmm should we be handling this in a lot more modules than we do? Just skimmed kvs-watch and job-info and I don't think we handle it there either.

chu11 avatar Jun 19 '24 23:06 chu11