flux-core kvs: uncompleted RPCs should receive ENOSYS response when module is unloaded

Problem: we ran flux module reload kvs on a live system and it seemed evident that pending commit requests did not receive an error.

I had thought that rpc_track would ensure that pending RPCs get an ENOSYS when the module handling them is unloaded but that may not be the case. It probably should be so that requests don't hang forever.

May 17 '24 01:05 garlick

Yeah we don't do this for modules. I was confusing this with the generation of <service>.disconnect messages when a module unloads so that any services it is using are notified that it's not there anymore.

I think the rationale was that modules should do their own responding to RPCs they are tracking since they do have the opportunity to do so. Changing the name of this issue to target the KVS module.

May 17 '24 01:05 garlick

began looking at this and I think the commit side of this is more doable b/c we "track" the requests within the transaction manager already.

However, on the lookup/get side, we do not track current requests at all, making this more annoying/difficult.

Jun 17 '24 19:06 chu11

Oh this might get slightly ugly since like to "replay" lookups that are stalled for some reason. We'd have to ensure a lookup isn't already in the msglist or whatever.

Jun 17 '24 19:06 garlick

Oh this might get slightly ugly since like to "replay" lookups that are stalled for some reason. We'd have to ensure a lookup isn't already in the msglist or whatever.

Doh! And stalls/replays also happen on the commit side, I didn't realize till just now

    if (!(root = getroot (ctx, ns, mh, msg, NULL, &stall))) {                                                                               
        if (stall)                                                                                                                          
            return;                                                                                                                         
        goto error;                                                                                                                         
    }                                                                                                                                       
                                                                                                                                            
    if (!(tr = treq_create_rank (ctx->rank, ctx->seq++, 1, flags))) {                                                                       
        flux_log_error (h, "%s: treq_create_rank", __FUNCTION__);                                                                           
        goto error;                                                                                                                         
    }

Perhaps a running msglist of all uncompleted requests is probably the simplest way to handle this on both sides.

Edit: oh wait, msglist is maybe not good given how long the requests could be with the KVS. This isn't like disconnect where it's a presumed rare activity. We'll adding/removing all the time. hmmmm

Jun 17 '24 21:06 chu11

hmmm should we be handling this in a lot more modules than we do? Just skimmed kvs-watch and job-info and I don't think we handle it there either.

Jun 19 '24 23:06 chu11

flux-core flux-core copied to clipboard

kvs: uncompleted RPCs should receive ENOSYS response when module is unloaded

flux-core
flux-core copied to clipboard