mercury icon indicating copy to clipboard operation
mercury copied to clipboard

Mercury no longer returns an error when RPC id does not match

Open mdorier opened this issue 2 years ago • 17 comments

I have a test (using Margo) that includes a test case for sending an RPC with an incorrect hg_id_t, not matching a registered RPC on the server. Mercury used to correctly return an error in such case (HG_NO_MATCH in the info->ret field of the callback argument), but with mercury 2.2.0 I'm not getting such an error anymore. The client is getting an HG_SUCCESS and perceives the RPC as having executed.

Note: the test in question is here: the client registers the RPC with a different name than on the server, then creates a handle and forwards it, and gets an HG_SUCCESS from the server.

mdorier avatar Aug 22 '22 13:08 mdorier

I'm not sure, we have a test for that. You can see it if you run ctest -R rpc -VV. It does return HG_NOENTRY as expected.

soumagne avatar Aug 22 '22 15:08 soumagne

Can you point me to the specific test source? The margo test is pretty simple, so I don't know what changed in Mercury 2.2.0 that made it no longer pass.

mdorier avatar Aug 22 '22 16:08 mdorier

I may not be looking at the right place, I looked here: https://github.com/mercury-hpc/mercury/blob/master/Testing/test_rpc.c but it seems that the only test that is expecting a HG_NOENTRY is one where the sender hasn't registered the RPC id provided. The use-case I'm describing is different: the sender has registered it, but not the receiver.

mdorier avatar Aug 22 '22 16:08 mdorier

That should be similar to that: https://github.com/mercury-hpc/mercury/blob/master/Testing/test_rpc.c#L795 ? I think we are talking about the same use case?

soumagne avatar Aug 22 '22 16:08 soumagne

I don't see anything different between that test and yours ?

soumagne avatar Aug 22 '22 16:08 soumagne

I'm confused, here shouldn't it be hg_ret != HG_NOENTRY rather than hg_ret != HG_SUCCESS ?

mdorier avatar Aug 22 '22 18:08 mdorier

Ok there was a bug in my margo test, which is fixed now, and the test passes, so the problem is somewhere else. The margo test was meant to reproduce a problem I was seeing in Yokan (a problem that I still have), where when using mercury 2.2.0, forwarding to a wrong "provider id" (the 16 least significant bits of the RPC id) wasn't causing an HG_NO_MATCH/HG_NOENTRY to be returned. I still need to investigate why this happens, as I don't see the problem with mercury 2.1.0...

mdorier avatar Aug 22 '22 18:08 mdorier

I'm confused, here shouldn't it be hg_ret != HG_NOENTRY rather than hg_ret != HG_SUCCESS ?

ok yes we could do that, I think it would be better than what we do currently, we just check for it here: https://github.com/mercury-hpc/mercury/blob/master/Testing/test_rpc.c#L290

soumagne avatar Aug 22 '22 19:08 soumagne

I'm not entirely sure about the provider id error you see though, are you using __attribute__((__packed__)) on your header ? we haven't really made any changes to anything related to these return codes etc afaik except using packed attributes instead of pragma pack but I would not expect it to change anything...

soumagne avatar Aug 22 '22 19:08 soumagne

I'm confused, here shouldn't it be hg_ret != HG_NOENTRY rather than hg_ret != HG_SUCCESS ?

ok yes we could do that, I think it would be better than what we do currently, we just check for it here: https://github.com/mercury-hpc/mercury/blob/master/Testing/test_rpc.c#L290

I understand now why at the line I mentioned it's hg_ret != HG_SUCCESS and not hg_ret != HG_NOENTRY; it tests for the return value of hg_test_rpc, not the callback's info->ret value. But regarding the line you quoted, there is a difference between expecting an HG_NOENTRY from HG_Create and from the callback in HG_Forward. In the former, the RPC hasn't been registered in the sender, and HG_Create will simply fail to create a handle. In the latter, the RPC hasn't been registered on the destination and the HG_NOENTRY is expected in info->ret in the callback passed to HG_Forward. In fact, I see here that HG_NOENTRY is explicitly ignored. So the behavior I'm describing is effectively not tested.

mdorier avatar Aug 23 '22 11:08 mdorier

It is tested, it just relies on the message being printed, you can see it when running the test but yes I agree with you it'd be cleaner to not ignore the error code and instead compare against it, I'll push a cleanup fix for it later on.

soumagne avatar Aug 23 '22 17:08 soumagne

I bisected commits from 2.1.0 to 2.2.0 to get to the one that makes the problem appear in my code, and it's this one: https://github.com/mercury-hpc/mercury/commit/6fed707d9f2584cb71bbbd8ac083eed9a4ff4408

I don't know exactly what it does. In my setup Mercury is built via spack with those variants: ~bmi~boostsys+checksum+debug~hwloc~ipo~mpi+ofi~psm~psm2+shared+sm~ucx~udreg.

I also tested with ~checksum and the problem is still there.

Margo registers its RPCs by computing a 8-byte hash of the string name, then shifts this hash left 2 bytes to make space for a short provider id (see here). The resulting id is then used to register the RPC using HG_Register.

My application registers an bunch of RPCs with a specific provider id (42). I then proceed to try forwarding an RPC to a provider id of 43. I know for a fact there is no matching handler on the server because all the registered RPC ids have their least 2 significant bytes representing 42, not 43. And yet, the forward function works fine and the result in its callback is HG_SUCCESS. Nothing gets called on the server.

mdorier avatar Aug 25 '22 10:08 mdorier

Ah I figured it out! To reproduce the problem, I need to register an RPC with a NULL handler (as if I were a client), then forward the RPC to myself. This returns HG_SUCCESS. This doesn't seem to be the case when sending an RPC to another process, only when sending to self.

mdorier avatar Aug 25 '22 11:08 mdorier

I did a reproducer using Margo here and there is an issue in Margo to track the problem here. To summarize:

Sending an RPC to a destination that has associated this RPC with a NULL handler has a different behavior depending on whether the receiver is self or not. If the receiver is self, the sender will get an HG_SUCCESS back, even though nothing was called. If the receiver is not self, the sender will get an HG_INVALID_ARG error. In both situations, one would expect an HG_NOENTRY.

The weird thing is that I can also reproduce the problem with Mercury 2.1.0 in Margo, but in another library I had no issue until Mercury 2.2.0.

mdorier avatar Aug 25 '22 14:08 mdorier

I think having HG_INVALID_ARG being returned makes sense for that particular case no ? HG_NOENTRY means that it didn't find the RPC for that ID but if you have associated a NULL handler, that's another story imo, it means you have passed an invalid argument to your register routine. I'll look at the self send case.

soumagne avatar Aug 26 '22 17:08 soumagne

Wrt the RPC ID issue that you have, which ID do you receive on the server end when you send ID=43 ? still 43 or some other random value ?

soumagne avatar Aug 26 '22 17:08 soumagne

Wrt the RPC ID issue that you have, which ID do you receive on the server end when you send ID=43 ? still 43 or some other random value ?

In the cases that are failing, nothing gets called on the server. It's really not an RPC ID problem, the problem is getting back HG_SUCCESS when the RPC was registered with NULL on the server. I'm fine with HG_INVALID_ARG, at least it's an error code that's not HG_SUCCESS. However in Margo this means that we will have 2 different errors depending on whether the RPC is sent to self (in which case it because of the way margo works, it will end up sending to a NULL handler) or not, hence my preference of an HG_NOENTRY.

mdorier avatar Aug 26 '22 21:08 mdorier

Cannot reproduce anymore, this must have been fixed in previous commits. Re-open if you still have issues on this topic.

soumagne avatar Jun 06 '23 20:06 soumagne