mercury
mercury copied to clipboard
Mercury no longer returns an error when RPC id does not match
I have a test (using Margo) that includes a test case for sending an RPC with an incorrect hg_id_t
, not matching a registered RPC on the server. Mercury used to correctly return an error in such case (HG_NO_MATCH
in the info->ret
field of the callback argument), but with mercury 2.2.0 I'm not getting such an error anymore. The client is getting an HG_SUCCESS
and perceives the RPC as having executed.
Note: the test in question is here: the client registers the RPC with a different name than on the server, then creates a handle and forwards it, and gets an HG_SUCCESS from the server.
I'm not sure, we have a test for that. You can see it if you run ctest -R rpc -VV
. It does return HG_NOENTRY
as expected.
Can you point me to the specific test source? The margo test is pretty simple, so I don't know what changed in Mercury 2.2.0 that made it no longer pass.
I may not be looking at the right place, I looked here: https://github.com/mercury-hpc/mercury/blob/master/Testing/test_rpc.c but it seems that the only test that is expecting a HG_NOENTRY is one where the sender hasn't registered the RPC id provided. The use-case I'm describing is different: the sender has registered it, but not the receiver.
That should be similar to that: https://github.com/mercury-hpc/mercury/blob/master/Testing/test_rpc.c#L795 ? I think we are talking about the same use case?
I don't see anything different between that test and yours ?
I'm confused, here shouldn't it be hg_ret != HG_NOENTRY
rather than hg_ret != HG_SUCCESS
?
Ok there was a bug in my margo test, which is fixed now, and the test passes, so the problem is somewhere else. The margo test was meant to reproduce a problem I was seeing in Yokan (a problem that I still have), where when using mercury 2.2.0, forwarding to a wrong "provider id" (the 16 least significant bits of the RPC id) wasn't causing an HG_NO_MATCH/HG_NOENTRY to be returned. I still need to investigate why this happens, as I don't see the problem with mercury 2.1.0...
I'm confused, here shouldn't it be
hg_ret != HG_NOENTRY
rather thanhg_ret != HG_SUCCESS
?
ok yes we could do that, I think it would be better than what we do currently, we just check for it here: https://github.com/mercury-hpc/mercury/blob/master/Testing/test_rpc.c#L290
I'm not entirely sure about the provider id error you see though, are you using __attribute__((__packed__))
on your header ? we haven't really made any changes to anything related to these return codes etc afaik except using packed attributes instead of pragma pack but I would not expect it to change anything...
I'm confused, here shouldn't it be
hg_ret != HG_NOENTRY
rather thanhg_ret != HG_SUCCESS
?ok yes we could do that, I think it would be better than what we do currently, we just check for it here: https://github.com/mercury-hpc/mercury/blob/master/Testing/test_rpc.c#L290
I understand now why at the line I mentioned it's hg_ret != HG_SUCCESS
and not hg_ret != HG_NOENTRY
; it tests for the return value of hg_test_rpc
, not the callback's info->ret
value. But regarding the line you quoted, there is a difference between expecting an HG_NOENTRY
from HG_Create
and from the callback in HG_Forward
. In the former, the RPC hasn't been registered in the sender, and HG_Create
will simply fail to create a handle. In the latter, the RPC hasn't been registered on the destination and the HG_NOENTRY
is expected in info->ret
in the callback passed to HG_Forward
. In fact, I see here that HG_NOENTRY
is explicitly ignored. So the behavior I'm describing is effectively not tested.
It is tested, it just relies on the message being printed, you can see it when running the test but yes I agree with you it'd be cleaner to not ignore the error code and instead compare against it, I'll push a cleanup fix for it later on.
I bisected commits from 2.1.0 to 2.2.0 to get to the one that makes the problem appear in my code, and it's this one: https://github.com/mercury-hpc/mercury/commit/6fed707d9f2584cb71bbbd8ac083eed9a4ff4408
I don't know exactly what it does. In my setup Mercury is built via spack with those variants: ~bmi~boostsys+checksum+debug~hwloc~ipo~mpi+ofi~psm~psm2+shared+sm~ucx~udreg
.
I also tested with ~checksum
and the problem is still there.
Margo registers its RPCs by computing a 8-byte hash of the string name, then shifts this hash left 2 bytes to make space for a short provider id (see here). The resulting id is then used to register the RPC using HG_Register
.
My application registers an bunch of RPCs with a specific provider id (42). I then proceed to try forwarding an RPC to a provider id of 43. I know for a fact there is no matching handler on the server because all the registered RPC ids have their least 2 significant bytes representing 42, not 43. And yet, the forward function works fine and the result in its callback is HG_SUCCESS
. Nothing gets called on the server.
Ah I figured it out! To reproduce the problem, I need to register an RPC with a NULL handler (as if I were a client), then forward the RPC to myself. This returns HG_SUCCESS. This doesn't seem to be the case when sending an RPC to another process, only when sending to self.
I did a reproducer using Margo here and there is an issue in Margo to track the problem here. To summarize:
Sending an RPC to a destination that has associated this RPC with a NULL handler has a different behavior depending on whether the receiver is self or not. If the receiver is self, the sender will get an HG_SUCCESS
back, even though nothing was called. If the receiver is not self, the sender will get an HG_INVALID_ARG
error. In both situations, one would expect an HG_NOENTRY
.
The weird thing is that I can also reproduce the problem with Mercury 2.1.0 in Margo, but in another library I had no issue until Mercury 2.2.0.
I think having HG_INVALID_ARG
being returned makes sense for that particular case no ? HG_NOENTRY
means that it didn't find the RPC for that ID but if you have associated a NULL handler, that's another story imo, it means you have passed an invalid argument to your register routine. I'll look at the self send case.
Wrt the RPC ID issue that you have, which ID do you receive on the server end when you send ID=43 ? still 43 or some other random value ?
Wrt the RPC ID issue that you have, which ID do you receive on the server end when you send ID=43 ? still 43 or some other random value ?
In the cases that are failing, nothing gets called on the server. It's really not an RPC ID problem, the problem is getting back HG_SUCCESS when the RPC was registered with NULL on the server. I'm fine with HG_INVALID_ARG, at least it's an error code that's not HG_SUCCESS. However in Margo this means that we will have 2 different errors depending on whether the RPC is sent to self (in which case it because of the way margo works, it will end up sending to a NULL handler) or not, hence my preference of an HG_NOENTRY.
Cannot reproduce anymore, this must have been fixed in previous commits. Re-open if you still have issues on this topic.