
proposing remote action invocations for low latency communication

Open biddisco opened this issue 9 years ago • 7 comments

Currently there are two ways for actions to be handled:

async(policy=async || policy==sync, action, locality, args...)

This is an asynchronous call that is executed remotely and returns a value via a future. Both the local and remote ends of the action are queued by the scheduler, so parcels are sent when the scheduler next completes a task. Likewise, when a parcel is received and decoded, the action is invoked when the scheduler is next available, i.e. no running tasks are interrupted.

async(policy=fork, action, locality, args...)

This action is dispatched immediately using the new yield_to feature. At the receiving end, the action is currently queued as before. We should investigate whether the task currently handling the parcelport can yield_to the decoded action and invoke it immediately, to provide a symmetric implementation that sends immediately and returns a future immediately (or at least bypasses the queues). If a new launch::policy is required to differentiate between an immediate and a lazy return, one should be added.
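For concreteness, a minimal sketch of how these two existing call forms look with a plain action; the action, arguments and target here are illustrative only:

    #include <hpx/hpx.hpp>
    #include <hpx/hpx_main.hpp>

    int add(int a, int b) { return a + b; }
    HPX_PLAIN_ACTION(add, add_action);   // declares the remotely invocable action type

    int main()
    {
        hpx::id_type target = hpx::find_here();   // a remote locality in practice

        // queued form: the parcel goes out when the scheduler next completes a task
        hpx::future<int> f1 = hpx::async<add_action>(hpx::launch::async, target, 1, 2);

        // fork form: dispatched immediately on the sending side via yield_to
        hpx::future<int> f2 = hpx::async<add_action>(hpx::launch::fork, target, 3, 4);

        f1.get();
        f2.get();
        return 0;
    }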

Proposal to enhance these (draft 1):

async(policy=fork, action=an_rdma_action_type, locality, rdma_object_Id, ...)

This action bypasses the normal parcel send process and is only valid for trivially copyable types (or other RDMA-friendly types that might be supported/specialized). Data is transferred as soon as possible to the target location, which is identified by an Id argument referencing a remote rdma_object<T> component that can be queried for RDMA memory handles etc. A future is returned that becomes ready when the RDMA operation completes. rdma_objects would need to be queried beforehand to obtain the Id using the existing infrastructure (AGAS component registration etc.). The remote end of the rdma_action would set an atomic_bool flag (write or read) whenever the remote memory is written to or read from. The flags would be defined/contained by the base action type, in addition to the <T> stored within. The rdma_action would support partial reads/writes of remote memory, so that an rdma_object<std::array<T, N>> could have subsets (n < N) of its elements read/written at arbitrary offsets.
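To make the shape of this proposal more concrete, a hypothetical sketch; rdma_object, rdma_memory_handle and rdma_put_action do not exist in HPX today and are named here purely for discussion:

    #include <hpx/hpx.hpp>
    #include <atomic>

    using rdma_memory_handle = void*;   // placeholder for a fabric memory key/handle

    // Hypothetical lightweight component wrapping registered (pinned) memory for T
    template <typename T>
    struct rdma_object
    {
        hpx::id_type id() const;             // AGAS id used to address it remotely
        rdma_memory_handle handle() const;   // key the parcelport needs for put/get

        std::atomic<bool> written{false};    // set when a remote write lands
        std::atomic<bool> read{false};       // set when a remote read takes place
        T data;                              // the registered payload
    };

    // Proposed call site (sketch only): write 'count' elements of 'src' into the
    // remote rdma_object<double> identified by 'remote_id', starting at 'offset';
    // the returned future becomes ready when the RDMA put has completed.
    //
    //   hpx::future<void> f = hpx::async<rdma_put_action<double>>(
    //       hpx::launch::fork, target_locality, remote_id, offset, src, count);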

async(policy=fork, action=an_rdma_action_type, locality, rdma_object_Id, signal_Id/index...)

This is the same type of RDMA as the previous example; however, the rdma_object<T> can support a number of extra signals, which could be simple atomic_bool flags or futures that can be made ready. The remote end of each RDMA operation would have the flag/future for the given index set when the operation takes place. A policy would need to be defined for what to do in the event that two or more RDMA operations to the same index occur before the receiving end clears the state. The signals would need to support set/reset so that they could be used multiple times, and the remote end of the action would need a mechanism to obtain a new future after each use (each iteration, for example). Using an atomic_bool with N slots would potentially make the implementation of certain collective operations easier. An RDMA array with O(N) elements/signals could be mapped to localities for certain checks.

The rdma_action_type is a base class that provides the put/get operations (memory registration via the rdma_object type) and the signalling of flags; the user would instantiate their own actions templated over the required data type.
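A hedged sketch of what such a base type could provide, covering both the plain read/write flags and the indexed signals from the second signature; all of the names are invented for illustration:

    #include <hpx/hpx.hpp>
    #include <array>
    #include <atomic>
    #include <cstddef>

    // Hypothetical base for user-defined RDMA actions: T is the transferred type,
    // NumSignals the number of indexed signal slots (0 for the first signature).
    template <typename T, std::size_t NumSignals = 0>
    struct rdma_action_base
    {
        // one-sided transfers against a remote rdma_object<T>, identified by its id;
        // each returns a future that becomes ready when the fabric reports completion
        hpx::future<void> put(hpx::id_type dst, T const* src, std::size_t n, std::size_t offset);
        hpx::future<void> get(hpx::id_type src, T* dst, std::size_t n, std::size_t offset);

        // indexed signals, intended to be set directly by the receiving parcelport
        std::array<std::atomic<bool>, NumSignals> signals;

        void set_signal(std::size_t i)   { signals[i].store(true, std::memory_order_release); }
        void reset_signal(std::size_t i) { signals[i].store(false, std::memory_order_release); }
    };

    // a user action would then be e.g.: struct my_put : rdma_action_base<double, 4> {};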

biddisco avatar Aug 10 '16 22:08 biddisco

Note. I did not say it above, but it is assumed that the receiving end of an rdma_action would receive a message that an RDMA operation had taken place, and the receiving parcelport would be directly responsible for setting flag/future states without going via the usual parcel decode and action invocation.

biddisco avatar Aug 10 '16 22:08 biddisco

Note 2. The remote end of the action could also register actions to be triggered by index whenever an RDMA operation takes place, so that the first rdma_action signature would trigger an action remotely on each completion, whereas the second would trigger an action with the signalled index as a parameter, or potentially a different action per index.
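One way the per-index triggering could look on the receiving side; rdma_signal_table is invented here purely to illustrate the idea, it is not part of HPX:

    #include <cstddef>
    #include <functional>
    #include <utility>
    #include <vector>

    // Hypothetical table of handlers, one per signal index; the parcelport would
    // invoke the matching handler directly when an RDMA operation for that index
    // completes, without going through parcel decoding and action scheduling.
    struct rdma_signal_table
    {
        using handler = std::function<void(std::size_t /*index*/)>;

        void register_handler(std::size_t index, handler h)
        {
            if (handlers_.size() <= index)
                handlers_.resize(index + 1);
            handlers_[index] = std::move(h);
        }

        // called by the parcelport when the operation for 'index' has completed
        void signal(std::size_t index) const
        {
            if (index < handlers_.size() && handlers_[index])
                handlers_[index](index);
        }

        std::vector<handler> handlers_;
    };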

biddisco avatar Aug 10 '16 23:08 biddisco

Instead of adding new special arguments to async, wouldn't it be possible to have something like:

 async(policy, an_rdma_action_type(rdma_object_id), locality, ...)

or

 async(policy, an_rdma_action_type(), rdma_destination(locality, rdma_object_id), ...)

?
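For the first variant, the action instance itself would carry the target rdma_object id, roughly along these lines (a hypothetical sketch, not an existing type):

    #include <hpx/hpx.hpp>
    #include <utility>

    // Hypothetical: the RDMA action object is constructed with the id of the
    // remote rdma_object<T>, so async() needs no additional special arguments.
    struct an_rdma_action_type
    {
        explicit an_rdma_action_type(hpx::id_type rdma_object_id)
          : rdma_object_id_(std::move(rdma_object_id))
        {}

        hpx::id_type rdma_object_id_;   // obtained earlier via AGAS
        // put/get/signal machinery from the rdma action base would live here
    };

    // usage as proposed above (sketch only):
    //   hpx::async(policy, an_rdma_action_type(rdma_object_id), locality, buffer, count);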

hkaiser avatar Aug 10 '16 23:08 hkaiser

That looks nicer (I like the second one slightly more than the first, but I'm not sure why).

biddisco avatar Aug 11 '16 07:08 biddisco

The second one is less desirable, though, as it splits the notion of sending something through the RDMA extensions and requires the user to specify it through two of the arguments to async. We would have to answer questions like 'What should happen if only one is specified?'...

hkaiser avatar Aug 11 '16 14:08 hkaiser

I've changed my mind, I like the first better.

biddisco avatar Aug 11 '16 14:08 biddisco

Further discussion on this topic from IRC might clarify some points:

heller  1) Why do you think we need to bypass serialization? What are your hopes there? What do you think makes the serialization layer slow?
jbjnr   I'm not particularly concerned about serialization because for trivial types it is very fast. I'm concerned about the way actions are dispatched/handled. 
heller  right
jbjnr   if the symmetric mode of the async(fork, ...) were possible, it might be that this would be fast enough to allow us to do most of what I want without the rdma actions,
heller  so how are your rdma_actions going to solve this?
heller  what do you mean with symmetric mode?
jbjnr   but regardless of that, the ETHZ people will not buy it unless there is also an rdma approach as well.
jbjnr   in the symmetric mode, we need to bypass the queues at the receive end as well as the send end
jbjnr   (so the parcelport can invoke the action directly, rather than queuing it when it is received).
heller  ok, that's possible
heller  but has nothing to do with rdma in particular
heller  what I don't like at all about what you describe, is that you essentially bypass any abstractions
heller  actions are just descriptions of work to be executed
heller  RDMA approach == setting/getting memory in a one sided fashion without executing work at all?
jbjnr   we keep the abstraction of an 'action' on the send side, but bypass the idea of parcels. there are applications where we do not need to trigger an 'action' on the receive side, just copy some data and maybe set a flag, which the PP can do.
jbjnr   synchronization is more of a problem, but that's why I started writing the issue, so we can devise a system that would work. I'm open to changes.
jbjnr   the ETHZ want a task based RDMA system and they don't give a shite about our action abstractions.
heller  what do you think a parcel is?
jbjnr   they want to write C style code with rdma and tasks and they are not going to send maps of vectors of structs from node to node
heller  sure
jbjnr   the parcel is an archive + the action
jbjnr   I just want to remove the action at the remote end for certain ops
heller  I think this is where the misconception lies
heller  the parcel itself is more or less just the content of a message
heller  and that is: destination, action, action arguments
heller  more or less
heller  oh, and of course the continuation
heller  the continuation is then what sets the remote flag (which of course triggers a parcel again)
heller  so yes, with your symmetric way, and the data_action example, we have everything there already, IMHO
heller  what's missing is a way to tell the PP: when you got this action, here is the memory you can directly get into, which was preregistered
jbjnr   this is what the rdma_object<T> would provide, the action type would know how to query the object to get the memory key, (which was delivered earlier)
heller  so is the rdma_object a component then?
jbjnr   actually, we'd need 2: to copy from one locally to one remotely, ideally.
jbjnr   yes it is a component
jbjnr   a lightweight one if we can
heller  not a big problem
heller  so what's missing is to post a memory buffer to the PP, and a mechanism to match a received parcel with any posted memory buffer
heller  correct?
jbjnr   so each node would create an rdma_object<T>, register it, then node A can get the rdma_object<T> Id from node B by querying AGAS and then do RDMA operations to/from it, using that id. The rdma action is just my shortcut to allowing us to bypass certain parcelport stuff
jbjnr   yes I think so ^^
heller  which we really shouldn't need
heller  the bypassing that is
jbjnr   the bypassing allows me to do different rdma operations in the custom action, than I do in the parcel action
heller  which would be?
jbjnr   a different entrypoint into the parcel send stuff
jbjnr   gtg meeting in 1 min
heller  what would it do differently to the "normal" entry point?
jbjnr   rdma first, then send message after. currently we send the header then do rdma once the remote end has initialized things. here things are inited first (rdma_object registration), then all we need to do is rdma; after, if synchronization is requested, send a small sync message to set flags at the remote end etc. and return the done future without waiting for a return message.
jbjnr   going now
heller  jbjnr: I see, well, that sounds plausible
heller  jbjnr: we only need to pay extra attention to not build a fabric-specific solution here
jbjnr   heller: agreed, the TCP PP would have to drop back to normal actions and perf would suck bad, but as long as we warn users ....
heller  jbjnr: I still think we don't need a special action type. I think posting a receive buffer would be good enough
heller  that way, we can build data structures that benefit from pre posted parcels, so to say
jbjnr   if we can keep rdma_object<T> and when that is passed as a parameter to a remote action - have the PP take a different route for the send/receive etc, then fine, but I simply don't know enough about the whole stack to be confident that I can make it work. Providing a new action type makes it a bit cleaner and easier for me. If I make this work and then you wrap it back into the old action...
jbjnr   ...types, that'd be fine
heller  sure
heller  let's make it work first
heller  yeah
heller  jbjnr: but how about making it a priority to actually get the ibverbs PP in?
jbjnr   I was hoping nobody would say that
jbjnr   (about getting verbs in)
heller  I think that's very important
heller  without that, all the talking about special rdma handling is almost void
jbjnr   give me a deadline and I'll aim for then
jbjnr   I need to put back shared receive queues, fix the startup connection and clean up the poll stuff.
heller  jbjnr: end of the month?
jbjnr   ok. I'll do it
heller  yeah!
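To summarise the send-side flow jbjnr describes above (RDMA first, then a small sync message), a rough sketch; parcelport_api, rdma_target and their members are assumptions for illustration, not existing HPX interfaces:

    #include <hpx/hpx.hpp>
    #include <cstddef>

    // Hypothetical handle to an already-registered remote rdma_object
    struct rdma_target
    {
        hpx::id_type locality;
        void*        memory_key;     // exchanged when the rdma_object was registered
        std::size_t  signal_index;   // which remote flag/future to set afterwards
    };

    // Hypothetical parcelport hooks for one-sided operations
    struct parcelport_api
    {
        hpx::future<void> async_rdma_write(void* key, void const* src, std::size_t bytes);
        void send_sync_message(hpx::id_type locality, std::size_t signal_index);
    };

    // The write goes out immediately; once it completes, only a small sync message
    // follows to set the remote flag, and the 'done' future is handed back without
    // waiting for any return message.
    hpx::future<void> rdma_put(parcelport_api& pp, rdma_target dst,
                               void const* src, std::size_t bytes)
    {
        return pp.async_rdma_write(dst.memory_key, src, bytes)
            .then([&pp, dst](hpx::future<void>) {
                pp.send_sync_message(dst.locality, dst.signal_index);
            });
    }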

biddisco avatar Aug 11 '16 14:08 biddisco