Proposing remote action invocations for low-latency communication
Currently there are two ways for actions to be handled:
async(policy=async || policy=sync, action, locality, args...)
This is an asynchronous call that is executed remotely and returns a value via a future. Both the local and remote ends of the action are queued by the scheduler, so that parcels are sent when the scheduler next completes a task. Likewise, when received and decoded, actions are invoked when the scheduler is next available, i.e. no tasks are interrupted.
async(policy=fork, action, locality, args...)
This action is dispatched immediately using the new yield_to feature. At the receiving end, the action is currently queued as before. We should investigate whether the task currently handling the parcelport can yield_to the decoded action and invoke it immediately, to provide a symmetric implementation that sends immediately and returns a future immediately (or at least bypasses queues). If a new launch::policy is required to differentiate between an immediate and a lazy return, then one should be added.
To enhance these, a proposal (draft 1):
async(policy=fork, action=an_rdma_action_type, locality, rdma_object_Id, ...)
This action bypasses the normal parcel send process and is only valid for trivially_copyable types (or other rdma-friendly types that might be supported/specialized). Data is transferred to the target location as soon as possible; the target is identified by an Id argument referencing a remote rdma_object&lt;T&gt; component that can be queried for rdma memory handles etc. A future is returned that becomes ready when the rdma operation completes. rdma_objects would need to be queried beforehand to obtain the Id, using the existing infrastructure (AGAS component registration etc.). The remote end of the rdma_action would set an atomic_bool flag (write or read) whenever it writes to or reads from the remote memory. The flags would be defined/contained by the base action type, in addition to the &lt;T&gt; stored within. The rdma_action would support partial reads/writes of remote memory, so that an rdma_object&lt;std::array&lt;T, N&gt;&gt; could have subsets (n &lt; N) of its elements read/written at arbitrary offsets.
async(policy=fork, action=an_rdma_action_type, locality, rdma_object_Id, signal_Id/index...)
This is the same type of rdma as the previous example; however, the rdma_object&lt;T&gt; can support a number of extra signals, which could be simple atomic_bool flags or futures that can be made ready. The remote end of each rdma operation would have the flag/future for the given index set when each operation takes place. A policy would need to be defined for what action to take in the event that two or more rdma operations to the same index occur before the receiving end clears the state. The signals would need to support set/reset so that they could be used multiple times, and the remote end of the action would need a mechanism to obtain a new future after each use (iteration, for example). Using an atomic_bool array with N slots would potentially make implementation of certain collective operations easier. An rdma array with O(N) elements/signals could be mapped to localities for certain checks.
The rdma_action_type is a base class that provides the put/get operations (memory registration via the rdma_object type) and the signalling of flags; the user would instantiate their own actions templated over the required data type.
Note. I did not say it, but it is assumed that the receiving end of an rdma_action would receive a message that an rdma operation had taken place, and the receiving parcelport would be directly responsible for setting flag/future states without going via the usual parcel decode and action invocation.
Note 2. The remote end of the action could also register actions to be triggered by index whenever an rdma operation takes place, so that the first rdma_action signature would trigger an action remotely on each completion, whereas the second would trigger an action with the signalled index as a parameter, or potentially a different action per index.
Instead of adding new special arguments to async, wouldn't it be possible to have something like:
async(policy, an_rdma_action_type(rdma_object_id), locality, ...)
or
async(policy, an_rdma_action_type(), rdma_destination(locality, rdma_object_id), ...)
?
That looks nicer (I like the second one, slightly more than the first, but I'm not sure why)
The second one is less desirable, though, as it splits the notion of sending something through the RDMA extensions and requires the user to specify it through two of the arguments to async. We would have to answer questions like 'What should happen if only one is specified?'...
I've changed my mind, I like the first better.
Further discussion on this topic from IRC might clarify some points:
heller 1) Why do you think we need to bypass serialization? What are your hopes there? What do you think makes the serialization layer slow?
jbjnr I'm not particularly concerned about serialization because for trivial types it is very fast. I'm concerned about the way actions are dispatched/handled.
heller right
jbjnr if the symmetric mode of the async(fork, ...) were possible, it might be that this would be fast enough to allow us to do most of what I want without the rdma actions.
heller so how are your rdma_actions going to solve this?
heller what do you mean with symmetric mode?
jbjnr but regardless of that, the ETHZ people will not buy it unless there is also an rdma approach as well.
jbjnr in the symmetric mode, we need to bypass the queues at the receive end as well as the send end
jbjnr (so the parcelport can invoke the action directly, rather than queuing it when it is received).
heller ok, that's possible
heller but has nothing to do with rdma in particular
heller what I don't like at all about what you describe, is that you essentially bypass any abstractions
heller actions are just descriptions of work to be executed
heller RDMA approach == setting/getting memory in a one sided fashion without executing work at all?
jbjnr we keep the abstraction of an 'action' on the send side, but bypass the idea of parcels. there are applications where we do not need to trigger an 'action' on the receive side, just copy some data and maybe set a flag, which the PP can do.
jbjnr synchronization is more of a problem, but that's why I started writing the issue, so we can devise a system that would work. I'm open to changes.
jbjnr the ETHZ want a task based RDMA system and they don't give a shite about our action abstractions.
heller what do you think a parcel is?
jbjnr they want to write C style code with rdma and tasks and they are not going to send maps of vectors of structs from node to node
heller sure
jbjnr the parcel is an archive + the action
jbjnr I just want to remove the action at the remote end for certain ops
heller I think this is where the misconception lies
heller the parcel itself is more or less just the content of a message
heller and that is: destination, action, action arguments
heller more or less
heller oh, and of course the continuation
heller the continuation is then what sets the remote flag (which of course triggers a parcel again)
heller so yes, with your symmetric way, and the data_action example, we have everything there already, IMHO
heller what's missing is a way to tell the PP, when you got this action, here is the memory where you can directly get into, which was preregistered
jbjnr this is what the rdma_object&lt;T&gt; would provide, the action type would know how to query the object to get the memory key (which was delivered earlier)
heller so is the rdma_object a component then?
jbjnr actually, we'd need 2: one to copy from locally, one to copy to remotely, ideally.
jbjnr yes it is a component
jbjnr a lightweight one if we can
heller not a big problem
heller so what's missing is to post a memory buffer to the PP, and a mechanism to match a received parcel with any posted memory buffer
heller correct?
jbjnr so each node would create an rdma_object&lt;T&gt;, register it, then node A can get the rdma_object&lt;T&gt; Id from node B by querying AGAS and then do RDMA operations to/from it, using that Id. The rdma action is just my shortcut allowing us to bypass certain parcelport stuff
jbjnr yes I think so ^^
heller which we really shouldn't need
heller the bypassing that is
jbjnr the bypassing allows me to do different rdma operations in the custom action, than I do in the parcel action
heller which would be?
jbjnr a different entrypoint into the parcel send stuff
jbjnr gtg meeting in 1 min
heller what would it do differently to the "normal" entry point?
jbjnr rdma first, then send a message after. currently we send the header then do rdma once the remote end has initialized things. here the things are initialized first (rdma_object registration), then all we need to do is the rdma; after that, if synchronization is requested, send a small sync message to set flags at the remote end etc and return the done future without waiting for a return message etc.
jbjnr going now
heller jbjnr: I see, well, that sounds plausible
heller jbjnr: we only need to pay extra attention to not build a fabric specific solution here
jbjnr heller: agreed, the TCP PP would have to drop back to normal actions and perf would suck bad, but as long as we warn users ....
heller jbjnr: I still think we don't need a special action type. I think posting a receive buffer would be good enough
heller that way, we can build data structures that benefit from pre posted parcels, so to say
jbjnr if we can keep rdma_object<T> and when that is passed as a parameter to a remote action - have the PP take a different route for the send/receive etc, then fine, but I simply don't know enough about the whole stack to be confident that I can make it work. Providing a new action type makes it a bit cleaner and easier for me. If I make this work and then you wrap it back into the old action...
jbjnr ...types, that'd be fine
heller sure
heller let's make it work first
heller yeah
heller jbjnr: but how about making it a priority to actually get the ibverbs PP in?
jbjnr I was hoping nobody would say that
jbjnr (about getting verbs in)
heller I think that's very important
heller without that, all the talking about special rdma handling is almost void
jbjnr give me a deadline and I'll aim for then
jbjnr I need to put back shared receive queues, fix the startup connection and clean up the poll stuff.
heller jbjnr: end of the month?
jbjnr ok. I'll do it
heller yeha!