
Add support for core affinity within NA / HG

Open soumagne opened this issue 9 years ago • 7 comments

Currently, all posted RPC requests are initially attached to a single core. Some work will be required so that requests can be distributed over multiple cores and progress can be made on requests bound to a specific core. Additionally, the NA layer and plugins may also need to provide specific callbacks to facilitate that progress and expose the core affinity of requests. We should look at how libfabric handles this.

soumagne avatar Feb 03 '16 04:02 soumagne

Must allow things like: http://ofiwg.github.io/libfabric/master/man/fi_psm.7.html

soumagne avatar Feb 05 '16 05:02 soumagne

Hi, for core affinity, our basic considerations are:

  1. The upper layer creates different affinity contexts at init time, for example one affinity context per NUMA node.
  2. Each affinity context has its own hg_context and na_context. The upper layer calls _progress separately for every context.
  3. Thread and memory affinity are bound to the context.

One problem is on the RPC server side: it would be nice if we could select a different context for each RPC request, from the moment the request is received to the end of handling the RPC and sending the response. Could the thoughts below be satisfied?

  1. Select the context by a field inside the RPC header, for example an 8-byte (or 16/32-byte) cookie value.
  2. The client side can specify the cookie when sending the RPC request; when the server side receives the request, it peeks at the cookie value in the header and then calls an upper-layer-registered callback to hash the request to one context and add it to that context's queue.
  3. When the upper layer calls _progress on a specific context, it only handles the RPC requests hashed to that context. (Step 2 above may also be performed when the upper layer calls _progress.)
  4. In the steps above we want to minimize execution context switches, remote memory accesses to other NUMA nodes, and memory copies. (Can they be totally eliminated? That may be difficult, as we first need to receive the RPC request and only then check the cookie, so by the time the packet is received it may already be on a NUMA node different from the one the cookie hashes to...)

Do you think the above requirements are reasonable for core affinity handling?

Another related problem is the current data structure inside Mercury: one hg_context holds one hg_class pointer, and one hg_class holds one na_context and one na_class, with the related queues and op_list bound to the hg_context and the NA private class. But the RPC hash table (func_map) lives inside hg_class, so if we use a separate hg_context/hg_class/na_context/na_class for each affinity context, the RPC ID space will be split, since each hg_class has its own RPC hash table. So the related internal data structures may need to be reorganized.

Thanks.

liuxuezhao avatar Feb 18 '16 09:02 liuxuezhao

The MICA paper goes into hardware-assisted affinity management in the context of key-value stores (https://www.usenix.org/conference/nsdi14/technical-sessions/presentation/lim). I'm sure there are plenty of other examples in the literature we can take inspiration from, including the works that MICA cites. Is there an LNET paper? They have done a lot of this, haven't they?

I'm wondering what the potential issues are in having multiple threads (and message queues) hitting on the same network endpoint (MICA, for instance, uses different ports for different cores and client-aware core-routing). Currently, progression is exclusive and tied to a single context, so there can be a good deal of contention when you have a multithreaded service grabbing network events. There's also been some question about what the semantics for multi-context, single-class are and how to properly use them. Because of this, for our prototyping activities we've been simply dedicating a single thread to progress mercury, followed by a thread handoff for RPC handling.

Under your proposed method, is progression no longer tied to a context (and step 3 refers to trigger rather than progress)? I.e. the context mapping would be done internally by Mercury? Or would we have a set of ports representing a single Mercury endpoint, have each service thread bound to a port/core/numa domain, and have clients/servers agree on a port mapping semantic (essentially what MICA does)?

What's "the upper layer" in this case?

JohnPJenkins avatar Feb 18 '16 15:02 JohnPJenkins

For LNET affinity, you can find some slides by googling it. There is a wiki page about it as well: https://wiki.hpdd.intel.com/display/PUB/MDS+SMP+Node+Affinity+High+Level+Design It basically follows a method similar to the proposal above (I discussed this with an LNET developer).

For the issue of "having multiple threads (and message queues) hitting on the same network endpoint", the basic approach is to bind the executing thread and memory to the affinity context, avoiding extra context switches, remote memory accesses, and memory copies. However, there is still the problem of how to dispatch RPC requests to the different affinity contexts at the very beginning. What we propose is adding a "cookie" field to the RPC header; on the RPC server side, Mercury's user can register a hash callback (cookie value in, context ID out), which Mercury then calls to dispatch each RPC request to a context's queue. One remaining problem is that Mercury first needs to receive the RPC request and only then look up and hash the cookie (by calling the hash callback); after receiving, the packet is already in host memory and may sit on a node different from the desired affinity context. The paper you posted tries to resolve this with a design similar to Receive-Side Scaling and Flow Director, but that depends on specific hardware features. I am not sure whether a lower-layer driver such as OFI (libfabric) supports a similar mechanism; if not, it seems fine to just receive the RPC request, look up the cookie in the header, and dispatch to the right context, so that the later bulk transfer and RPC handling have good affinity.

The "upper layer" is just Mercury's caller (for example DAOS or another file system). Progression is still tied to a context (each context has its own completion queue, unexpected op queue, etc.; the upper layer's progress may be a combination of hg_progress and hg_trigger), and the upper layer progresses the different contexts separately. The "port" in your description seems similar to a "context"; then yes, we would have a set of contexts for a single Mercury endpoint, with each service thread bound to a context (one context per NUMA node/domain). But we may not need the client and server to agree on the context mapping; we only dispatch RPC requests to contexts based on a specific field in the header and the caller-registered hash callback.

Is the above clear, and does it make sense? Thanks.

liuxuezhao avatar Feb 19 '16 07:02 liuxuezhao

Most of that does make sense, thanks.

Based on a quick look (https://ofiwg.github.io/libfabric/), OFI does have tag matching. You probably are already thinking this, but it seems like the matching mechanism should probably be part of the NA API, right? It's unclear what the mechanism would be for other transports (e.g., cci), but you could for instance provide a default matching implementation that does tag matching by hand, through the hashing mechanism you describe.

RE my point about sharing the network endpoint between multiple cores: are you thinking about having a dedicated thread pulling things from the network and dispatching? The issue I was thinking about is concurrency problems from having a bunch of threads contend on network buffers, e.g. send and recv syscalls to the same port - in that case, everyone's stepping on each other's toes grabbing messages, doing tag matching, and inserting into others' queues. The multiple-ports idea is just a hack around that - have each core maintain a network connection that it progresses independently, with some RPC-caller-side knowledge of which port to use. Of course, high-performance networks have the capability to dispatch in hardware (OFI, the approach used in MICA, etc. - I'm sure IB verbs has something along these lines...).

JohnPJenkins avatar Feb 19 '16 15:02 JohnPJenkins

@JohnPJenkins thanks for sharing that paper. Yes, I don't know if we want to enable that behavior for all the plugins, because as you say there may be concurrency issues. I would have thought that this is a feature that will mostly serve future plugins like the libfabric plugin. @liuxuezhao That makes sense to me as well and is the direction we want to go. And you're right, there will need to be some internal reorganization to handle that. The NA plugin will need to provide information back to the upper layer. Changes made in the NA layer to make it allocate the buffers (instead of having HG allocate them) will also help, I think.

soumagne avatar Feb 19 '16 17:02 soumagne

First version added in 0af463e3eff24ce1fecb4ee13a43b93ee78a23ba

soumagne avatar Apr 17 '17 22:04 soumagne