Expose the MemPoolInfo of matching memory pools to typed endpoints

Open gpalmer-latai opened this issue 1 year ago • 50 comments

Brief feature description

I would like the ability to somehow query the info of a particular memory pool that will be used for a comms endpoint's messages.

Detailed information

A full proof of concept is linked below:

https://github.com/eclipse-iceoryx/iceoryx/compare/master...gpalmer-latai:iceoryx:gpalmer/expose_mempool

The basic gist is that I've added an m_basePtr member to the MemPoolInfo struct to inform consumers about where the actual memory pool is located. Then I've exposed some APIs to lift that information all the way out to a typed publisher.
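For reference, the struct change amounts to something like the following sketch (the existing fields are reproduced from memory and the exact layout is an assumption; only m_basePtr is new):

```cpp
// iox::mepoo::MemPoolInfo with the proposed addition
struct MemPoolInfo
{
    uint32_t m_usedChunks{0};
    uint32_t m_minFreeChunks{0};
    uint32_t m_numChunks{0};
    uint32_t m_chunkSize{0};
    void* m_basePtr{nullptr}; // new: start address of the pool's chunk memory
};
```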

Alternatives Considered

Instead of lifting the memory pool information out through the layers of port info, it might be preferable if we could

  1. Have more control over which comms endpoints use which memory pools, e.g. by adding service constraints to the memory pool config entries in the RouDi config (only these publishers/subscribers may use this memory pool) as well as "memory constraints" (the memory type of this pool is "pinned")
  2. Be able to query the runtime/RouDi directly for information about a particular memory pool. I know there is some infrastructure for introspection, but I'm not sure how I could use it to find the particular memory pool of interest.

gpalmer-latai avatar Nov 14 '23 14:11 gpalmer-latai

@gpalmer-latai it seems you want to use this for GPU computation. I have no experience with GPUs and would need to ask e.g. @elfenpiff, but I'm wondering if your goals could also be achieved by extending the publisher and subscriber, e.g. by adding hooks like on_construction, on_destruction, on_pre_publish, on_post_publish, and so on. I have to think through whether this is easily feasible.

I think alternative 1 will be quite a lot of work. Funnily enough, this is exactly the approach we are taking with the Rust implementation: there is exactly one memory pool per service. Having something like that in iceoryx would also be possible but is definitely nothing that can be achieved in the short term.

Regarding option 2: with the introspection you might be able to find a memory pool for a specific service, but for that you would also need to know under which user the process with the publisher/subscriber was started, and you won't be able to acquire the base pointer. What exactly do you hope to achieve with the information from the introspection?

In the short term something like your proof of concept should be possible, maybe with a more constrained API, but that is a detail. My preferred solution would be a refactoring of the posh layer similar to the Rust implementation, where we experimented with these ideas and have good experience.

elBoberido avatar Nov 14 '23 22:11 elBoberido

Having on_construction instead of exposing getMemPoolInfo would work nicely for me, yes. That only really avoids exposing the last layer of lifting I did (and of course one could always pass a lambda that lifts the info out themselves).

To give you a bit more context on my use case: I'm working on a platform with shared graphics memory, meaning the CPU and GPU can access the same memory. But this doesn't work out of the box - you first have to call an NVIDIA API, specifically cudaHostRegister. It takes a void*, a size, and some flags, and effectively "pins" the memory - page-locking it. I'm fuzzy on the details beyond that, but I believe it has implications for the TLB and associated memory hardware. The reason I want to restrict the range of memory I call it on as much as possible is, quoting from the documentation:

Page-locking excessive amounts of memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to register staging areas for data exchange between host and device.

Furthermore, I cannot simply call this API every time I receive a message sample, because it is a high-latency operation. Ideally I want to make all the appropriate calls once, early on, when initializing the system.
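For reference, this is roughly what the registration looks like on the CUDA side (cudaHostRegister and cudaHostUnregister are the actual CUDA runtime calls; where basePtr and poolSize come from is exactly what this feature request is about):

```cpp
#include <cstdio>
#include <cuda_runtime_api.h>

// Pin an already-allocated shared memory region so the GPU can access it
// directly. Should be called once per region, early during initialization.
bool pinMemoryPool(void* basePtr, size_t poolSize)
{
    cudaError_t result = cudaHostRegister(basePtr, poolSize, cudaHostRegisterDefault);
    if (result != cudaSuccess)
    {
        std::fprintf(stderr, "cudaHostRegister failed: %s\n", cudaGetErrorString(result));
        return false;
    }
    return true; // undo with cudaHostUnregister(basePtr) during teardown
}
```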

Anyway, my plan is to get a short-term solution working for my needs, and I would be very happy if we could work out a reasonable way to do this upstream. I also agree with you that the proper long-term solution is what you have already described in the Rust implementation. I'd be interested in having a lengthier discussion sometime about the work involved in bringing that to iceoryx (or adopting the Rust implementation). I might be able to avoid needing it now, but I suspect I will stumble upon more use cases/requirements where I will be really sad not to have this level of control.

gpalmer-latai avatar Nov 15 '23 17:11 gpalmer-latai

There should be a way forward to bring this upstream. I have to take counsel with my pillow and think a little more about the best short-term approach, especially what implications it has for the untyped API. Would you contribute the patch once we agree on the API?

Once the announcement for the Rust implementation is out we can have a deep dive. Either as part of a developer meetup or a separate meeting. I guess @elfenpiff is also eager to participate.

elBoberido avatar Nov 15 '23 23:11 elBoberido

Yeah I should have some latitude to work on the patch.

gpalmer-latai avatar Nov 16 '23 03:11 gpalmer-latai

Yeah I should have some latitude to work on the patch.

Hahaha, great pun :)

elBoberido avatar Nov 16 '23 12:11 elBoberido

I had a few thoughts on this topic and currently I favor continuing with the idea we had when PortConfigInfo was introduced. All of this could be done in the publisher. Just for clarification: do you need to call cudaHostRegister on the subscriber side as well, or only on the publisher side?

elBoberido avatar Nov 19 '23 22:11 elBoberido

I'd need the ability to map memory pools to be able to process messages as both input and output to GPU-accelerated algorithms. So yes, the subscriber too.

With the PortConfigInfo you could specify a different memory type for a port, but then how does this hook in to support different hardware? I presume iceoryx isn't going to pull in NVIDIA dependencies, so you'd need the ability for the user to register behavior globally for different memory types.

gpalmer-latai avatar Nov 19 '23 22:11 gpalmer-latai

I have to ask the author of that code what the exact idea behind PortConfigInfo was, but I thought about either having a compile-time switch to add the CUDA dependency, or maybe a mechanism similar to the platform layer, where one can override the path to some hooks.

elBoberido avatar Nov 19 '23 23:11 elBoberido

I like the idea of a compile-time switch, possibly coupled with some Bazel select statements. Then just have a compile-time path that either calls the CUDA API or returns an error when a specific memory type is specified in the port config.
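A minimal sketch of what I have in mind, assuming a hypothetical ICEORYX_CUDA_SUPPORT define and an illustrative memory type constant:

```cpp
#include <cstddef>
#include <cstdint>

constexpr uint32_t PINNED_HOST_MEMORY{1U}; // illustrative memory type value

#if defined(ICEORYX_CUDA_SUPPORT)
#include <cuda_runtime_api.h>

// CUDA available: pin the memory when a port asks for it
bool setupMemoryForPort(void* basePtr, size_t size, uint32_t memoryType)
{
    if (memoryType == PINNED_HOST_MEMORY)
    {
        return cudaHostRegister(basePtr, size, cudaHostRegisterDefault) == cudaSuccess;
    }
    return true; // default CPU memory needs no setup
}
#else
// CUDA not compiled in: requesting pinned host memory is an error
bool setupMemoryForPort(void*, size_t, uint32_t memoryType)
{
    return memoryType != PINNED_HOST_MEMORY;
}
#endif
```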

gpalmer-latai avatar Nov 20 '23 00:11 gpalmer-latai

One thing to note, though, is that it appears the port config / memory_info applies to the entire memory segment, and it is currently not possible to configure multiple segments differing only in what type of memory they support.

We'd probably want to extend the RouDi config so that a publisher with a port config specifying "pinned host memory" would be told to use the corresponding segment.

gpalmer-latai avatar Nov 20 '23 00:11 gpalmer-latai

That's basically the idea. The end user would still not have easy access to the mempool and therefore cannot easily tamper with it, but it gives other projects the option to extend the framework.

Yes, mid to long term the config should be extended. How to do it exactly might be something to discuss in a meetup, maybe together with the discussion on how to bring the improvements of the Rust-based implementation to the C++ implementation.

elBoberido avatar Nov 20 '23 14:11 elBoberido

Yes, mid to long term the config should be extended

Actually, I meant in the short/medium term. I'm really trying to avoid pinning the entire shared memory segment used for all communication just because one publisher wants to use GPU acceleration when publishing messages to one specific mempool.

If the effect of specifying memoryType as something signifying pinned_host_memory is that we call cudaHostRegister when constructing the mepoo_segment ~~somewhere here~~, then I'd be forced to do something like put all "GPU publishers/subscribers" into separate Linux groups just to get them to use the isolated segment. And I'd like to avoid conflating the Linux group / access control use case with the memory allocation/access one.

Edit: What I linked is the code path for RouDi setting up the shared memory in the first place. I meant to find the code path where the clients receive the info from RouDi.

Second Edit: The publisher (for example) gets this information here. It then has access through port()->getMembers()->m_chunkSenderData.m_memoryInfo. However, I'm not sure whether any logic currently exists to do some sort of "post init" to react to that information.

gpalmer-latai avatar Nov 20 '23 15:11 gpalmer-latai

What might be interesting to consider as well is adding an entry to both MePooConfig and MemPoolInfo.

This would allow configuration at the memory pool level, which fits my needs. Then we can see about adding some sort of "post_init" logic that is called after getMiddlewarePublisher to do setup on the client side. This boils down to the MemoryManager going through all of the memory pools and, for each one with a pinned_host MemoryType, calling cudaHostRegister on the memory region.
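As a sketch of that "post_init" pass (the memoryType field on the pool info and the registration call site are assumptions on top of the proof of concept; getNumberOfMemPools and getMemPoolInfo are the existing accessors used by the introspection):

```cpp
// Hypothetical client-side pass over all pools of the segment; runs once
// after getMiddlewarePublisher. Only pools marked as pinned host memory are
// registered, not the whole segment.
void registerPinnedPools(iox::mepoo::MemoryManager& memoryManager)
{
    for (uint32_t i = 0; i < memoryManager.getNumberOfMemPools(); ++i)
    {
        auto info = memoryManager.getMemPoolInfo(i);
        if (info.memoryType == PINNED_HOST_MEMORY) // assumed new field
        {
            auto poolSize = static_cast<size_t>(info.m_numChunks) * info.m_chunkSize;
            cudaHostRegister(info.m_basePtr, poolSize, cudaHostRegisterDefault);
        }
    }
}
```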

And this is something that could be added to the RouDi config TOML as well at some point, though it is also immediately available through manual RouDi configuration.

gpalmer-latai avatar Nov 20 '23 15:11 gpalmer-latai

There is still a bit of awkwardness to all of these approaches: although a segment or a memory pool might be configured as supporting "pinned host memory", it is ultimately up to the publishers or subscribers to decide whether they want to use that feature. Consider the cases where the publisher is a normal CPU publisher or the subscriber is a normal CPU subscriber - in those processes you wouldn't want to call cudaHostRegister at all.

So I guess what you would do to handle this is combine the MemPoolInfo (or segment info) with the PortConfigInfo to get logic like:

// MemoryManager::getChunk() - only take a chunk from a mempool whose memory
// type matches the type requested by the port
for (auto& memPool : m_memPoolVector)
{
    uint32_t chunkSizeOfMemPool = memPool.getChunkSize();
    if (chunkSizeOfMemPool >= requiredChunkSize)
    {
        if (portConfigInfo.memoryInfo.memoryType == PINNED_HOST_MEMORY)
        {
            if (memPool.getInfo()->memoryType == PINNED_HOST_MEMORY)
            {
                chunk = memPool.getChunk();
                memPoolPointer = &memPool;
                acquiredChunkSize = chunkSizeOfMemPool;
                break;
            }
...

Note that this code path runs when you retrieve a chunk, which happens much later than where you would want to call cudaHostRegister on the appropriate segment/pool.

gpalmer-latai avatar Nov 20 '23 15:11 gpalmer-latai

Maybe there is a misunderstanding. What I meant was to register only the mempool the publisher and subscriber are using, not the whole segment. As long as the memory is shared between CPU and GPU this should be sufficient, and RouDi does not need to do any special handling. But I'm no GPU expert, so this might be a bit naive.

So with that assumption I thought extending the config would not be required in the short term, but only in the mid to long term.

We already have a concept of a shared memory provider, although it is only used and tested with CPU memory. This could be used to extend the memory types. The idea was to have this at segment level, not at mempool level. It has been quite a while since we discussed this topic, though, and some of our assumptions might not hold anymore.

elBoberido avatar Nov 20 '23 15:11 elBoberido

Yes, the publisher and subscriber would opt in by specifying the desired memory type. I think that should be more or less straightforward. What might be more complicated is a mixed memory mode where some subscribers work on CPU memory and others work on GPU memory. To solve this efficiently we might need a whiteboard session.

elBoberido avatar Nov 20 '23 15:11 elBoberido

Oh, and @elfenpiff already had some ideas on how to solve this efficiently about 4 years ago, but there was never time to implement them. Maybe the time has come now ... or at least in the not so distant future :smile:

elBoberido avatar Nov 20 '23 15:11 elBoberido

Okay, I think I follow now. I guess all one needs is to specify something in the PortConfigInfo and have it consumed to make the cudaHostRegister call 🙂 You mentioned one complication earlier though - what about untyped publishers/subscribers?

Do we simply not support those initially with this feature, since we do not know in advance which memory pools they will request?

Related to the above - I've noticed that we appear to do a linear search for the matching memory pool every time we get a chunk. Would it make sense to optimize this away for typed endpoints, in conjunction with the memory logic that is applied only to the corresponding memory pools? (In general we could also consider a binary search.)

gpalmer-latai avatar Nov 20 '23 15:11 gpalmer-latai

Note that there is an approach where you defer until getChunk and do something like cuda_pinned_host_registry.ensure_pinned(memPool), making the call to cudaHostRegister the first time a chunk from a MemPool is requested. This would work for untyped endpoints too.

It's a bit problematic, though, because the first publish/subscribe for a particular message size will incur a lot of latency. You could mitigate this by "priming the pump": calling loan_message for all the sizes you care about on system startup, dropping the samples, and then proceeding with business as usual.
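A sketch of what that registry could look like, process-local and idempotent (all names are made up; the caching also avoids the error CUDA reports when the same region is registered twice):

```cpp
#include <cstddef>
#include <cuda_runtime_api.h>
#include <mutex>
#include <set>

// Process-local registry: calls cudaHostRegister at most once per pool base
// address, so the latency is only paid on first contact with a mempool.
class CudaPinnedHostRegistry
{
  public:
    bool ensurePinned(void* basePtr, size_t size)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_pinned.count(basePtr) != 0)
        {
            return true; // already registered, nothing to do
        }
        if (cudaHostRegister(basePtr, size, cudaHostRegisterDefault) != cudaSuccess)
        {
            return false;
        }
        m_pinned.insert(basePtr);
        return true;
    }

  private:
    std::mutex m_mutex;
    std::set<void*> m_pinned;
};
```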

gpalmer-latai avatar Nov 20 '23 15:11 gpalmer-latai

For the untyped publisher and subscriber we have two options (three after reading your latest response):

  1. do not support it initially
  2. add an API call like likely_used_memory_sizes(iox::vector<uint64_t, ???> list_of_memory_sizes)
  3. like 2, but on loan/getChunk; might be a fallback for 2, with a log message warning about the new registration

elBoberido avatar Nov 20 '23 15:11 elBoberido

Oh, the typed API would then just call likely_used_memory_sizes on the port level

elBoberido avatar Nov 20 '23 15:11 elBoberido

That sounds great to me. Let me see if I can summarize:

  1. Add some memoryType constants somewhere; these will be used to populate the memoryType field of the MemoryInfo struct member of PortConfigInfo.
  2. Expose in the publisher and subscriber APIs the ability to pass in a non-default PortConfigInfo.
  3. For untyped endpoints, also extend the API to take a likely_used_memory_sizes vector on construction.
  4. For typed endpoints, automatically supply that information based on the type.
  5. Add a compile-time flag indicating whether the cuda_runtime_api is available.
  6. Create a global repository, similar to the pointer_repository, to track registration calls (we may have multiple endpoints requesting that the same memory region be pinned; CUDA sets an error code the second time you try to register a region, so it is better to wrap such calls and cache the information - also for latency reasons).
  7. In the MemoryManager, extend the API to consume the likely_used_memory_sizes. This might be used in general for some optimization in getChunk; additionally, if the compile-time flag is active, it is used to call cudaHostRegister on all appropriate memory pools.
  8. Error out early if the port config info requests pinned host memory but the compile-time flag is not active.
  9. When calling getChunk: if the requested size matches any of the likely_used_memory_sizes, we simply return a chunk from the matching memory pool. Otherwise we do the normal linear search and additionally make the cudaHostRegister call if the PortConfigInfo requests it and the compile-time flag is active.
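To make the intended usage concrete, construction could end up looking roughly like this. All of it is speculative: the extra constructor parameters don't exist yet and RadarScan is a stand-in payload type:

```cpp
// opt in to pinned host memory for this port
iox::runtime::PortConfigInfo configInfo;
configInfo.memoryInfo.memoryType = PINNED_HOST_MEMORY;

// typed endpoint: the message size is known from the type, so the matching
// mempool can be registered at construction time (points 4, 7, 9)
iox::popo::Publisher<RadarScan> publisher({"Radar", "FrontLeft", "Scan"}, {}, configInfo);

// untyped endpoint: pass the likely sizes up front so registration does not
// happen lazily on the first loan (point 3)
iox::vector<uint64_t, 8> likelySizes;
likelySizes.emplace_back(sizeof(RadarScan));
iox::popo::UntypedPublisher untypedPublisher({"Radar", "FrontLeft", "Scan"}, {}, configInfo, likelySizes);
```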

gpalmer-latai avatar Nov 20 '23 16:11 gpalmer-latai

Here are some remarks

  1. We should probably reserve a range for a curated list of memory types, similar to what IANA does
  2. Ideally with this we could introduce a builder-like API which gives us the opportunity to return an iox::expected and leave it to the user to decide what happens when there are no resources available. The builder could also be tied to the runtime, or at least have the option to ask a non-default runtime for a port. There is the idea of having multiple RouDi instances running in parallel which do not interfere with each other. In order to write gateways between these instances one would need multiple runtimes. This is something for the far future, and maybe we can get rid of RouDi in the C++ implementation, but having an API which considers this option would be nice.
  3. :+1: we might also end up just having multiple calls to likely_used_memory_size. A detail which can be decided later on, depending on what feels better to use
  4. :+1:
  5. It would probably be similar to the mechanism we currently use for a custom platform layer. By default only CPU memory would be supported and the user could choose other options. CUDA might be an option which could be upstreamed
  6. Yes, this must be tracked for each process and cannot be stored in the mempool itself
  7. I'm not quite sure I understand this correctly. My idea was to call cudaHostRegister in the PublisherPortUser, not in the MemoryManager. The MemoryManager needs the interface from your proof of concept to return the address and size of the mempool for a specific chunk size. The getChunk method could take an additional parameter with an index hint, e.g. index_of_last_used_mempool (see the sketch after this list). If the size does not match the mempool at that index, the current search could be performed, or alternatively a binary search. The caching of the known indices could also be done in the PublisherPortUser, since the MemoryManager resides in the shared memory and we would need to hold a structure for (IOX_MAX_PUBLISHER + IOX_MAX_SUBSCRIBERS) * 2 entries (aaaargh, I used IOX_MAX_PUBLISHER to define MAX_SERVERS ... facepalm :laughing:)
  8. :+1:
  9. I think this is also something for the PublisherPortUser, since the repository for cudaHostRegister would live in process-local memory and not in the shared memory ... but maybe I overlooked something
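Following up on remark 7, a rough sketch of the index hint. Everything here is illustrative; getChunkWithHint is a made-up name and the real getChunk has a different signature:

```cpp
// Sketch only: getChunk variant taking a hint with the index of the mempool
// that served the last request. The caller (e.g. the PublisherPortUser)
// caches the index in process-local memory.
void* MemoryManager::getChunkWithHint(uint64_t requiredChunkSize, uint32_t& lastUsedPoolIndex) noexcept
{
    // fast path: the hinted pool still fits the requested size
    // (a real implementation would have to ensure the hint does not skip a
    // better-fitting smaller pool when request sizes vary)
    if (lastUsedPoolIndex < m_memPoolVector.size()
        && m_memPoolVector[lastUsedPoolIndex].getChunkSize() >= requiredChunkSize)
    {
        return m_memPoolVector[lastUsedPoolIndex].getChunk();
    }

    // slow path: the current linear search (or a binary search, since the
    // pools are ordered by chunk size)
    for (uint32_t i = 0; i < static_cast<uint32_t>(m_memPoolVector.size()); ++i)
    {
        if (m_memPoolVector[i].getChunkSize() >= requiredChunkSize)
        {
            lastUsedPoolIndex = i;
            return m_memPoolVector[i].getChunk();
        }
    }
    return nullptr;
}
```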

elBoberido avatar Nov 20 '23 17:11 elBoberido

I think you are right about the PublisherPortUser owning a lot of the logic 👍

gpalmer-latai avatar Nov 20 '23 18:11 gpalmer-latai

We should probably reserve a range for a curated list of memory types, similar to what IANA does

That sounds reasonable. I'm probably not knowledgeable enough to quickly come up with something myself. Perhaps we can just reserve a range of values for now, so that users implementing their own logic downstream do so with a value from that range.

gpalmer-latai avatar Nov 20 '23 19:11 gpalmer-latai

That sounds reasonable. I'm probably not knowledgeable enough to quickly come up with something myself. Perhaps we can just reserve a range of values for now, so that users implementing their own logic downstream do so with a value from that range.

Exactly. I've thought about something similar to Qt::UserRole, which is the offset for defining your own roles in Qt's model-view framework. We just need to find a good name and set it to e.g. 0x100. This should leave us a range which is big enough for whatever we want to use. Downstream can then request to register a specific value directly in iceoryx, and after a two-week discussion on Oʻahu (downstream will take care of the hotel and travel costs) the team will vote on the inclusion of the new value.

elBoberido avatar Nov 20 '23 20:11 elBoberido

... aaargh, too much Qt coding :laughing: ... we can just define an enum to distinguish between user-defined values and the ones defined by iceoryx
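Something along these lines, mirroring the Qt::UserRole idea (names and the offset value are placeholders):

```cpp
#include <cstdint>

// Values below USER_DEFINED are reserved for the curated iceoryx list;
// downstream projects pick values at or above the offset.
enum class MemoryType : uint32_t
{
    DEFAULT = 0U,
    PINNED_HOST_MEMORY = 1U,
    // ... further curated values ...
    USER_DEFINED = 0x100U
};

// downstream usage, no Oʻahu meeting required
constexpr auto MY_VENDOR_MEMORY =
    static_cast<MemoryType>(static_cast<uint32_t>(MemoryType::USER_DEFINED) + 1U);
```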

elBoberido avatar Nov 20 '23 20:11 elBoberido

Ideally with this we could introduce a builder-like API which gives us the opportunity to return an iox::expected and leave it to the user to decide what happens when there are no resources available. The builder could also be tied to the runtime, or at least have the option to ask a non-default runtime for a port. There is the idea of having multiple RouDi instances running in parallel which do not interfere with each other. In order to write gateways between these instances one would need multiple runtimes. This is something for the far future, and maybe we can get rid of RouDi in the C++ implementation, but having an API which considers this option would be nice.

So is the idea that the builder is passed as an argument, like a PortConfigInfoBuilder the publisher calls create() on? Or is this pattern more of a convenient way to get an object with reasonable defaults and the ability to override some if necessary?

When you talk about tying the builder to the runtime, I'm guessing you mean that instead of

m_port(iox::runtime::PoshRuntime::getInstance().getMiddlewarePublisher(service, publisherOptions))

You'd have something like

m_port(portInfo.getRuntime().getMiddlewarePublisher(service, publisherOptions, portInfo))

gpalmer-latai avatar Nov 20 '23 22:11 gpalmer-latai

Ah no, I think what you mean is that the builder pattern lets you return an error if the parameters aren't supported, right? It's a way of solving the classic "constructors don't return anything, so you have to throw an exception" problem.

gpalmer-latai avatar Nov 20 '23 22:11 gpalmer-latai

Ah no, I think what you mean is that the builder pattern lets you return an error if the parameters aren't supported, right? It's a way of solving the classic "constructors don't return anything, so you have to throw an exception" problem.

Exactly. We don't use exceptions, and when something goes horribly wrong in the ctor we currently terminate. It would be better to use the factory method pattern or the builder pattern and return an expected<Publisher>.
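A sketch of how that could look from the user side (PublisherBuilder and PublisherError are hypothetical names; RadarScan is the stand-in payload type from above):

```cpp
// builder instead of a terminating ctor; errors become values
auto result = PublisherBuilder<RadarScan>({"Radar", "FrontLeft", "Scan"})
                  .memoryType(MemoryType::PINNED_HOST_MEMORY)
                  .create(); // -> iox::expected<Publisher<RadarScan>, PublisherError>

if (result.has_error())
{
    // the user decides how to react, e.g. fall back to plain CPU memory
}
auto& publisher = result.value();
```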

elBoberido avatar Nov 20 '23 22:11 elBoberido