# Expose the MemPoolInfo of matching memory pools to typed endpoints
### Brief feature description
I would like the ability to somehow query the info of a particular memory pool that will be used for a comms endpoint's messages.
### Detailed information
A full proof of concept is linked below:
https://github.com/eclipse-iceoryx/iceoryx/compare/master...gpalmer-latai:iceoryx:gpalmer/expose_mempool
The basic gist is that I've added an `m_basePtr` to the `MemPoolInfo` struct to inform consumers where the actual memory pool is located. Then I've exposed some APIs to lift that information all the way out to a typed publisher.
### Alternatives Considered
Instead of lifting the memory pool information out through the layers of port info, it might be nicer if we could:
- Have more control over which comms endpoints use which memory pools, e.g. by adding service-info constraints to the memory pool entries in the RouDi config (only these publishers/subscribers use this memory pool), as well as "memory constraints" (the memory type of this pool is "pinned").
- Query the runtime/RouDi directly for information about a particular memory pool. I know there is some infrastructure here for introspection, but I'm not sure how I could use that to find the particular memory pool of interest.
@gpalmer-latai it seems you want to use this for GPU computation. I've no experience with GPUs and would need to ask e.g. @elfenpiff, but I'm wondering if your goals could also be achieved by extending the publisher and subscriber, e.g. by adding hooks like `on_construction`, `on_destruction`, `on_pre_publish`, `on_post_publish`, and so on. I have to think through whether this is easily feasible.
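To make that concrete, here is a minimal sketch of what such hooks might look like; the `PublisherHooks` struct and the hook signatures are assumptions for illustration, not existing iceoryx API:

```cpp
#include <cstddef>
#include <functional>

// Hypothetical sketch only: none of these hooks exist in iceoryx today.
// The idea is to let a user attach callbacks at well-defined points in the
// publisher's lifecycle, e.g. to pin memory once at construction time.
struct PublisherHooks
{
    // Called once after the port is created, with the base address and size
    // of the memory region backing the publisher's samples.
    std::function<void(void* poolBase, std::size_t poolSize)> on_construction;

    // Called around every publish; left unused in this sketch.
    std::function<void()> on_pre_publish;
    std::function<void()> on_post_publish;

    // Called once before the port is destroyed.
    std::function<void()> on_destruction;
};
```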
I think alternative 1 will be quite a lot of work. Funnily enough, this is exactly the approach we are taking with the Rust implementation: there is exactly one memory pool per service. Having something like that in iceoryx would also be possible, but it is definitely nothing that can be achieved in the short term.
Regarding option 2: with the introspection you might be able to find a memory pool for a specific service, but for that you would also need to know under which user the process with the publisher/subscriber was started, and you won't be able to acquire the base pointer. What exactly do you hope to achieve with the information from the introspection?
In the short term, something like your proof of concept should be possible, maybe with a more constrained API, but that is more of a detail. My preferred solution would be a refactoring of the posh layer similar to the Rust implementation, where we experimented with these ideas and have had good experience.
Having `on_construction` instead of exposing `getMemPoolInfo` would work nicely for me, yes. That only really avoids exposing the last layer of lifting I did (and of course one could always pass a lambda that lifts the info out themselves).
To give you a bit more context on my use case: I'm working on a platform with shared graphics memory. That means the CPU and GPU can access the same memory. But this doesn't just work out of the box; you have to first call an NVIDIA API, specifically `cudaHostRegister`. This API takes a simple `void*`, a size, and some flags, and what it effectively does is "pin" the memory, page-locking it. I'm fuzzy on the details beyond that, but I believe this has implications for the TLB and associated memory hardware, and the reason I want to restrict the range of memory I call this on as much as possible is, quoting from the documentation:
> Page-locking excessive amounts of memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to register staging areas for data exchange between host and device.
Furthermore, I cannot simply call this API every time I receive a message sample because it is a high-latency operation, so ideally I want to make all the appropriate calls once, early on, when initializing the system.
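For reference, here is a minimal sketch of the call itself, assuming the base pointer and size of the relevant mempool are already known (e.g. obtained through the proof-of-concept API above); the `pinPoolMemory` helper is just illustration:

```cpp
#include <cstddef>
#include <cstdio>
#include <cuda_runtime_api.h>

// Pin (page-lock) a memory region once at startup so the GPU can access it
// directly. 'poolBase' and 'poolSize' stand in for the base pointer and size
// of the mempool obtained from iceoryx.
bool pinPoolMemory(void* poolBase, std::size_t poolSize)
{
    // cudaHostRegisterDefault is used here; other flags such as
    // cudaHostRegisterMapped or cudaHostRegisterPortable may be appropriate
    // depending on the platform.
    cudaError_t rc = cudaHostRegister(poolBase, poolSize, cudaHostRegisterDefault);
    if (rc != cudaSuccess)
    {
        std::fprintf(stderr, "cudaHostRegister failed: %s\n", cudaGetErrorString(rc));
        return false;
    }
    return true;
}
```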
Anyway, my plan is to get a short-term solution working for my needs, and I would be very happy if we could work out a reasonable way to do this upstream here. But I also agree with you that the proper long-term solution is what you have already described in the Rust implementation. I'd be interested in having a lengthier discussion sometime about the work involved in bringing that to iceoryx (or adopting the Rust implementation). I might be able to avoid needing it now, but I suspect there will be more use cases/requirements I stumble upon where I will be really sad not to have this level of control.
There should be a way forward to bring this upstream. I think I have to take counsel with my pillow and think a little bit more about the best short-term approach, especially what implications it has for the untyped API. Would you contribute the patch once we have agreed on the API?
Once the announcement for the Rust implementation is out we can have a deep dive. Either as part of a developer meetup or a separate meeting. I guess @elfenpiff is also eager to participate.
Yeah I should have some latitude to work on the patch.
> Yeah I should have some latitude to work on the patch.
Hahaha, great pun :)
I had a few thoughts on this topic, and currently I favor continuing with the idea we had when `PortConfigInfo` was introduced. All of this could be done in the publisher. Just for clarification: do you need to call `cudaHostRegister` on the subscriber side as well, or only on the publisher side?
I'd need the ability to map memory pools to be able to process messages as both input and output to GPU-accelerated algorithms. So yes, the subscriber too.
With the `PortConfigInfo` you could specify a different memory type for a port, but then how does this hook in to facilitate different hardware? I presume iceoryx isn't going to pull in NVIDIA dependencies, so you'd need something like the ability for the user to register behavior globally for different memory types.
I have to ask the author of that code what the exact idea behind `PortConfigInfo` was, but I thought about either having a compile-time switch to add the CUDA dependency, or maybe a mechanism similar to the platform layer, where one can override the path to some hooks.
I like the idea of a compile-time switch, possibly coupled with some Bazel select statements. Then just have a compile-time path that either calls the CUDA API or returns an error when a specific memory type is specified in the port config.
One thing to note, though, is that the port config / memory_info appears to apply to the entire memory segment, and it is currently not possible to configure multiple segments differing only in what type of memory they support. We'd probably want to extend the RouDi config so that a publisher with a port config specifying "pinned host memory" would be told to use the corresponding segment.
That's basically the idea. The end user would still not have easy access to the mempool and therefore also cannot easily tamper with it but it gives other projects the option to extend the framework.
Yes, mid to long term the config should be extended. How to do it exactly might be something to discuss in a meetup. Maybe together with the discussion on how to bring the improvements of the Rust based implementation to the C++ implementation.
> Yes, mid to long term the config should be extended
Actually I mean in the short/medium term. I'm really trying to avoid pinning the entire shared memory segment used for all communications just because one publisher wants to use GPU acceleration when publishing messages in one specific mempool.
If the effect of specifying `memoryType` as something signifying `pinned_host_memory` is that we decide to call `cudaHostRegister` when constructing the `mepoo_segment` ~~somewhere here~~, then I'd be forced to do something like put all "GPU publishers/subscribers" in separate Linux groups just to get them to use the isolated segment. And I'd like to avoid conflating the Linux group / access-control use case with the memory allocation/access one.
Edit: What I linked is the code path for RouDi setting up the shared memory in the first place. I meant to find the code path where the clients receive the info from RouDi.
Second edit: The publisher (for example) gets this information here. It then has access through `port()->getMembers()->m_chunkSenderData.m_memoryInfo`. However, I'm not sure if any logic currently exists to do some sort of "post init" to react to that information.
What might be interesting to consider as well is adding an entry to both `MePooConfig` and `MemPoolInfo`. This would allow configuration at the memory pool level, which fits my needs. Then we can look at adding some sort of "post_init" logic that is called after `getMiddlewarePublisher` to do setup on the client side. This boils down to the `MemoryManager` going through all of the memory pools and, for each one with a `pinned_host` `MemoryType`, calling `cudaHostRegister` on the memory region.
And this is something that could be added to the RouDi TOML config as well at some point, though it is also immediately available through manual RouDi configuration.
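As a rough sketch of what such a per-pool entry could look like; the `MemoryType` enum and the extra field are assumptions about a possible extension, not current iceoryx code (today a mempool entry only carries a chunk size and a chunk count):

```cpp
#include <cstdint>

// Hypothetical extension, not current iceoryx code. The idea is to tag each
// mempool with the kind of memory it should be treated as, in addition to the
// existing chunk size and chunk count.
enum class MemoryType : std::uint32_t
{
    CPU_DEFAULT = 0,
    PINNED_HOST = 1,
};

struct MePooConfigEntry
{
    std::uint32_t m_size{0};                          // chunk payload size in bytes
    std::uint32_t m_chunkCount{0};                    // number of chunks in this pool
    MemoryType m_memoryType{MemoryType::CPU_DEFAULT}; // proposed new field
};
```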
There is still a bit of awkwardness to all of these approaches in that, although a segment or a memory pool might be configured as supporting "pinned host memory", it is ultimately up to the publishers or subscribers to decide whether they want to use that feature. You have to consider that there are cases where the publisher is a normal CPU publisher or the subscriber is a normal CPU subscriber. In those processes you wouldn't really want to call `cudaHostRegister` at all.
So I guess what you would do to handle this is to combine the `MemPoolInfo` (or segment info) with the `PortConfigInfo` to have some logic like:
```cpp
// MemoryManager.getChunk()
for (auto& memPool : m_memPoolVector)
{
    uint32_t chunkSizeOfMemPool = memPool.getChunkSize();
    if (chunkSizeOfMemPool >= requiredChunkSize)
    {
        if (portConfigInfo.memoryInfo.memoryType == PINNED_HOST_MEMORY)
        {
            if (memPool.getInfo()->memoryType == PINNED_HOST_MEMORY)
            {
                chunk = memPool.getChunk();
                memPoolPointer = &memPool;
                aquiredChunkSize = chunkSizeOfMemPool;
                break;
            }
            // ...
```
Note that this code path is hit when retrieving a chunk, which happens much later than where you'd want to call `cudaHostRegister` on the appropriate segment/pool.
Maybe there is a misunderstanding. What I meant was to only register the mempool the publisher and subscriber are using, and not the whole segment. As long as the memory is shared between CPU and GPU, this should be sufficient and RouDi does not need to do any special handling. But I'm no GPU expert, so this might be a bit naive.
So with that assumption I thought extending the config would not be required in the short term but only in the mid to long term.
We already have a concept of a shared memory provider, although it is only used and tested with CPU memory. This could be used to extend the memory types. The idea was to have this at segment level, not at mempool level. It has been quite a while since we discussed this topic, though, and some of our assumptions might not hold anymore.
Yes, the publisher and subscriber would opt in by specifying the desired memory type. I think that should be more or less straightforward. What might be more complicated is a mixed memory mode where some subscribers work on CPU memory and others work on GPU memory. To solve this in an efficient way, we might need a whiteboard session.
Oh, and @elfenpiff already had some ideas on how to solve this efficiently about 4 years ago, but there was never time to implement them. Maybe the time has come now ... or at least in the not so distant future :smile:
Okay, I think I follow now. I guess all one needs is to specify something in the `portConfigInfo` and have that consumed to make the `cudaHostRegister` call 🙂 You mentioned one complication earlier though: what about untyped publishers/subscribers?
Do we simply not support those initially with this feature, since we do not know in advance which memory pools they will request?
Related to the above: I've noticed that we appear to always do a linear search for the matching memory pool every time we get a chunk. Would it make sense to optimize this out for typed endpoints, in conjunction with the memory logic that is applied only to the corresponding memory pools? (Also, in general, we could consider doing a binary search.)
Note that there is an approach where you defer until `getChunk` to do something like `cuda_pinned_host_registry.ensure_pinned(memPool)`, making the call to `cudaHostRegister` the first time a chunk from a MemPool is requested. This would work for untyped endpoints too.
It's a bit problematic though, because the first publish/subscribe for a particular message size will incur a lot of latency. You could mitigate this by "priming the pump": calling `loan_message` for all the sizes you care about at system startup, dropping them, and then proceeding with business as usual.
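For concreteness, a sketch of what such a process-local registry could look like; the class and `ensure_pinned` are hypothetical (only `cudaHostRegister` itself is real CUDA API), and a real version would also want to unregister on teardown:

```cpp
#include <cstddef>
#include <mutex>
#include <set>
#include <cuda_runtime_api.h>

// Hypothetical process-local cache so that cudaHostRegister is called at most
// once per mempool, even when several endpoints end up using the same pool.
class CudaPinnedHostRegistry
{
  public:
    // Pins [base, base + size) the first time it is seen; later calls are no-ops.
    bool ensure_pinned(void* base, std::size_t size)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_pinnedBases.count(base) != 0)
        {
            return true; // already registered; avoids the error CUDA reports on double registration
        }
        if (cudaHostRegister(base, size, cudaHostRegisterDefault) != cudaSuccess)
        {
            return false;
        }
        m_pinnedBases.insert(base);
        return true;
    }

  private:
    std::mutex m_mutex;
    std::set<void*> m_pinnedBases;
};
```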
For the untyped publisher and subscriber we have two options (three after I read your latest response):

1. do not support it initially
2. add an API call like `likely_used_memory_sizes(iox::vector<uint64_t, ???> list_of_memory_sizes)`
3. like 2, but on `loan`/`getChunk`; might be a fallback for 2 with a log message warning about the new registration
Oh, the typed API would then just call `likely_used_memory_sizes` at the port level.
That sounds great to me. So let me see if I can summarize:

1. Add some `memoryType` constants somewhere that will be used to populate this field in the `MemoryInfo` struct member of `PortConfigInfo`.
2. Expose in the publisher and subscriber APIs the ability to pass in a non-default `PortConfigInfo`.
3. For untyped endpoints, also extend the API to take a `likely_used_memory_sizes` vector on construction.
4. For typed endpoints, automatically supply that information based on the type.
5. Add some compile-time flag indicating whether the `cuda_runtime_api` is available (see the sketch after this list).
6. Create a global repository similar to the `pointer_repository` to track registration calls (we may have multiple endpoints requesting that the same memory region be pinned; CUDA will set an error code the second time you try to do this, so it is better to wrap such calls and cache the information, also for latency reasons).
7. In the `MemoryManager`, extend the API to consume the `likely_used_memory_sizes`. This might be used in general for some optimization in `getChunk`, but additionally, if the compile-time flag is active, it is used to call `cudaHostRegister` on all appropriate memory pools.
8. Error out early if the port config info requests pinned host memory but the compile-time flag is not active.
9. When calling `getChunk`: if the requested size matches any of the `likely_used_memory_sizes`, we simply return the matching memory pool. Otherwise we do the normal linear search and additionally make the `cudaHostRegister` call if the `PortConfigInfo` requests it and the compile-time flag is active.
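To make points 5 and 8 concrete, here is a sketch of the compile-time gate; the `IOX_HAS_CUDA_RUNTIME` macro and the result enum are made up for this example:

```cpp
#include <cstddef>
#if defined(IOX_HAS_CUDA_RUNTIME) // hypothetical flag set by the build system
#include <cuda_runtime_api.h>
#endif

enum class PinResult
{
    OK,
    PINNED_MEMORY_NOT_SUPPORTED, // built without CUDA support
    REGISTRATION_FAILED
};

// With CUDA available, perform the registration; without it, fail early when
// pinned host memory is requested instead of silently doing nothing.
inline PinResult registerPinnedHostMemory(void* base, std::size_t size)
{
#if defined(IOX_HAS_CUDA_RUNTIME)
    return (cudaHostRegister(base, size, cudaHostRegisterDefault) == cudaSuccess)
               ? PinResult::OK
               : PinResult::REGISTRATION_FAILED;
#else
    (void)base;
    (void)size;
    return PinResult::PINNED_MEMORY_NOT_SUPPORTED;
#endif
}
```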
Here are some remarks:

1. We should probably reserve a range for a curated list of memory types, similar to what IANA does.
2. Ideally, with this we could introduce a builder-like API, which gives us the opportunity to return an `iox::expected` and leave it to the user to decide what happens when there are no resources available. The builder could also be tied to the runtime, or at least have the option to ask the non-default runtime for a port. There is the idea of having multiple RouDi instances running in parallel which do not interfere with each other. In order to write gateways between these instances one would need multiple runtimes. This is something in the far future, and maybe we can get rid of RouDi in the C++ implementation, but having an API which considers this option would be nice.
3. :+1: We might also end up just having multiple calls to `likely_used_memory_size`. A detail which can be decided later on, depending on what feels better to use.
4. :+1:
5. It would probably be similar to the mechanism we currently use for a custom platform layer. By default only CPU memory would be supported and the user could choose multiple other options. CUDA might be an option which could be upstreamed.
6. Yes, this must be tracked for each process and cannot be stored in the mempool itself.
7. I'm not quite sure I understand this correctly. My idea was to call `cudaHostRegister` in the `PublisherPortUser`, not in the `MemoryManager`. The `MemoryManager` needs the interface from your proof of concept to return the address and size of the mempool for a specific chunk size. The `getChunk` method could get an additional parameter with a hint for the index, e.g. `index_of_last_used_mempool`. If the size does not match the mempool at that index, the current search could be performed, or alternatively a binary search (see the sketch after this list). The caching of the known indices could also be done in the `PublisherPortUser`, since the `MemoryManager` resides in the shared memory and we would need to hold a structure for `(IOX_MAX_PUBLISHER + IOX_MAX_SUBSCRIBERS) * 2` (aaaargh, I used IOX_MAX_PUBLISHER to define MAX_SERVERS ... facepalm :laughing:).
8. :+1:
9. I think this is also something for the `PublisherPortUser`, since the repository for `cudaHostRegister` would live in process-local memory and not in the shared memory ... but maybe I overlooked something.
I think you are right about the `PublisherPortUser` owning a lot of the logic 👍
> We should probably reserve a range for a curated list of memory types, similar to what IANA does
That sounds reasonable. I'm probably not knowledgeable enough to quickly come up with something myself. Perhaps we can just reserve a range of values for now, so that users implementing their own logic downstream do so with a value outside that range.
> That sounds reasonable. I'm probably not knowledgeable enough to quickly come up with something myself. Perhaps we can just reserve a range of values for now, so that users implementing their own logic downstream do so with a value outside that range.
Exactly. I've thought about something similar to `Qt::UserRole`, which is the offset for defining your own roles in Qt's model-view framework. We just need to find a good name and set it to e.g. `0x100`. This should leave us a range which is big enough for whatever we want to use. Downstream can then request to register a specific value directly in iceoryx, and after a two-week discussion on Oʻahu (downstream will take care of the hotel and traveling costs) the team will vote on the inclusion of the new value.
... aaargh, too much Qt coding :laughing: ... we can just define an enum to distinguish between user defined values and the ones defined by iceoryx
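Something along these lines, as a sketch; the names and the `0x100` offset are placeholders:

```cpp
#include <cstdint>

// Sketch of separating iceoryx-curated memory type values from user-defined
// ones, similar in spirit to Qt::UserRole. Names and offset are placeholders.
enum MemoryTypeValue : std::uint32_t
{
    DEFAULT_CPU_MEMORY = 0,
    // values below USER_DEFINED_MEMORY_TYPE are reserved for iceoryx itself
    USER_DEFINED_MEMORY_TYPE = 0x100 // downstream projects start from here
};
```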
> Ideally, with this we could introduce a builder-like API, which gives us the opportunity to return an `iox::expected` and leave it to the user to decide what happens when there are no resources available. The builder could also be tied to the runtime, or at least have the option to ask the non-default runtime for a port. There is the idea of having multiple RouDi instances running in parallel which do not interfere with each other. In order to write gateways between these instances one would need multiple runtimes. This is something in the far future, and maybe we can get rid of RouDi in the C++ implementation, but having an API which considers this option would be nice.
So is the idea that the builder is passed as an argument, like a `PortConfigInfoBuilder`, and the publisher calls `create()` on it? Or is this pattern more of a convenient way to get an object with reasonable defaults but the ability to override some if necessary?
When you talk about tying the builder to the runtime, I'm guessing you mean that instead of

```cpp
m_port(iox::runtime::PoshRuntime::getInstance().getMiddlewarePublisher(service, publisherOptions))
```

you'd have something like

```cpp
m_port(portInfo.getRuntime().getMiddlewarePublisher(service, publisherOptions, portInfo))
```
Ah no, I think what you mean is that the builder pattern lets you return an error if the parameters aren't supported, right? It's a way of solving the classic "Constructors don't return anything so you have to raise an exception" problem.
> Ah no, I think what you mean is that the builder pattern lets you return an error if the parameters aren't supported, right? It's a way of solving the classic "Constructors don't return anything so you have to raise an exception" problem.
Exactly. We don't use exceptions, and when something goes horribly wrong in the ctor we currently terminate. It would be better to use the factory method pattern or the builder pattern and return an `expected<Publisher>`.
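A rough sketch of the shape such a builder could take; `Publisher`, `PublisherBuilder`, and the error enum are illustrative, and the `Expected` alias merely stands in for `iox::expected`:

```cpp
#include <string>
#include <utility>
#include <variant>

// 'Expected' is just a stand-in for iox::expected in this sketch.
template <typename T, typename E>
using Expected = std::variant<T, E>;

enum class BuilderError
{
    PINNED_MEMORY_NOT_AVAILABLE
};

struct Publisher
{
    std::string service;
};

// Illustrative builder: construction can fail without exceptions, so create()
// returns a result instead of terminating in the constructor.
class PublisherBuilder
{
  public:
    explicit PublisherBuilder(std::string service)
        : m_service(std::move(service))
    {
    }

    PublisherBuilder& requirePinnedHostMemory(bool value)
    {
        m_requirePinnedHostMemory = value;
        return *this;
    }

    Expected<Publisher, BuilderError> create() const
    {
        // A real implementation would ask the (possibly non-default) runtime
        // for a port and fail if the requested memory type is unavailable.
        if (m_requirePinnedHostMemory && !pinnedMemorySupported())
        {
            return BuilderError::PINNED_MEMORY_NOT_AVAILABLE;
        }
        return Publisher{m_service};
    }

  private:
    static bool pinnedMemorySupported()
    {
        return false; // placeholder for the compile-time/runtime capability check
    }

    std::string m_service;
    bool m_requirePinnedHostMemory{false};
};
```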