pycuda
Is there a way to share a context among threads, and if not, why?
Basically, I want to achieve concurrent work with multithreading; my current inference code is PyCUDA + TensorRT. Why I want to do this: I'm trying to optimize inference throughput for a model with dynamic input, where the size difference between samples can be quite significant. So I want to avoid padding but still do something similar to batching: run several samples concurrently with the same engine. The inference time will still be bottlenecked by the biggest sample in the batch, but a lot of FLOPs are saved, and it also avoids the possible performance drop from padding too much.
My current understanding of the problem: from what I understood, if work is submitted from different CUDA contexts there is no real parallelism, just better scheduling. Also, one process can only have one CUDA context, but threads can share contexts. This may not be true for PyCUDA, so I need to check. I didn't find anything yet about how to share one context among threads.
I found the official example for using multithreading with PyCUDA here: link
In that example each thread calls Device.make_context(), so there's not much difference between multithreading and multiprocessing: if each thread owns its own context, there is no real concurrent work.
My question: I just wonder if my understanding of contexts is right, and whether there is a way to share a context between different threads. I feel it should be possible; if it is not possible with PyCUDA, can anyone briefly explain why?
I just learned about the GIL. So we have to use multiprocessing, and diving into MPI is my only option if I want to stick with Python?
This also came up recently in https://github.com/inducer/pycuda/issues/305#issuecomment-887761332. PyCUDA currently assumes that each context can only be active in a single thread. It appears that this was true up until CUDA 4, but this restriction was then lifted. I would welcome a PR that removes this restriction. It might be as simple as deleting the check for uniqueness of activation.
Yes, I also saw it. I may switch to Polygraphy instead. I don't know much about CUDA wrappers, and I chose PyCUDA only because the official TensorRT examples used it, but the test code in TensorRT uses Polygraphy instead. It seems like Polygraphy hides all the details about contexts. I hope it will work.
The NVIDIA guys told me that TensorRT inference releases the GIL. That's good news; if the new feature were added, it could be useful in this case.
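To illustrate why the GIL release matters: when a C-level call drops the GIL, Python threads genuinely overlap in time. A small stand-in demonstration using time.sleep, which (like a GIL-releasing TensorRT call) releases the GIL while it waits:

```python
import threading
import time

def fake_inference(duration=0.2):
    # Stands in for a long C-level call that releases the GIL
    # (time.sleep drops the GIL while waiting).
    time.sleep(duration)

start = time.perf_counter()
threads = [threading.Thread(target=fake_inference) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# The four 0.2 s "inferences" overlap, so the total wall time is far
# less than the 0.8 s a serial run would take.
print(f"elapsed: {elapsed:.2f}s")
```

If the call held the GIL instead, the threads would serialize and the total would be roughly the sum of the individual durations.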
How come this got closed? The question you raised is a real concern to my mind, and I wouldn't be opposed to the issue staying open.
Okay, just one more quick question: I found that PyCUDA is a lot quicker than Polygraphy when doing memcpy. Do you know the reason?
PyCUDA isn't doing anything special with memcpy; it just calls the corresponding CUDA function. For an additional speed boost, you can use "page-locked" memory (on the host side).
OK, I'll try to read the source code myself...
@menglin0320 I'm in a similar situation to yours, and I have a solution: How to perform different models in different gpu simultaneously
Would fixing this also make it possible for CUDA objects to be safely garbage collected in threads where the context is not current?