
GPU Memory allocation with multiple cuda stream

Joeyzhouqihui opened this issue 3 years ago · 1 comment

Sorry to bother you!

I am trying to deploy one of our company's models, which has dynamic connections, to a production environment. Since the model is dynamically activated, batching requests together for inference is not a good idea. Instead, I want to use multiple CUDA streams to handle several requests concurrently on one GPU (one stream per request).

I have tried libtorch, since it supports multiple streams. However, I found that with libtorch, the memory allocated by each stream is cached by that stream and cannot be reused by other streams. (Suppose there is 2 GB of memory on one GPU, and stream A caches 1 GB after handling request 1. If stream B now wants to handle request 2, stream A first has to return its cached memory to the system, and stream B then has to call cudaMalloc, which is very slow.)
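The behavior described above can be illustrated with a toy simulation (plain Python, no real CUDA or libtorch; the class names, sizes, and accounting are hypothetical). It contrasts a per-stream caching allocator, where freed blocks return to the freeing stream's private cache, with a shared pool, where any stream can reuse them:

```python
# Toy model of two GPU memory caching policies (hypothetical, not real CUDA).
class PerStreamCachingAllocator:
    """Each stream keeps a private free-block cache; other streams cannot reuse it."""
    def __init__(self, total_mem):
        self.free_device_mem = total_mem       # uncached memory left on the device
        self.stream_caches = {}                # stream id -> bytes cached by that stream
        self.slow_mallocs = 0                  # counts expensive cudaMalloc-style calls

    def alloc(self, stream, size):
        cache = self.stream_caches.get(stream, 0)
        if cache >= size:                      # fast path: reuse this stream's own cache
            self.stream_caches[stream] = cache - size
            return
        needed = size - cache
        self.stream_caches[stream] = 0
        if self.free_device_mem < needed:      # must reclaim other streams' caches first
            for s in self.stream_caches:
                self.free_device_mem += self.stream_caches[s]
                self.stream_caches[s] = 0
        self.free_device_mem -= needed
        self.slow_mallocs += 1                 # fresh device allocation (slow)

    def free(self, stream, size):
        # Freed memory goes back to this stream's private cache, not a shared pool.
        self.stream_caches[stream] = self.stream_caches.get(stream, 0) + size


class SharedPoolAllocator:
    """Freed memory returns to one pool that every stream can draw from."""
    def __init__(self, total_mem):
        self.free_device_mem = total_mem
        self.cached = 0
        self.slow_mallocs = 0

    def alloc(self, stream, size):
        if self.cached >= size:                # fast path: reuse pooled memory
            self.cached -= size
            return
        self.free_device_mem -= size - self.cached
        self.cached = 0
        self.slow_mallocs += 1

    def free(self, stream, size):
        self.cached += size


# Scenario from the question: 2 GB GPU, stream A serves request 1, then stream B
# serves request 2. Sizes are in MB.
for Alloc in (PerStreamCachingAllocator, SharedPoolAllocator):
    gpu = Alloc(2_000)
    gpu.alloc("A", 1_000)                      # request 1 runs on stream A
    gpu.free("A", 1_000)                       # its memory is cached after the request
    gpu.alloc("B", 1_000)                      # request 2 runs on stream B
    print(Alloc.__name__, "slow mallocs:", gpu.slow_mallocs)
```

With per-stream caching, stream B cannot touch stream A's cached gigabyte and pays a second slow allocation; with a shared pool, stream B reuses it for free. This is the cost difference the question is worried about.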

I am wondering whether I could use TF Serving for my model. Will the same thing happen with TF Serving? Can different streams in TF Serving reuse cached GPU memory?

I am looking forward to your reply. Thank you so much!

Joeyzhouqihui avatar Sep 10 '22 14:09 Joeyzhouqihui

@Joeyzhouqihui,

Kindly take a look at the batching guide for TF Serving. You can also refer to our official documentation for more details.
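For reference, batching in TF Serving is enabled with the `--enable_batching` flag and tuned through a text-format batching parameters file. A minimal sketch (the model name, paths, and parameter values here are illustrative, not recommendations; see the batching guide for tuning advice):

```shell
# Write a batching parameters file (values are example placeholders).
cat > /tmp/batching.config <<'EOF'
max_batch_size { value: 128 }
batch_timeout_micros { value: 1000 }
max_enqueued_batches { value: 64 }
num_batch_threads { value: 8 }
EOF

# Start the model server with server-side batching enabled.
tensorflow_model_server \
  --port=8500 \
  --model_name=my_model \
  --model_base_path=/models/my_model \
  --enable_batching=true \
  --batching_parameters_file=/tmp/batching.config
```

Note that this batches concurrent requests into a single inference call rather than running them on separate CUDA streams, which is why the guide recommends it as the usual way to get GPU concurrency in TF Serving.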

singhniraj08 avatar Sep 13 '22 08:09 singhniraj08

Closing this due to inactivity. Please take a look at the answers provided above; feel free to reopen and post your comments if you still have questions. Thank you!

singhniraj08 avatar Oct 30 '22 10:10 singhniraj08