Monitoring actual GPU memory usage
Describe the problem the feature is intended to solve
I have several models loaded and I'm not sure how to tell whether TensorFlow still has memory left. I can check with nvidia-smi how much memory TensorFlow has allocated, but I couldn't find a way to check the loaded models' usage.
Describe the solution
TensorFlow could provide Prometheus metrics for the actual GPU memory usage of each loaded model.
Describe alternatives you've considered
None.
Additional context
I am not sure whether this is actually a feature request or whether it can already be done somehow.
@ctuluhu, can you please check this link and let us know if it helps?
@rmothukuru Thank you for your response, but I didn't find what I am looking for in the provided link. What I need is a way to get the amount of free memory within what TensorFlow has allocated.
Issue example: TensorFlow allocated 6 GB of memory, and I later loaded two models. How can I know how much of this 6 GB is used by the loaded models and how much is free?
Hi there, we can easily export metrics that tell you host memory consumption on a per-model basis, but I think you're specifically looking for GPU memory consumption/availability, correct? This is not straightforward, but we will discuss it internally to understand exactly how difficult it would be and provide an update. Let us know if you have any additional info that you think might be useful for us to know!
Hi, yes I am looking for a way to check GPU's memory availability. Such a feature would be great.
We also ran into this problem. TF Serving occupies all GPU memory when it starts, and there is no way to know how much memory a specific model really needs. If we deploy too many models in one server instance, it sometimes hangs and stops responding, and all connections to it time out. Thus, for multiple models, we need to do a lot of load testing to decide which models can be deployed together in one instance and which need to be deployed in another.
@ctuluhu @troycheng @unclepeddy one way to mitigate the problem is to use the environment flag TF_FORCE_GPU_ALLOW_GROWTH=true when you launch your model server. It'll grab the minimum required GPU memory at startup and gradually increase the consumption as needed. Please let me know if it works.
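For reference, a minimal sketch of what that launch might look like (the model name, base path, and ports below are placeholders for illustration, not anything from this issue):

```python
# Hypothetical launch script: set TF_FORCE_GPU_ALLOW_GROWTH so the GPU
# allocator starts small and grows on demand instead of reserving the
# whole GPU at startup.
import os
import subprocess

env = os.environ.copy()
env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

subprocess.Popen(
    [
        "tensorflow_model_server",
        "--port=8500",                         # gRPC
        "--rest_api_port=8501",                # REST
        "--model_name=my_model",               # placeholder
        "--model_base_path=/models/my_model",  # placeholder
    ],
    env=env,
)
```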
Please also see this Stack Overflow question about how to monitor memory usage using the memory_stats ops and run_metadata.
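For completeness, a rough TF 1.x sketch of that approach, assuming the tf.contrib.memory_stats ops are what the Stack Overflow answer refers to (this runs inside your own Python process, not inside TF Serving):

```python
# TF 1.x only: the memory_stats ops report the allocator's current and
# peak bytes in use for the device they are placed on.
import tensorflow as tf

with tf.device("/gpu:0"):
    bytes_in_use = tf.contrib.memory_stats.BytesInUse()
    max_bytes_in_use = tf.contrib.memory_stats.MaxBytesInUse()

with tf.Session() as sess:
    # ... run your model's fetches here ...
    current, peak = sess.run([bytes_in_use, max_bytes_in_use])
    print("current GPU bytes in use:", current)
    print("peak GPU bytes in use   :", peak)
```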
Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!
Is there any update on this issue? In TF 2.2 I still don't see an easy way to measure actual and peak memory usage.
Why was this closed?
@ctuluhu @troycheng @unclepeddy one way to mitigate the problem is to use the environment flag TF_FORCE_GPU_ALLOW_GROWTH=true when you launch your model server. It'll grab the minimum required GPU memory at startup and gradually increase the consumption as needed. Please let me know if it works.
I believe this can be considered a basic solution to the problem. But GPU memory usage cannot be fully separated per loaded model, because part of it is consumed by things like the CUDA context, which is shared among loaded models.
Meanwhile, it seems there should be a limit on each model's GPU memory growth, which should be related to the model's parameters and the max batch size configured in TF Serving.
Also, TF_FORCE_GPU_ALLOW_GROWTH=true should not affect TF Serving's latency for handling requests after the first one (assuming memory is allocated for the entire batch size). The GPU memory allocated earlier does not seem to be deallocated if no further requests arrive.
@gaocegege
Any workaround for this? At least via the metrics in Prometheus?
We still don't see an easy way to monitor GPU memory usage. Is there any progress?
It's been 2 years and 4 days, and we still don't have any update on one of the most vital parts.
Great!
Sorry for the late reply.
Could you try the memory profiling tool (https://www.tensorflow.org/guide/profiler#memory_profile_tool) to see if it helps? You could also take a look at https://www.tensorflow.org/api_docs/python/tf/config/experimental/get_memory_info, which provides the current and peak memory that TensorFlow is actually using. However, this might not work here, since the models run online on C++ model servers.
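A small sketch of what that looks like from Python (this only sees the TensorFlow allocator inside the current Python process, so it won't reflect a separate C++ model server; "GPU:0" is assumed to be the device of interest):

```python
import tensorflow as tf

# Optional: let the allocator grow on demand instead of reserving the whole GPU.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)

# ... build and run a model here ...

# Current and peak bytes used by TensorFlow's GPU:0 allocator (recent TF 2.x).
info = tf.config.experimental.get_memory_info("GPU:0")
print("current bytes:", info["current"])
print("peak bytes   :", info["peak"])
```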
@guanxinq I think people are more interested in something from the tensorflow/serving side!
An example use case would be monitoring GPU usage while serving models in production. If that information were available through some API endpoint, it could be used to scale the cluster or increase the number of backend model workers. nvidia-smi and related implementations exist, but something directly from TF Serving would definitely be more useful.
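In the meantime, a hedged sketch of the nvidia-smi-style workaround: query device-level memory via NVML from a sidecar and export it however you like. The pynvml package and the single-GPU index are assumptions here, and this reports whole-device usage, not per-model usage:

```python
# Requires the nvidia-ml-py / pynvml package; reports the same numbers
# nvidia-smi shows, for the whole device rather than per loaded model.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print("total bytes:", mem.total)
    print("used bytes :", mem.used)
    print("free bytes :", mem.free)
finally:
    pynvml.nvmlShutdown()
```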