CPU Memory occupied by TF Serving even though serving is on GPU
System information
- TensorFlow Serving v2.11 installed from source
Describe the problem
We are running TensorFlow Serving on GPU as a separate process that interacts with a CPU-side JVM process via gRPC. We observed a couple of things:
- Even though the model is loaded on the GPU, the TF Serving process also occupies significant CPU memory (~9-10 GB). We are not using any server-side batching. Why does TF Serving occupy that much CPU memory?
- TF Serving by default reserves all of the memory on the GPU. We can see in TensorBoard that the memory actually used is much less, yet TF Serving reserves the whole GPU by default. Is that expected?
For (1), below is a small pmap output. I see .so libraries in the output. Is this expected behavior? Why does TF Serving need to load these libraries into CPU memory?
```
Address           Kbytes      RSS    Dirty Mode  Mapping
total kB       173906756 14280120 13943984
00005cca84705000   10124        0        0 r---- tensorflow_model_server
00005cca850e8000  268160    26132        0 r-x-- tensorflow_model_server
00005cca956c8000  513564      940        0 r---- tensorflow_model_server
0000793ba6a15000   17236     9064        0 r-x-- libnvidia-ptxjitcompiler.so.525.85.12
0000793ba7aea000    2044        0        0 ----- libnvidia-ptxjitcompiler.so.525.85.12
00007932dc384000     128      128      128 rw--- libcusparse.so.11.7.5.86
00007932dc3a4000      36       36       36 rw--- [ anon ]
00007932dc3ad000  295496      780        0 r-x-- libcusolver.so.11.4.1.48
```
@ndeep27, TF Serving's CPU memory usage can occur because some models perform nontrivial CPU work in addition to their main GPU work. While the core matrix operations may run well on a GPU, peripheral operations may take place on the CPU, e.g. embedding lookups, vocabulary lookups, and quantization/dequantization. For more information, you can refer here.
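If it helps to double-check, here is a minimal sketch (not from this thread) that loads the SavedModel directly in TensorFlow with device-placement logging enabled, so any ops that fall back to the CPU get printed; the model path, signature name, and input name below are placeholders for your model's actual ones:

```python
import tensorflow as tf

# Print the device (CPU or GPU) that every executed op is placed on.
tf.debugging.set_log_device_placement(True)

# Hypothetical SavedModel path; point this at one exported model version.
model = tf.saved_model.load("/models/my_model/1")
infer = model.signatures["serving_default"]

# Signature functions take keyword arguments named after the signature's
# inputs; "inputs" and the tensor shape here are placeholders.
print(infer(inputs=tf.constant([[1.0, 2.0, 3.0]])))
```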
Answering your second question: TF Serving internally uses the TensorFlow runtime for model inference, and TensorFlow maps nearly all of the GPU memory of every GPU visible to the process (subject to CUDA_VISIBLE_DEVICES). This is done to use the relatively precious GPU memory on the devices more efficiently by reducing memory fragmentation. To limit GPU memory usage, you can refer here. Currently we have --per_process_gpu_memory_fraction to limit the memory usage of the model server; there is no method available to limit GPU usage at the model level. Ref: https://github.com/tensorflow/serving/pull/694
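As a usage sketch of that flag (the ports, model name, and path are hypothetical; only --per_process_gpu_memory_fraction is the flag referenced above):

```
# Cap the model server at ~40% of each visible GPU's memory instead of
# letting it reserve (nearly) all of it.
tensorflow_model_server \
  --port=8500 \
  --rest_api_port=8501 \
  --model_name=my_model \
  --model_base_path=/models/my_model \
  --per_process_gpu_memory_fraction=0.4
```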
Hope this answers your query. Thank you!
Thanks @singhniraj08. For (1), when looking at the TensorBoard profile I don't see any operations happening on the CPU; all operations are running on the GPU.
@ndeep27, looking at the pmap output, I see the tensorflow model server binary and the CUDA libraries it links against mapped into CPU memory. This is expected: the server binary and its shared libraries always load into process (CPU) memory, and if the serving model utilizes the GPU, the model server allocates GPU memory for model inferencing only. Note also that most of those library mappings are barely resident (e.g. libcusolver maps ~295 MB of address space but only 780 KB is resident), so the mapped .so sizes account for little of the resident CPU memory you are seeing. Thanks.
This issue has been marked stale because it has had no activity for the past 7 days. It will be closed if no further activity occurs. Thank you.
This issue was closed due to lack of activity after being marked stale for the past 7 days.