
[Bug] libcupti related error in official tensorflow/serving image prevents tensorboard profiler from gathering GPU usage

Open BorisPolonsky opened this issue 5 years ago • 14 comments

Bug Report

If this is a bug report, please fill out the following form in full:

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • TensorFlow Serving installed from (source or binary): official Docker image: tensorflow/serving:2.3.0-gpu
  • TensorFlow Serving version: 2.3.0-gpu

Describe the problem

I want to profile the GPU usage of a model served by the GPU version of the tensorflow/serving container with TensorBoard. However, after clicking the "Capture" button in the TensorBoard UI, it reports <tensorflow-serving-container-id>: Failed to load libcupti (is it installed and accessible?), and I got profiling results for the CPU only, with no GPU-related data displayed. It turns out that /usr/local/cuda/extras/CUPTI/lib64 (the path to libcupti) is not included in $LD_LIBRARY_PATH in the official tensorflow/serving image by default. I tried launching the container from tensorflow/serving:2.3.0-gpu with -e LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64, and it reports <tensorflow-serving-container-id>: Insufficient privilege to run libcupti (you need root permission). instead; no GPU usage is displayed either, even though I ran both containers as root.

Exact Steps to Reproduce

  • For reproducing the first error, which prompts "<tensorflow-serving-container-id>: Failed to load libcupti (is it installed and accessible?)":
    • Execute docker run -d --name "${NAME}" -p 8500:8500 -p 8501:8501 -v "${PWD}/models:/models" -v /etc/localtime:/etc/localtime:ro -v "${PWD}/config:/etc/tensorflow-serving/config:ro" -v "${PWD}/batching_config:/etc/tensorflow-serving/batching_config:ro" --gpus all ${IMAGE} --model_config_file=/etc/tensorflow-serving/config --enable_batching=true --batching_parameters_file=/etc/tensorflow-serving/batching_config
    • Launch tensorboard in another container (I used tensorflow/tensorflow:2.3.0-gpu with tensorboard-plugin-profile installed) with --link for routing gRPC traffic to the tensorflow-serving container
    • Continuously send prediction requests to the model
    • Click "Capture" and enter the gRPC port and URL of tensorflow/serving
  • For reproducing the latter error:
    • Change the command in the first step to docker run -d --name "${NAME}" -p 8500:8500 -p 8501:8501 -v "${PWD}/models:/models" -v /etc/localtime:/etc/localtime:ro -v "${PWD}/config:/etc/tensorflow-serving/config:ro" -v "${PWD}/batching_config:/etc/tensorflow-serving/batching_config:ro" -e LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 --gpus all ${IMAGE} --model_config_file=/etc/tensorflow-serving/config --enable_batching=true --batching_parameters_file=/etc/tensorflow-serving/batching_config (i.e. overriding LD_LIBRARY_PATH with the -e option)
    • Follow the rest of the steps used to replicate the previous error.

Source code / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.

BorisPolonsky avatar Aug 13 '20 07:08 BorisPolonsky

I think this is related to: https://github.com/tensorflow/tensorflow/issues/2626

calhamd avatar Aug 25 '20 02:08 calhamd

I think this is related to: tensorflow/tensorflow#2626

Thanks for the reply. I've gone through that thread, and it doesn't seem to be a problem on tensorflow's side, though something similar may be happening in this repo. I launched tensorflow/serving in one container and tensorboard in a tensorflow/tensorflow container (with the --link option). The error message changes as I change the LD_LIBRARY_PATH environment variable in the tensorflow/serving container, so from my perspective the profiling is done in the tensorflow/serving container, not the tensorflow/tensorflow container. In /usr/local/cuda/extras/CUPTI/lib64 (the path I configured in $LD_LIBRARY_PATH) of the tensorflow/serving image, the following files exist:

libcupti.so -> libcupti.so.10.1
libcupti.so.10.1 -> libcupti.so.10.1.208
libcupti.so.10.1.208
libcupti_static.a
libnvperf_host.so
libnvperf_host_static.a
libnvperf_target.so

So the library is not missing in this case. Perhaps something is wrong with the official Dockerfile?
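To double-check resolution, here is an illustrative sketch one could run inside the container (the directory paths mirror the report above and are not guaranteed for other images):

```shell
# Illustrative check (paths copied from this report; adjust for your image):
# split a candidate LD_LIBRARY_PATH and report which entries actually
# contain libcupti.so, to confirm the dynamic loader could resolve it.
candidate="/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64"
checked=0
old_ifs=$IFS; IFS=':'
for dir in $candidate; do
  checked=$((checked + 1))
  if [ -e "$dir/libcupti.so" ]; then
    echo "found:   $dir/libcupti.so"
  else
    echo "missing: $dir"
  fi
done
IFS=$old_ifs
```

If the CUPTI entry shows up as "found" but the error persists, the problem is more likely permissions than library resolution.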

BorisPolonsky avatar Aug 25 '20 03:08 BorisPolonsky

The $LD_LIBRARY_PATH value I set via -e looks exactly the same as the one in this pull request, yet that workaround doesn't work in my case.

BorisPolonsky avatar Aug 25 '20 03:08 BorisPolonsky

@BorisPolonsky, The issue here is that, since you want to use TensorFlow Serving with GPU, the docker run command should include the --runtime argument.

Sample command is shown below:

docker run --runtime=nvidia -p 8501:8501 \
--mount type=bind,\
source=/tmp/tfserving/serving/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_gpu,\
target=/models/half_plus_two \
  -e MODEL_NAME=half_plus_two -t tensorflow/serving:latest-gpu &

For more details on how to use TF Serving with GPU, please refer to these GitHub instructions.

rmothukuru avatar Aug 31 '20 10:08 rmothukuru

@BorisPolonsky, The issue here is that, since you want to use TensorFlow Serving with GPU, the docker run command should include the --runtime argument.

Sample command is shown below:

docker run --runtime=nvidia -p 8501:8501 \
--mount type=bind,\
source=/tmp/tfserving/serving/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_gpu,\
target=/models/half_plus_two \
  -e MODEL_NAME=half_plus_two -t tensorflow/serving:latest-gpu &

For more details on how to use TF Serving with GPU, please refer to these GitHub instructions.

I don't think that's the source of the problem, as I'm not using nvidia-docker 2 (since it's deprecated). I'm using nvidia-docker with native GPU support, which comes with the --gpus option and works essentially the same way as --runtime=nvidia in nvidia-docker 2. The model server itself runs fine on GPU via this --gpus option for me. The GPU version of the official tensorflow/tensorflow image (and of other DL frameworks) works with this option too. I assume the problem lies in tensorflow/serving.

BorisPolonsky avatar Aug 31 '20 15:08 BorisPolonsky

@BorisPolonsky, Can you please let us know what is the source of the information,

nvidia-docker 2 is deprecated, use Native GPU Support

because I don't find that information in Github Nvidia Docker Repo. Also, in the Official TF Serving Documentation, it is mentioned as

TIP: If you're running a GPU image, be sure to run using the NVIDIA runtime --runtime=nvidia.

So, using the --runtime argument should be the right way to use GPU in TF Serving. Please correct me if you feel something is not right. Thanks!

rmothukuru avatar Sep 02 '20 11:09 rmothukuru

TIP: If you're running a GPU image, be sure to run using the NVIDIA runtime --runtime=nvidia.

It was definitely declared deprecated on the exact wiki page you mentioned on the day I made my last post. But it looks like they've updated the wiki recently (I've just checked again): they had declared 1.0 and 2.0 as (deprecated), and the pages carrying that statement were there for over 6 months. The commit history of their README.md is the evidence: the statement in line 17 declaring that "nvidia-docker2 is deprecated" was removed in this commit, which they made yesterday. I've been using the "native GPU support" installation for over 6 months and the GPU versions of Docker images worked just fine. I can't tell whether they've messed up the "native GPU support" version of nvidia-docker or whether it's on tensorflow/serving. I hope you guys can find out.

BorisPolonsky avatar Sep 03 '20 07:09 BorisPolonsky

@BorisPolonsky, Can you please let us know what is the source of the information,

nvidia-docker 2 is deprecated, use Native GPU Support

because I don't find that information in Github Nvidia Docker Repo. Also, in the Official TF Serving Documentation, it is mentioned as

TIP: If you're running a GPU image, be sure to run using the NVIDIA runtime --runtime=nvidia.

So, using the --runtime argument should be the right way to use GPU in TF Serving. Please correct me if you feel something is not right. Thanks!

@rmothukuru Hello, please find NVIDIA's official instructions for the --gpus option here.
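For reference, the native-GPU-support invocation referenced above looks like this (a minimal sketch following NVIDIA's documentation; the CUDA image tag is illustrative, pick one matching your driver):

```shell
# Smoke-test that --gpus works at all, independent of TF Serving:
docker run --rm --gpus all nvidia/cuda:10.1-base nvidia-smi

# The equivalent nvidia-docker 2 form uses the runtime flag instead:
docker run --rm --runtime=nvidia nvidia/cuda:10.1-base nvidia-smi
```

If nvidia-smi prints the GPU table in both cases, the two mechanisms are interchangeable for exposing the device, which is why I don't believe --gpus vs. --runtime explains the CUPTI failure.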

BorisPolonsky avatar Sep 18 '20 03:09 BorisPolonsky

I tested again today and here's the log in tensorflow-serving container:

2020-11-04 10:46:30.997583: I tensorflow_serving/model_servers/server.cc:387] Exporting HTTP/REST API at:localhost:8501 ...
2020-11-04 10:50:50.816298: I external/org_tensorflow/tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-11-04 10:50:50.817199: I external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1391] Profiler found 1 GPUs
2020-11-04 10:50:50.823733: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcupti.so.10.1
2020-11-04 10:50:50.925046: E external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
2020-11-04 10:50:52.073124: I external/org_tensorflow/tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 0 callback api events and 0 activity events. 
2020-11-04 10:50:52.077957: I external/org_tensorflow/tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-11-04 10:50:52.078576: E external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
2020-11-04 10:50:53.173021: I external/org_tensorflow/tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 0 callback api events and 0 activity events. 
2020-11-04 10:50:53.177718: I external/org_tensorflow/tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-11-04 10:50:53.177884: E external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_NOT_INITIALIZED
2020-11-04 10:50:54.332059: I external/org_tensorflow/tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 0 callback api events and 0 activity events. 
2020-11-04 10:50:54.334406: I external/org_tensorflow/tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-11-04 10:50:54.334569: E external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_NOT_INITIALIZED
2020-11-04 10:50:55.440025: I external/org_tensorflow/tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 0 callback api events and 0 activity events. 
2020-11-04 10:51:43.447888: I external/org_tensorflow/tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-11-04 10:51:43.448056: E external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_NOT_INITIALIZED
2020-11-04 10:51:48.898234: I external/org_tensorflow/tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 0 callback api events and 0 activity events. 
2020-11-04 10:51:48.958310: I external/org_tensorflow/tensorflow/core/profiler/rpc/client/save_profile.cc:176] Creating directory: ./plugins/profile/2020_11_04_02_51_43
2020-11-04 10:51:48.960894: I external/org_tensorflow/tensorflow/core/profiler/rpc/client/save_profile.cc:182] Dumped gzipped tool data for trace.json.gz to ./plugins/profile/2020_11_04_02_51_43/tensorflow-serving.trace.json.gz
2020-11-04 11:02:07.148432: I external/org_tensorflow/tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-11-04 11:02:07.148549: E external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_NOT_INITIALIZED
2020-11-04 11:02:12.638887: I external/org_tensorflow/tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 0 callback api events and 0 activity events. 
2020-11-04 11:02:12.649252: I external/org_tensorflow/tensorflow/core/profiler/rpc/client/save_profile.cc:176] Creating directory: ./plugins/profile/2020_11_04_03_02_07
2020-11-04 11:02:12.655819: I external/org_tensorflow/tensorflow/core/profiler/rpc/client/save_profile.cc:182] Dumped gzipped tool data for trace.json.gz to ./plugins/profile/2020_11_04_03_02_07/tensorflow-serving.trace.json.gz
2020-11-04 11:38:08.845835: I external/org_tensorflow/tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-11-04 11:38:08.845979: E external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_NOT_INITIALIZED
2020-11-04 11:38:15.390819: I external/org_tensorflow/tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 0 callback api events and 0 activity events. 
2020-11-04 11:38:15.449769: I external/org_tensorflow/tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-11-04 11:38:15.449851: E external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_NOT_INITIALIZED
2020-11-04 11:38:21.979389: I external/org_tensorflow/tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 0 callback api events and 0 activity events. 
2020-11-04 11:38:21.984079: I external/org_tensorflow/tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-11-04 11:38:21.984233: E external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_NOT_INITIALIZED
2020-11-04 11:38:28.528253: I external/org_tensorflow/tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 0 callback api events and 0 activity events. 
2020-11-04 11:38:28.531801: I external/org_tensorflow/tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-11-04 11:38:28.531925: E external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_NOT_INITIALIZED
2020-11-04 11:38:35.043900: I external/org_tensorflow/tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 0 callback api events and 0 activity events. 
2020-11-04 11:38:35.047809: I external/org_tensorflow/tensorflow/core/profiler/rpc/client/save_profile.cc:176] Creating directory: ./plugins/profile/2020_11_04_03_38_08
2020-11-04 11:38:35.048064: I external/org_tensorflow/tensorflow/core/profiler/rpc/client/save_profile.cc:182] Dumped gzipped tool data for trace.json.gz to ./plugins/profile/2020_11_04_03_38_08/tensorflow-serving.trace.json.gz
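For what it's worth, the CUPTI_ERROR_INSUFFICIENT_PRIVILEGES line above matches NVIDIA's documented restriction of GPU performance counters to admin users. The following host-side sketch is an assumption on my part (untested here), based on NVIDIA's published guidance for this error class:

```shell
# Assumption, not a confirmed fix for this issue: allow non-admin access
# to GPU performance counters on the *host* (takes effect after a driver
# module reload or reboot). The file name is illustrative.
echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' | \
  sudo tee /etc/modprobe.d/nvidia-profiling.conf
```

After relaxing the restriction, the profiler inside the container would still need the CUPTI library on its LD_LIBRARY_PATH as discussed earlier in the thread.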

BorisPolonsky avatar Nov 04 '20 03:11 BorisPolonsky

The same error. I run tensorboard --logdir="/mnt/d/zhaodachuan/data/ranking_model/model_output/20201214/dcn_test/model_callback/tensorboard" --host=0.0.0.0 --port=7001 to start TensorBoard. It displays this error under the PROFILE tab:

ERRORS
bigdata-gpu: Failed to load libcupti (is it installed and accessible?)
WARNINGS
No step marker observed and hence the step time is unknown. This may happen if (1) training steps are not instrumented (e.g., if you are not using Keras) or (2) the profiling duration is shorter than the step time. For (1), you need to add step instrumentation; for (2), you may try to profile longer.

I have installed libcupti:

/sbin/ldconfig -N -v $(sed 's/:/ /g' <<< $LD_LIBRARY_PATH) | grep libcupti

outputs :

        libcupti.so.10.1 -> libcupti.so.10.1.208

DachuanZhao avatar Dec 14 '20 10:12 DachuanZhao

I updated both the tensorflow/serving and tensorflow/tensorflow images to 2.4.0-gpu, and now I get an empty profile every time. TensorBoard created events.out.tfevents.1608174858.faf5bf56d686.profile-empty in the logdir, prompted Capture profile successfully, please refresh, and shows No profile data was found. after an automatic page refresh.

BorisPolonsky avatar Dec 17 '20 03:12 BorisPolonsky

Has anyone solved this issue? I am also getting this error. The LD_LIBRARY_PATH is incorrect in my TF Serving container, using version 2.3: there is no nvidia folder.

root@389958af6602:~# echo $LD_LIBRARY_PATH
/usr/local/nvidia/lib:/usr/local/nvidia/lib64

root@389958af6602:~# ls /usr/local/
bin  cuda  cuda-10.1  etc  games  include  lib  man  sbin  share  src

What should LD_LIBRARY_PATH be set to in the TFServing container?

This is how I'm building the TFServing container.

docker run -d --name serving_base_wildlife_gpu tensorflow/serving:2.3.0-gpu

rbavery avatar Feb 18 '21 23:02 rbavery

I tried exporting LD_LIBRARY_PATH like so

root@389958af6602:~# export LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64

since this folder contains libcupti.so.10.1

root@389958af6602:~# ls /usr/local/cuda/extras/CUPTI/lib64
libcupti.so           libcupti_static.a        libnvperf_target.so
libcupti.so.10.1      libnvperf_host.so
libcupti.so.10.1.208  libnvperf_host_static.a

but when I capture a profile I get a new error

2021-02-19 00:00:44.337578: E external/org_tensorflow/tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.

Not sure how to debug this or what to try next to get the profiler working with TFServing.
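One thing that may be worth trying (a guess on my part, not a verified fix): the export above replaces LD_LIBRARY_PATH wholesale, dropping whatever entries the image already had, so prepending the CUPTI directory instead might avoid the "symbol could not be found" failure:

```shell
# Prepend the CUPTI directory while keeping any existing entries.
# The directory path is the one observed in the 2.3.0-gpu image.
cupti_dir="/usr/local/cuda/extras/CUPTI/lib64"
export LD_LIBRARY_PATH="$cupti_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
```

The ${VAR:+...} expansion only appends the old value when it is non-empty, so the result never ends in a stray colon.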

rbavery avatar Feb 19 '21 00:02 rbavery

Any updates @minglotus-6 ? I tried this solution (see comment above) but it didn't work. https://github.com/tensorflow/tensorflow/issues/2626#issuecomment-261685470

rbavery avatar Feb 26 '21 18:02 rbavery

@BorisPolonsky,

Can you please follow the guide Profile Inference Requests with TensorBoard for a step-by-step process to set up TensorBoard for your tensorflow/serving image. Also make sure the prerequisites (TensorFlow>=2.0.0 and TensorBoard, which should be installed if TF was installed via pip) are met. Thank you!

singhniraj08 avatar Apr 14 '23 10:04 singhniraj08

This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.

github-actions[bot] avatar Apr 22 '23 01:04 github-actions[bot]

This issue was closed due to lack of activity after being marked stale for past 7 days.

github-actions[bot] avatar Apr 29 '23 01:04 github-actions[bot]


It's been 3 years, and my job doesn't require me to write any TF code now...

BorisPolonsky avatar Apr 29 '23 14:04 BorisPolonsky