
Capture profile on tensorboard profiler not capturing the profiling data

Open ganand1 opened this issue 5 years ago • 24 comments

Describe the problem the feature is intended to solve

The TensorBoard profiler is not working with the latest TensorFlow Serving master commit.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • TensorFlow Serving installed from (source or binary): built from source
  • TensorFlow Serving version: master

Describe the problem

TensorBoard profiling is not working with TensorFlow Serving built from master.

Exact Steps to Reproduce

Steps: (1) Build the TensorFlow model server from master using the Dockerfile below: https://github.com/tensorflow/serving/blob/master/tensorflow_serving/tools/docker/Dockerfile.devel. Make sure the tensorflow-serving-api package is saved for installing in the client:

bazel build --color=yes --curses=yes \
    ${TF_SERVING_BAZEL_OPTIONS} \
    --verbose_failures \
    --output_filter=DONT_MATCH_ANYTHING \
    ${TF_SERVING_BUILD_OPTIONS} \
    tensorflow_serving/tools/pip_package:build_pip_package && \
    bazel-bin/tensorflow_serving/tools/pip_package/build_pip_package \
    /tmp/pip

(2) Deploy the TensorFlow model server using the Docker image from https://github.com/tensorflow/serving/blob/master/tensorflow_serving/tools/docker/Dockerfile
(3) Create a client virtual environment and install the tensorflow-serving-api package built in (1)
(4) Start TensorBoard and go to the Profile tab
(5) From the client, run the deployed model
(6) Click "Capture Profile" while the model is running: no profile data is captured, only an empty profile in the TensorBoard logs directory.
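In condensed form, steps (2) through (6) look roughly like this (a sketch only; the image tag, model name, paths, and the wheel filename are illustrative placeholders):

# (2) deploy the model server built from the Dockerfile above
docker run -t --rm -p 8500:8500 -p 8501:8501 \
    -v "$MODEL_DIR:/models/my_model" -e MODEL_NAME=my_model my_serving_image

# (3) client virtual environment with the tensorflow-serving-api wheel saved in (1)
python3 -m venv client_venv && . client_venv/bin/activate
pip install /tmp/pip/tensorflow_serving_api-*.whl

# (4) start tensorboard with the profile plugin and open the Profile tab
pip install tensorboard tensorboard-plugin-profile
tensorboard --logdir /tmp/tensorboard --bind_all

# (5)/(6) run the client against the model, then click "Capture Profile" in the
# Profile tab with the profile service URL set to the gRPC port, e.g. localhost:8500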

The same steps work fine using r2.3 of TensorFlow Serving.

Describe alternatives you've considered

Tried installing tf-nightly / tb-nightly / tbp-nightly; that didn't help. Also built the tensorflow-serving-api package after updating https://github.com/tensorflow/serving/blob/master/tensorflow_serving/tools/pip_package/setup.py#L47 to _TF_REQ = ['tf-nightly']; this installs tf-nightly and tb-nightly, but the captured profile data is still empty.

Source code / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.

ganand1 avatar Oct 23 '20 01:10 ganand1

@ganand1, can you please refer to this documentation on how to use the TensorBoard profiler while performing inference with TF Serving? Thanks!

rmothukuru avatar Oct 30 '20 06:10 rmothukuru

Thank you for the link. I did follow it and tried again. On clicking Capture Profile it shows an empty profile; it doesn't capture the profile properly.

If I use 2.3 it works fine.

When I use master, the captured profile is empty.

ganand1 avatar Oct 30 '20 17:10 ganand1

@ganand1, for TensorFlow Serving, the latest stable version is 2.3.0, and AFAIK it is recommended to use that version. Once internal testing is done on the master branch, a new stable version will be cut.

rmothukuru avatar Nov 02 '20 07:11 rmothukuru

Thanks, but when we use the master code we are not able to capture a profile. Any suggestions on how to get profiling data when TensorBoard is not capturing the profile?

Thank you

ganand1 avatar Nov 02 '20 18:11 ganand1

Today we fixed a bug on the profiler side (https://github.com/tensorflow/tensorflow/commit/2baf9981d73fb3d12ba4331a3b3f5d82894e6aae). Can you try this again on TF master in ~1 day (once TF Serving catches up with the above TF commit)?

netfs avatar Dec 02 '20 04:12 netfs

@yisitu do you know if https://github.com/tensorflow/tensorflow/commit/2baf9981d73fb3d12ba4331a3b3f5d82894e6aae will fix this issue (seen on master TF serving repo, but not on 2.3 release)?

netfs avatar Dec 02 '20 04:12 netfs

Hi, I believe it will, but have not verified it.

yisitu avatar Dec 02 '20 05:12 yisitu

I tried running on 2.4.0 and I couldn't capture a profile (https://github.com/tensorflow/serving/tree/2.4.0). I used that branch and built with https://github.com/tensorflow/serving/blob/master/tensorflow_serving/tools/docker/Dockerfile.devel

Please find the attached pip list; I am wondering why the 2.4.0 release is pointing to tensorflow 2.4.0rc4. piplist.txt

installing from pypi seems fine. I do see the tagged version

ganand1 avatar Dec 17 '20 22:12 ganand1

installing from pypi seems fine. I do see the tagged version

Could you elaborate? Do you mean that with an installation from pypi you can capture a profile?

yisitu avatar Dec 17 '20 23:12 yisitu

So I did pip install tensorflow-serving-api, and that has the versions in the attached pip list (piplist_pypi.txt); this looks fine to me.

Still, the profile didn't work; I got empty profiles.

ganand1 avatar Dec 17 '20 23:12 ganand1

I'm using the official tensorflow/serving:2.4.0-gpu image and I can't capture profiling data either. TensorBoard created events.out.tfevents.1608174858.faf5bf56d686.profile-empty in the logdir, prompted "Capture profile successfully, please refresh", and shows "No profile data was found." after an automatic page refresh.

BorisPolonsky avatar Dec 22 '20 08:12 BorisPolonsky

Same problem here, but I am on TF Serving 2.1.0. Is it a known bug for older versions?

Jxic avatar Jan 28 '21 12:01 Jxic

Same problem here, but I am on TF Serving 2.1.0. Is it a known bug for older versions?

More info: tensorflow/serving:2.1.0, tensorboard 2.4.1. On clicking capture, it says something like "Capture profile successfully", but generates the file events.out.tfevents.1611836545.VM-28-227-centos.profile-empty

Jxic avatar Jan 28 '21 12:01 Jxic

It has been happening since 2.3.0, as far as I have tried... you might have to use tensorflow 2.1.0 with tensorboard 2.1.0. It would be good to keep the PyPI package versions the same.
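For example, something along these lines (the versions here are only illustrative of keeping everything aligned):

pip install tensorflow==2.1.0 tensorboard==2.1.0 tensorflow-serving-api==2.1.0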

ganand1 avatar Jan 28 '21 18:01 ganand1

It has been happening since 2.3.0, as far as I have tried... you might have to use tensorflow 2.1.0 with tensorboard 2.1.0. It would be good to keep the PyPI package versions the same.

Thanks for your suggestions! It did capture some useful information when I switched my tensorboard and tensorflow to previous versions, but then it failed to show anything on the screen, and all I got was a blank page under the profile tab. So I tried to load the trace data using tensorboard 2.4.1 and somehow it works. Kinda weird that I have to switch between versions back and forth to get something done. But still, appreciate your advice!

Jxic avatar Jan 29 '21 02:01 Jxic

Hi, could you reassign the case to me? Thanks!

yisitu avatar Mar 16 '21 01:03 yisitu

Hi, sorry for the late reply. I think the missing step is to bind a host volume at the location where TensorBoard expects the profile data to be written.

(venv_TfModelServer) yisitu@a-yisitu-tensorflow-serving:~$ docker run -it --rm -p 8500:8500  -p 8501:8501 \
 --network=host \
+ -v /tmp/tensorboard:/tmp/tensorboard \
 -v /tmp/tfs:/tmp/tfs \
 -v "$TESTDATA/saved_model_half_plus_two_cpu:/models/half_plus_two" -e MODEL_NAME=half_plus_two yisitu/mybuild2 

Then start tensorboard and point it to the mounted host directory.

(venv_tensorboard) yisitu@a-yisitu-tensorflow-serving:~$ tensorboard \
+--logdir=/tmp/tensorboard \
 --bind_all --logtostderr --verbosity=9

With that I was able to get a profile (image below) with the setup described in comment 1:

tensorboard            2.4.1
tensorboard-plugin-wit 1.8.0
tensorflow             2.4.1
tensorflow-estimator   2.4.0
tensorflow-serving-api 0.0.0

[profile screenshot]

yisitu avatar Mar 16 '21 02:03 yisitu

Hi, sorry for the late reply. I think the missing step is to bind a host volume at the location where TensorBoard expects the profile data to be written.

(venv_TfModelServer) yisitu@a-yisitu-tensorflow-serving:~$ docker run -it --rm -p 8500:8500  -p 8501:8501 \
 --network=host \
+ -v /tmp/tensorboard:/tmp/tensorboard \
 -v /tmp/tfs:/tmp/tfs \
 -v "$TESTDATA/saved_model_half_plus_two_cpu:/models/half_plus_two" -e MODEL_NAME=half_plus_two yisitu/mybuild2 

Then start tensorboard and point it to the mounted host directory.

(venv_tensorboard) yisitu@a-yisitu-tensorflow-serving:~$ tensorboard \
+--logdir=/tmp/tensorboard \
 --bind_all --logtostderr --verbosity=9

With that I was able to get a profile (image below) with the setup described in comment 1:

tensorboard            2.4.1
tensorboard-plugin-wit 1.8.0
tensorflow             2.4.1
tensorflow-estimator   2.4.0
tensorflow-serving-api 0.0.0

[profile screenshot]

I mounted /home via -v /home:/home when launching both containers listed below:

  • container (with profiler installed via pip) from image tensorflow/tensorflow:2.4.1-gpu, for launching tensorboard
  • container from image tensorflow/serving:2.4.1-gpu, for launching the model server.

and connected the two containers to the same network via docker network connect. Then I specified --logdir as some sub-path under /home and the profile service URL as <tensorflow-serving-container-name>:8500. The page prompts "please refresh" and no data is displayed after clicking capture. All I got is an events.out.tfevents.1616482451.d4794e6c1ed5.profile-empty file created by TensorBoard. I've also tried using previous versions of tensorflow & serving without success.

Do I have to mount -v /tmp/tfs:/tmp/tfs for both containers besides connecting them to the same docker network?

BorisPolonsky avatar Mar 23 '21 06:03 BorisPolonsky

Do I have to mount -v /tmp/tfs:/tmp/tfs for both containers besides connecting them to the same docker network?

No, I don't think that is necessary.

Could you log in to the Docker container and write a file to /tmp/tensorboard, to see if that file is visible (to TensorBoard) outside of the container?
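For example, something like this (the container name is a placeholder):

# inside the serving container: write a marker file where the profile data should land
docker exec <serving-container> sh -c 'touch /tmp/tensorboard/marker'

# on the host (or inside the TensorBoard container): the marker should be visible
ls /tmp/tensorboard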

It also could be that your capture duration is too small.

Also check that you have these versions:

tensorboard            2.4.1
tensorboard-plugin-wit 1.8.0
tensorflow             2.4.1
tensorflow-estimator   2.4.0
tensorflow-serving-api 0.0.0

yisitu avatar Mar 31 '21 15:03 yisitu

Do I have to mount -v /tmp/tfs:/tmp/tfs for both containers besides connecting them to the same docker network?

No, I don't think that is necessary.

Could you log in to the Docker container and write a file to /tmp/tensorboard, to see if that file is visible (to TensorBoard) outside of the container?

It also could be that your capture duration is too small.

Also check that you have these versions:

tensorboard            2.4.1
tensorboard-plugin-wit 1.8.0
tensorflow             2.4.1
tensorflow-estimator   2.4.0
tensorflow-serving-api 0.0.0

Could you log in to the Docker container and write a file to /tmp/tensorboard, to see if that file is visible (to TensorBoard) outside of the container?

No, it isn't. (Since I didn't mount /tmp/tensorboard for both containers.)

It also could be that your capture duration is too small.

No, I set the duration to 5 seconds and the page refreshes as soon as I send the request. I've tested multiple times already.

I would expect that if requests between the two containers (the one that launches TensorBoard and the other running the model server) can be delivered to each other via the Docker network configuration, then profiling should work without sharing a specific volume (in this case /tmp/tensorboard). Please clarify whether the transmission of profiling data between the containers relies on writing/reading intermediate data to/from shared disk storage in addition to the network connection.

BorisPolonsky avatar Apr 06 '21 01:04 BorisPolonsky

I have used the steps mentioned here for profiling TF Serving, but I am getting 'Failed to capture profile: empty trace result.' Although 'events.out.tfevents.1617711931.**.profile-empty' is created, no events are written.

TensorBoard error logs for it:

2021-04-06 12:47:33.467885: I tensorflow/core/profiler/rpc/client/capture_profile.cc:198] Profiler delay_ms was 0, start_timestamp_ns set to 1617713253467873633 [2021-04-06T12:47:33.467873633+00:00]
Starting to trace for 1000 ms. Remaining attempt(s): 0
2021-04-06 12:47:33.467933: I tensorflow/core/profiler/rpc/client/remote_profiler_session_manager.cc:75] Deadline set to 2021-04-06T12:48:33.459801425+00:00 because max_session_duration_ms was 60000 and session_creation_timestamp_ns was 1617713253459801425 [2021-04-06T12:47:33.459801425+00:00]
2021-04-06 12:47:33.468318: I tensorflow/core/profiler/rpc/client/profiler_client.cc:113] Asynchronous gRPC Profile() to localhost:6006
2021-04-06 12:47:33.468386: I tensorflow/core/profiler/rpc/client/remote_profiler_session_manager.cc:96] Issued Profile gRPC to 1 clients
2021-04-06 12:47:33.468403: I tensorflow/core/profiler/rpc/client/profiler_client.cc:131] Waiting for completion.
E0406 12:47:33.469405 139666310461184 _internal.py:113] 127.0.0.1 - - [06/Apr/2021 12:47:33] code 505, message Invalid HTTP version (2.0)
2021-04-06 12:47:33.469757: E tensorflow/core/profiler/rpc/client/profiler_client.cc:154] Unknown: Stream removed
2021-04-06 12:47:33.469786: W tensorflow/core/profiler/rpc/client/capture_profile.cc:133] No trace event is collected from localhost:6006
2021-04-06 12:47:33.469795: W tensorflow/core/profiler/rpc/client/capture_profile.cc:145] localhost:6006 returned Unknown: Stream removed
No trace event is collected after 4 attempt(s). Perhaps, you want to try again (with more attempts?).
Tip: increase number of attempts with --num_tracing_attempts.

dkalim avatar Apr 06 '21 12:04 dkalim

No, it isn't. (Since I didn't mount /tmp/tensorboard for both containers.)

The design does not work the way you have described. The serving container produces profile data, which is written to disk. TensorBoard needs to load this data from disk. Since your setup has TensorBoard running in another container, you'll need to make sure that the location of the profile data (written by the serving container) is also readable by the TensorBoard container.

I suggest mounting the host's /tmp/tensorboard (as described above) to both serving and tensorboard containers and trying again.
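Roughly like this (image tags, the model path, and the model name are illustrative; the key point is that both containers mount the same host directory at the same path):

# serving container: profile data ends up under /tmp/tensorboard
docker run -t --rm -p 8500:8500 -p 8501:8501 \
    -v /tmp/tensorboard:/tmp/tensorboard \
    -v "$MODEL_DIR:/models/half_plus_two" -e MODEL_NAME=half_plus_two \
    tensorflow/serving:2.4.1

# tensorboard container: reads profiles from the same host directory
docker run -t --rm -p 6006:6006 \
    -v /tmp/tensorboard:/tmp/tensorboard \
    tensorflow/tensorflow:2.4.1 \
    bash -c "pip install tensorboard-plugin-profile && tensorboard --logdir /tmp/tensorboard --bind_all"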

yisitu avatar Apr 08 '21 18:04 yisitu

No, it isn't. (Since I didn't mount /tmp/tensorboard for both containers.)

The design does not work the way you have described. The serving container produces profile data, which is written to disk. TensorBoard needs to load this data from disk. Since your setup has TensorBoard running in another container, you'll need to make sure that the location of the profile data (written by the serving container) is also readable by the TensorBoard container.

I suggest mounting the host's /tmp/tensorboard (as described above) to both serving and tensorboard containers and trying again.

This time I tried creating a Docker volume and mounting it as /tmp/tensorboard in both containers (BTW, for the container that launches TensorBoard I specified the read-only option when mounting /tmp/tensorboard). I have confirmed that the two containers now share /tmp/tensorboard by manually creating test files. However, I did not see anything written to /tmp/tensorboard after specifying tensorflow-serving:8500 (I assume this is the correct URL and port; if I specify any port/URL other than this, TensorBoard immediately prompts "empty trace result" instead of "capture profile successfully") and clicking "capture". No profiling data was displayed in TensorBoard, as described in previous posts.

BorisPolonsky avatar Apr 13 '21 02:04 BorisPolonsky

It's a simple understanding issue. TensorBoard expects an event within 1000 ms (the duration set while configuring the profile service URL), but no event is triggered in that window. You can try increasing this to 10000 ms, and once you press capture, make a prediction call to the model within the next 10 s. This should solve the error.
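For example, start the capture with a 10000 ms duration and, while it is running, send a request such as the one below (the REST port and model name are placeholders; use whatever your model expects):

curl -X POST http://localhost:8501/v1/models/half_plus_two:predict \
    -d '{"instances": [1.0, 2.0, 5.0]}'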

dkalim avatar Aug 12 '21 08:08 dkalim

@BorisPolonsky @ganand1 Did you folks ever solve this? I'm having similar issues seeing any profiling data with tf serving > 2.3

kyle-jarvis avatar Oct 31 '22 11:10 kyle-jarvis

Silly mistake in my case. In case anyone comes here in the future struggling, I found it useful to run a setup like this:

#!/bin/bash

SCRIPT_DIR=$(dirname $(readlink -f $0))

TENSORFLOW_SERVING_VERSION=2.7.4
TENSORFLOW_VERSION_FOR_PROFILER=2.7.4

DOCKER_TF_SERVING_HOSTNAME=serving
TF_SERVING_MODEL_NAME=<YOUR_MODEL_NAME>

LOCAL_DIR_FOR_BIND_MOUNTS=tf_logs_dir
TMP_TF_LOGDIR=${SCRIPT_DIR}/${LOCAL_DIR_FOR_BIND_MOUNTS}

# Create directory to be mounted to profiler and tf serving container
MKDIR_CMD="mkdir -p ${TMP_TF_LOGDIR}"
echo $MKDIR_CMD
eval $MKDIR_CMD

# Create dockerfile for profiler:
DOCKER_PROFILER_TAG=tensorboard_profiler:latest

echo """FROM tensorflow/tensorflow:${TENSORFLOW_VERSION_FOR_PROFILER}

RUN pip install -U tensorboard-plugin-profile

# quotes are escaped so the generated Dockerfile keeps the exec-form ENTRYPOINT
ENTRYPOINT [\"tensorboard\", \"--logdir\", \"/tmp/tensorboard\", \"--bind_all\"]
""">Dockerfile_tfprofile

# Build the profiler image
docker build -t ${DOCKER_PROFILER_TAG} -f Dockerfile_tfprofile .

# Create docker compose file
echo """version: \"3.3\"
services:
  ${DOCKER_TF_SERVING_HOSTNAME}:
    image: tensorflow/serving:${TENSORFLOW_SERVING_VERSION}
    ports:
      - '8510:8500' # Whatever you need for client
      - '8511:8501' # Whatever you need for client
    environment:
      - MODEL_NAME=${TF_SERVING_MODEL_NAME}
    hostname: '${DOCKER_TF_SERVING_HOSTNAME}'
    volumes:
      - '<LOCAL_MODEL_DIR>:/models/${TF_SERVING_MODEL_NAME}/1'
      - '${TMP_TF_LOGDIR}:/tmp/tensorboard'
  profiler:
    image: ${DOCKER_PROFILER_TAG}
    ports:
      - '6006:6006'
    volumes:
      - '${TMP_TF_LOGDIR}:/tmp/tensorboard'
""">docker-compose.yaml

docker-compose up

# Profile url = ${DOCKER_TF_SERVING_HOSTNAME}:8500
# Set profile duration
# Make request within duration
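For the last step, a request through the mapped REST port while the capture is running could look like this (the request body is a placeholder for whatever your model's signature expects):

curl -X POST http://localhost:8511/v1/models/<YOUR_MODEL_NAME>:predict \
    -d '{"instances": [[1.0, 2.0]]}'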

kyle-jarvis avatar Nov 03 '22 10:11 kyle-jarvis

@ganand1,

Please refer to the above comment, which shows a working setup for getting profiling data from TF Serving into TensorBoard. Please let us know if this works.

Thank you!

singhniraj08 avatar Nov 07 '22 09:11 singhniraj08

Closing this due to inactivity. Please take a look at the answers provided above; feel free to reopen and post your comments if you still have queries on this. Thank you!

singhniraj08 avatar Jan 27 '23 06:01 singhniraj08