
TensorFlow Serving batch inference is slow

sevenold opened this issue 5 years ago • 11 comments

Excuse me, how can I solve this slow-inference problem?

shape: (1, 32, 387, 1), data time: 0.005219221115112305, post time: 0.24771547317504883, end time: 0.2498164176940918
shape: (2, 32, 387, 1), data time: 0.0056378841400146484, post time: 0.4651315212249756, end time: 0.4693586826324463

docker run --runtime=nvidia -it --rm -p 8501:8501 \
  -v "$(pwd)/densenet_ctc:/models/docker_test" \
  -e MODEL_NAME=docker_test tensorflow/serving:latest-gpu \
  --tensorflow_intra_op_parallelism=8 \
  --tensorflow_inter_op_parallelism=8 \
  --enable_batching=true \
  --batching_parameters_file=/models/docker_test/batching_parameters.conf

num_batch_threads { value: 4 }
batch_timeout_micros { value: 2000 }
max_batch_size { value: 48 }
max_enqueued_batches { value: 48 }

GPU: 1080Ti. Thanks.

sevenold avatar Nov 08 '19 04:11 sevenold
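The client used to collect the data/post/end timings above is not shown; the sketch below is one plausible way to measure them against the REST endpoint mapped in the docker command, assuming a placeholder batch of shape (2, 32, 387, 1) with random data:

import json
import time

import numpy as np
import requests

# Placeholder batch matching the second shape reported above.
batch = np.random.rand(2, 32, 387, 1).astype(np.float32)

t0 = time.time()
payload = json.dumps({"instances": batch.tolist()})   # client-side "data time"
t1 = time.time()
resp = requests.post(
    "http://localhost:8501/v1/models/docker_test:predict",  # REST port mapped by -p 8501:8501
    data=payload,
)
t2 = time.time()
resp.raise_for_status()
print("data time:", t1 - t0, "post time:", t2 - t1, "end time:", t2 - t0)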

@sevenold, can you please let us know the GPU utilization during serving? The problem might be low GPU utilization.

Can you please try running the container with the parameters below and let us know if that resolves your issue? Thanks!

--grpc_channel_arguments=grpc.max_concurrent_streams=1000
--per_process_gpu_memory_fraction=0.7
--enable_batching=true
--max_batch_size=10
--batch_timeout_micros=1000
--max_enqueued_batches=1000
--num_batch_threads=6
--batching_parameters_file=/models/flow2_batching.config
--tensorflow_session_parallelism=2

For more information, please refer to #1440.

rmothukuru avatar Nov 08 '19 06:11 rmothukuru

@rmothukuru I tried running the container with the parameters below, but got the same result.


docker run --runtime=nvidia -it --rm -p 8501:8501 \
  -v "$(pwd)/densenet_ctc:/models/docker_test" \
  -e MODEL_NAME=docker_test tensorflow/serving:latest-gpu \
  --grpc_channel_arguments=grpc.max_concurrent_streams=1000 \
  --per_process_gpu_memory_fraction=0.7 \
  --enable_batching=true \
  --max_batch_size=128 \
  --batch_timeout_micros=1000 \
  --max_enqueued_batches=1000 \
  --num_batch_threads=8 \
  --batching_parameters_file=/models/docker_test/batching_parameters.conf \
  --tensorflow_session_parallelism=2


(Screenshot omitted.) GPU utilization is also low.


sevenold avatar Nov 08 '19 09:11 sevenold

@sevenold, can you please confirm that you have gone through issue #1440 and that the issue still persists? If so, can you please share your model so that we can reproduce the issue on our side? Thanks!

rmothukuru avatar Nov 08 '19 10:11 rmothukuru

@rmothukuru Thanks. Google Drive: this is my model and client.

sevenold avatar Nov 11 '19 01:11 sevenold

@rmothukuru I tested my other models, such as a verification-code (captcha) recognition model, with the same parameters, and GPU prediction works normally for them. Thanks!

sevenold avatar Nov 11 '19 02:11 sevenold

Maybe you can try the gRPC channel.

leo-XUKANG avatar Nov 25 '19 02:11 leo-XUKANG
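A minimal gRPC client sketch along those lines, assuming the gRPC port 8500 is also published (the docker commands above only map the REST port 8501), that the tensorflow-serving-api package is installed, and that the signature input is named "input" (a placeholder; the real name depends on the exported SavedModel):

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Assumes the gRPC port is published, e.g. `-p 8500:8500` added to docker run.
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

batch = np.random.rand(2, 32, 387, 1).astype(np.float32)

request = predict_pb2.PredictRequest()
request.model_spec.name = "docker_test"
request.model_spec.signature_name = "serving_default"
# "input" is a placeholder; check the real input name with saved_model_cli.
request.inputs["input"].CopyFrom(tf.make_tensor_proto(batch, shape=batch.shape))

result = stub.Predict(request, timeout=10.0)
print(list(result.outputs.keys()))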

> Maybe you can try the gRPC channel.

I tried, but got the same result.

sevenold avatar Nov 26 '19 04:11 sevenold

Same question. It seems like TF Serving runs predictions on images serially even when I post multiple images at once.

RainZhang1990 avatar Dec 10 '19 06:12 RainZhang1990

What happens when you load the model directly with TF? Do you get significantly better inference latency? If your TF runtime requires X time to do a forward pass on your model for a batch of examples, then X becomes a lower bound on your inference latency with TF Serving.

peddybeats avatar Jan 16 '20 00:01 peddybeats
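A minimal sketch of that kind of baseline measurement, loading the SavedModel directly in the TF runtime; the export path "densenet_ctc/1" and the signature input name "input" are placeholders:

import time

import numpy as np
import tensorflow as tf

# Placeholder export path and signature input name; adjust to the actual SavedModel.
model = tf.saved_model.load("densenet_ctc/1")
infer = model.signatures["serving_default"]

batch = tf.constant(np.random.rand(2, 32, 387, 1).astype(np.float32))

infer(input=batch)                      # warm-up
runs = 20
start = time.time()
for _ in range(runs):
    infer(input=batch)
print("mean forward pass:", (time.time() - start) / runs, "s")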

I found that serialization (of FP16 data) has significant overhead in the gRPC client API, and this heavily reduces QPS. In my case I transfer data of shape 3x224x244, and the serialization cost is twice the server-side processing time for the ResNet50 model.

ganler avatar Apr 02 '20 10:04 ganler
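A rough sketch for reproducing that kind of client-side serialization measurement, using a random FP16 array of roughly the shape mentioned above; absolute timings will of course vary by machine:

import time

import numpy as np
import tensorflow as tf

data = np.random.rand(1, 3, 224, 224).astype(np.float16)

start = time.time()
proto = tf.make_tensor_proto(data, shape=data.shape)   # numpy array -> TensorProto
payload = proto.SerializeToString()                    # bytes actually sent over gRPC
print("serialization:", time.time() - start, "s for", len(payload), "bytes")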

Is this issue solved? I'm having the same problem when serving an OpenNMT TensorFlow model. I have configured --rest_api_num_threads=1000 and --grpc_channel_arguments=grpc.max_concurrent_streams=1000, but somehow they just don't work: the TensorFlow server keeps reporting gRPC resource exhausted, and I can't send more than 15 requests from concurrent threads.

owenljn avatar Sep 15 '21 20:09 owenljn
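For reference, a minimal sketch of driving concurrent requests from a client with a thread pool against the REST endpoint; the model name and input shape are placeholders, and this does not by itself address the resource-exhausted errors:

import json
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import requests

URL = "http://localhost:8501/v1/models/docker_test:predict"   # placeholder model name
payload = json.dumps({"instances": np.random.rand(1, 32, 387, 1).tolist()})

def send(_):
    return requests.post(URL, data=payload).status_code

with ThreadPoolExecutor(max_workers=32) as pool:
    codes = list(pool.map(send, range(64)))
print(codes.count(200), "of", len(codes), "requests returned 200")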

@oohx,

Could you please provide some more information for us to debug this issue? We would like to understand how the same model with the same batched data performs in TensorFlow. Could you please share the latency of your model doing inference in the TF runtime and of the same model doing inference in TF Serving?

If your TF runtime requires X time to do a forward pass on your model for a batch of examples, then X becomes a lower bound on your inference latency with TF Serving. Also, please refer to the performance guide.

Thank you!

singhniraj08 avatar Feb 16 '23 07:02 singhniraj08

This issue was closed due to lack of activity after being marked stale for the past 14 days.

github-actions[bot] avatar Mar 16 '23 02:03 github-actions[bot]