
TensorFlow Serving batch inference is slow

sevenold opened this issue 5 years ago • 11 comments

Excuse me, how can I solve this slow-inference problem?

shape: (1, 32, 387, 1), data time: 0.005219221115112305, post time: 0.24771547317504883, end time: 0.2498164176940918
shape: (2, 32, 387, 1), data time: 0.0056378841400146484, post time: 0.4651315212249756, end time: 0.4693586826324463

docker run --runtime=nvidia -it --rm -p 8501:8501 \
  -v "$(pwd)/densenet_ctc:/models/docker_test" \
  -e MODEL_NAME=docker_test tensorflow/serving:latest-gpu \
  --tensorflow_intra_op_parallelism=8 \
  --tensorflow_inter_op_parallelism=8 \
  --enable_batching=true \
  --batching_parameters_file=/models/docker_test/batching_parameters.conf

num_batch_threads { value: 4 }
batch_timeout_micros { value: 2000 }
max_batch_size { value: 48 }
max_enqueued_batches { value: 48 }

GPU: 1080Ti. Thanks.

sevenold avatar Nov 08 '19 04:11 sevenold
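The client used to collect the data/post/end timings above is not shown; the sketch below is one plausible way to measure them against the REST endpoint mapped in the docker command, assuming a placeholder batch of shape (2, 32, 387, 1) with random data:

import json
import time

import numpy as np
import requests

# Placeholder batch matching the second shape reported above.
batch = np.random.rand(2, 32, 387, 1).astype(np.float32)

t0 = time.time()
payload = json.dumps({"instances": batch.tolist()})   # client-side "data time"
t1 = time.time()
resp = requests.post(
    "http://localhost:8501/v1/models/docker_test:predict",  # REST port mapped by -p 8501:8501
    data=payload,
)
t2 = time.time()
resp.raise_for_status()
print("data time:", t1 - t0, "post time:", t2 - t1, "end time:", t2 - t0)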

@sevenold, can you please let us know the GPU utilization during serving? The problem might be low GPU utilization.

Can you please try running the container with the parameters below and let us know if that resolves your issue? Thanks!

--grpc_channel_arguments=grpc.max_concurrent_streams=1000
--per_process_gpu_memory_fraction=0.7
--enable_batching=true
--max_batch_size=10
--batch_timeout_micros=1000
--max_enqueued_batches=1000
--num_batch_threads=6
--batching_parameters_file=/models/flow2_batching.config
--tensorflow_session_parallelism=2

For more information, please refer to #1440.

rmothukuru avatar Nov 08 '19 06:11 rmothukuru

@rmothukuru I tried running the container with the parameters below, but got the same result.


docker run --runtime=nvidia -it --rm -p 8501:8501 \
  -v "$(pwd)/densenet_ctc:/models/docker_test" \
  -e MODEL_NAME=docker_test tensorflow/serving:latest-gpu \
  --grpc_channel_arguments=grpc.max_concurrent_streams=1000 \
  --per_process_gpu_memory_fraction=0.7 \
  --enable_batching=true \
  --max_batch_size=128 \
  --batch_timeout_micros=1000 \
  --max_enqueued_batches=1000 \
  --num_batch_threads=8 \
  --batching_parameters_file=/models/docker_test/batching_parameters.conf \
  --tensorflow_session_parallelism=2


(Screenshot omitted.) GPU utilization is also low.


sevenold avatar Nov 08 '19 09:11 sevenold

@sevenold, can you please confirm that you have gone through issue #1440 and that the issue still persists? If so, can you please share your model so that we can reproduce the issue on our side? Thanks!

rmothukuru avatar Nov 08 '19 10:11 rmothukuru

@rmothukuru Thanks. Google Drive: this is my model and client.

sevenold avatar Nov 11 '19 01:11 sevenold

@rmothukuru I tested my other models, such as a verification-code (captcha) recognition model, with the same parameters, and GPU prediction works normally for them. Thanks!

sevenold avatar Nov 11 '19 02:11 sevenold

Maybe you can try the gRPC channel.

leo-XUKANG avatar Nov 25 '19 02:11 leo-XUKANG
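A minimal gRPC client sketch along those lines, assuming the gRPC port 8500 is also published (the docker commands above only map the REST port 8501), that the tensorflow-serving-api package is installed, and that the signature input is named "input" (a placeholder; the real name depends on the exported SavedModel):

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Assumes the gRPC port is published, e.g. `-p 8500:8500` added to docker run.
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

batch = np.random.rand(2, 32, 387, 1).astype(np.float32)

request = predict_pb2.PredictRequest()
request.model_spec.name = "docker_test"
request.model_spec.signature_name = "serving_default"
# "input" is a placeholder; check the real input name with saved_model_cli.
request.inputs["input"].CopyFrom(tf.make_tensor_proto(batch, shape=batch.shape))

result = stub.Predict(request, timeout=10.0)
print(list(result.outputs.keys()))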

> Maybe you can try the gRPC channel.

I tried, but got the same result.

sevenold avatar Nov 26 '19 04:11 sevenold

Same question. It seems like TF Serving runs predictions on images serially even when I post multiple images at once.

RainZhang1990 avatar Dec 10 '19 06:12 RainZhang1990

What happens when you load the model directly with TF? Do you get significantly better inference latency? If your TF runtime requires X time to do a forward pass on your model for a batch of examples, then X becomes a lower bound on your inference latency with TF Serving.

peddybeats avatar Jan 16 '20 00:01 peddybeats
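A minimal sketch of that kind of baseline measurement, loading the SavedModel directly in the TF runtime; the export path "densenet_ctc/1" and the signature input name "input" are placeholders:

import time

import numpy as np
import tensorflow as tf

# Placeholder export path and signature input name; adjust to the actual SavedModel.
model = tf.saved_model.load("densenet_ctc/1")
infer = model.signatures["serving_default"]

batch = tf.constant(np.random.rand(2, 32, 387, 1).astype(np.float32))

infer(input=batch)                      # warm-up
runs = 20
start = time.time()
for _ in range(runs):
    infer(input=batch)
print("mean forward pass:", (time.time() - start) / runs, "s")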

I found that serialization (of FP16 data) has significant overhead in the gRPC client API, and this heavily reduces QPS. In my case I transfer data of shape 3x224x244, and the serialization cost is twice the server-side processing time for the ResNet50 model.

ganler avatar Apr 02 '20 10:04 ganler
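A rough sketch for reproducing that kind of client-side serialization measurement, using a random FP16 array of roughly the shape mentioned above; absolute timings will of course vary by machine:

import time

import numpy as np
import tensorflow as tf

data = np.random.rand(1, 3, 224, 224).astype(np.float16)

start = time.time()
proto = tf.make_tensor_proto(data, shape=data.shape)   # numpy array -> TensorProto
payload = proto.SerializeToString()                    # bytes actually sent over gRPC
print("serialization:", time.time() - start, "s for", len(payload), "bytes")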

Is this issue solved? I'm having the same problem when serving an OpenNMT TensorFlow model. I have configured --rest_api_num_threads=1000 and --grpc_channel_arguments=grpc.max_concurrent_streams=1000, but somehow they just don't work: the TensorFlow server keeps reporting gRPC resource exhausted, and I can't send more than 15 requests from concurrent threads.

owenljn avatar Sep 15 '21 20:09 owenljn
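For reference, a minimal sketch of driving concurrent requests from a client with a thread pool against the REST endpoint; the model name and input shape are placeholders, and this does not by itself address the resource-exhausted errors:

import json
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import requests

URL = "http://localhost:8501/v1/models/docker_test:predict"   # placeholder model name
payload = json.dumps({"instances": np.random.rand(1, 32, 387, 1).tolist()})

def send(_):
    return requests.post(URL, data=payload).status_code

with ThreadPoolExecutor(max_workers=32) as pool:
    codes = list(pool.map(send, range(64)))
print(codes.count(200), "of", len(codes), "requests returned 200")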

@oohx,

Could you please provide some more information for us to debug this issue? We would like to understand how the same model with the same batched data performs in TensorFlow. Could you please share the latency of your model doing inference in the TF runtime and of the same model doing inference in TF Serving?

If your TF runtime requires X time to do a forward pass on your model for a batch of examples, then X becomes a lower bound on your inference latency with TF Serving. Also, please refer to the performance guide.

Thank you!

singhniraj08 avatar Feb 16 '23 07:02 singhniraj08

This issue was closed due to lack of activity after being marked stale for the past 14 days.

github-actions[bot] avatar Mar 16 '23 02:03 github-actions[bot]