support auto padding for tensorflow_backend

Open LinGeLin opened this issue 1 year ago • 2 comments

XLA is very fast, but because it compiles for fixed shapes, inputs must be padded to known shapes to handle the varying request sizes of an online service.

LinGeLin avatar Jul 23 '24 14:07 LinGeLin

Hi @LinGeLin,

Can you please provide more details on the use case, as well as an example model+client to reproduce the current lack of support and show the bottlenecks?

rmccorm4 avatar Jul 31 '24 22:07 rmccorm4

> Hi @LinGeLin,
>
> Can you please provide more details on the use case, as well as an example model+client to reproduce the current lack of support and show the bottlenecks?

For example, suppose max_batch_size is 1024. If XLA is enabled for the model service with TF_XLA_FLAGS="--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit", throughput can nearly double. However, because XLA compiles just-in-time, warmup becomes complicated: every input shape must be warmed up in advance, and if an online request arrives with a shape that was not covered during warmup, it triggers recompilation and times out immediately. Therefore, if the TensorFlow backend supported auto padding, it could pad requests with fluctuating batch sizes up to a warmed-up shape, which would be very helpful.
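The requested behavior could be approximated client-side today. Here is a minimal sketch of the bucketing idea: pad the batch dimension of a request up to the nearest pre-warmed size so a JIT-compiled model never sees a novel shape. The bucket sizes, function name, and pad value are illustrative assumptions, not an existing Triton or TensorFlow-backend API.

```python
import numpy as np

# Hypothetical warmup buckets: shapes compiled during model warmup.
WARMUP_BATCH_SIZES = [1, 8, 64, 256, 1024]

def pad_to_bucket(batch: np.ndarray) -> tuple[np.ndarray, int]:
    """Zero-pad the batch dimension up to the smallest warmed-up size.

    Returns the padded batch and the original batch size, so the caller
    can slice off the padding rows from the model's output afterwards.
    """
    n = batch.shape[0]
    target = next(s for s in WARMUP_BATCH_SIZES if s >= n)
    pad_widths = [(0, target - n)] + [(0, 0)] * (batch.ndim - 1)
    return np.pad(batch, pad_widths), n

# A request with 10 samples lands in the 64-sample bucket.
padded, real_n = pad_to_bucket(np.ones((10, 128), dtype=np.float32))
# padded.shape == (64, 128); only the first real_n rows are real requests.
```

Doing this inside the backend (as the issue requests) rather than in every client would centralize the bucket policy and keep the padding consistent with the model's warmup configuration.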

LinGeLin avatar Aug 20 '24 07:08 LinGeLin