Support auto padding for tensorflow_backend
XLA is very fast, but because it compiles for fixed shapes, online service requests with varying batch sizes must be padded to a known shape before they can be served.
Hi @LinGeLin,
Can you please provide more details on the use case, as well as an example model+client to reproduce the current lack of support and show the bottlenecks?
For example, suppose max_batch_size is 1024. If XLA is enabled for the model service with TF_XLA_FLAGS="--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit", throughput can nearly double. However, because XLA is JIT-compiled per shape, warmup becomes complicated: if an online request does not hit a shape covered by warmup, it triggers recompilation and times out immediately. Therefore, if tensorflow_backend supported auto padding, padding requests with fluctuating batch sizes up to a warmup shape would be very helpful.
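To illustrate the idea, here is a minimal sketch of what such auto padding could look like on the batch dimension. The bucket sizes and helper name are assumptions chosen for illustration, not part of tensorflow_backend:

```python
import numpy as np

# Hypothetical warmup batch sizes; the largest must cover max_batch_size.
WARMUP_BATCH_SIZES = (1, 8, 64, 256, 1024)

def pad_to_warmup_shape(batch):
    """Pad a batch (along axis 0) up to the next warmup size so an
    XLA-compiled model only ever sees shapes it was compiled for.
    Returns the padded batch and the original batch size, so the
    padding rows can be dropped from the outputs afterwards."""
    n = batch.shape[0]
    target = next(s for s in WARMUP_BATCH_SIZES if s >= n)
    pad_widths = [(0, target - n)] + [(0, 0)] * (batch.ndim - 1)
    return np.pad(batch, pad_widths), n

# Usage: a request of 37 rows is padded to the 64-row bucket;
# only outputs[:orig] are the real results.
padded, orig = pad_to_warmup_shape(np.ones((37, 16), dtype=np.float32))
```

With bucketing like this, only a handful of shapes ever reach XLA, so warmup can enumerate them exhaustively and no online request triggers a fresh compilation.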