Question about model inference optimization
📚 The doc issue
There is a typo: "A larger batch size means a higher throughput at the cost of lower latency."
The correct version should be: "A larger batch size means a higher throughput at the cost of higher latency."
I also have some more questions about model inference latency optimization. I'm currently reading:
* https://github.com/pytorch/serve/blob/master/docs/performance_guide.md#torchserve-on-cpu-
* https://github.com/pytorch/serve/blob/master/docs/performance_guide.md#torchserve-on-gpu
* https://github.com/pytorch/serve/blob/master/docs/configuration.md
* https://huggingface.co/docs/transformers/en/perf_torch_compile
I'm currently running inference for a SetFit model (https://huggingface.co/blog/setfit) on an ml.g4dn.xlarge instance on AWS (vCPUs: 4, memory: 16 GiB, memory per vCPU: 4 GiB, physical processor: Intel Xeon family, GPUs: 1, GPU architecture: NVIDIA T4 Tensor Core, video memory: 16 GiB).
One thing that helped was using `torch.compile` with `mode="reduce-overhead"`.
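For context, the pattern looks roughly like this (a minimal, self-contained sketch; the small off-the-shelf classifier below is only a stand-in for the SetFit body, and the model name and warmup batch are placeholders):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Small classifier standing in for the SetFit body; the compile + warmup
# pattern is the relevant part, not this particular model.
name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).to("cuda").eval()

# reduce-overhead targets small-batch latency (uses CUDA graphs on GPU)
model = torch.compile(model, mode="reduce-overhead")

inputs = tokenizer(["warmup sentence"] * 4, return_tensors="pt", padding=True).to("cuda")
with torch.inference_mode():
    for _ in range(3):  # the first iterations pay the compilation cost
        model(**inputs)
```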
I'm not sure how to set all these parameters to tune for low latency and high throughput:
* `min_worker` - (optional) the minimum number of worker processes. TorchServe will try to maintain this minimum for the specified model. The default value is `1`.
* `max_worker` - (optional) the maximum number of worker processes. TorchServe will make no more than this number of workers for the specified model. The default is the same as the setting for `min_worker`.
I also saw other settings:
* `number_of_netty_threads`: defines the number of threads that accept incoming HTTP requests from your client container.
* `job_queue_size`: defines the size of a model's job queue, which stores incoming HTTP requests.
* `default_workers_per_model`: defines the number of workers that fetch HTTP requests from a model's job queue.
* `netty_client_threads`: defines the number of threads a model's worker uses to receive HTTP responses from the worker backend internally in TorchServe.
I measured that a single model inference takes about 20 ms. I want a max latency of around 50 ms, so I set `max_batch_delay` to 30 ms and `max_batch_size` to 100 (which seems a bit high at the moment).
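My understanding is that these can be set per model in `config.properties`, roughly like this (a sketch only; the model name, mar name, and values are placeholders, with key names taken from the configuration doc linked above):

```
models={\
  "setfit": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "setfit.mar",\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 100,\
        "maxBatchDelay": 30,\
        "responseTimeout": 120\
    }\
  }\
}

number_of_netty_threads=4
job_queue_size=100
default_workers_per_model=1
```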
How should I set `min_worker` and `max_worker` - should they be set to the number of CPU cores? Should I also increase `default_workers_per_model`? Also, does BetterTransformer work with SetFit models as well?
I have not used a profiler yet - I'm just trying to understand all these settings first.
Suggest a potential alternative/fix
No response
Hi @geraldstanje, you will have to use the benchmarking tool as shown in this example: https://github.com/pytorch/serve/tree/master/examples/benchmarking/resnet50 - you can refer to the YAML file there to see the various options it runs the experiments for.
@agunapal can you run `torch.compile` in the handler's `initialize` function for TorchServe - any problems with that? E.g. here: https://github.com/pytorch/serve/blob/master/examples/Huggingface_Transformers/Transformer_handler_generalized.py#L34
Do you have an example somewhere?
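I.e. roughly this pattern inside `initialize` (a rough, untested sketch; the handler class name is made up, and it assumes the base handler has already loaded the eager model into `self.model`):

```python
import torch
from ts.torch_handler.base_handler import BaseHandler


class CompiledHandler(BaseHandler):
    """Hypothetical handler that compiles the model once at worker startup."""

    def initialize(self, ctx):
        # BaseHandler.initialize loads the eager model into self.model
        super().initialize(ctx)
        self.model.eval()
        # Compile once here; the first few predictions after startup still
        # pay the compilation cost until the compiled graph is warm.
        self.model = torch.compile(self.model, mode="reduce-overhead")
```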
Hi @geraldstanje, you can download this mar file. Here we use `torch.compile` with BERT: https://github.com/pytorch/serve/blob/master/benchmarks/models_config/bert_torch_compile_gpu.yaml#L24
@agunapal ok - I want to look at what's inside the .mar file - will I need https://github.com/pytorch/serve/blob/master/model-archiver/README.md ?
You can `wget` the mar file and then unzip it.
@agunapal ok, that worked - what does the `self.model.eval()` before the `torch.compile` in `initialize` of `TransformersSeqClassifierHandler` do?
What could be the reason that `torch.compile` doesn't complete immediately?
It seems `torch.compile` requires some warmup requests to run (not sure if that's specific to `mode="reduce-overhead"` only) - can you run this in `initialize` as well? Do you see any problems if the entire warmup takes longer than 30 sec?
Eval may not be needed.
`torch.compile`'s first iteration can take time, so you usually need to send a few (3-4) requests to warm up (see the sketch below).
You can also check how we address this with AOT compile. You can find the example under the pt2 examples directory.
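As a minimal standalone illustration of the warmup effect (a toy module, runs on CPU): the first call through a compiled module pays the compilation cost, later calls do not.

```python
import time

import torch
import torch.nn as nn

# Toy module standing in for the real model; the point is only to show
# that the first call through torch.compile is slow and later calls are fast.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 2))
model = torch.compile(model, mode="reduce-overhead")

x = torch.randn(8, 512)
with torch.inference_mode():
    for i in range(5):
        start = time.perf_counter()
        model(x)
        print(f"call {i}: {time.perf_counter() - start:.3f}s")
```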
@agunapal the problem seems to be that it's doing some lazy execution - I call `torch.compile` and it seems to stop there; only when I send a predict request does it actually run `torch.compile` ... How can I disable lazy execution?
Or how can I check whether lazy execution is causing this behavior?