Question about model inference optimization
📚 The doc issue
There is a typo: "A larger batch size means a higher throughput at the cost of lower latency."
The correct version should be: "A larger batch size means a higher throughput at the cost of higher latency."
I also have some more questions about model inference latency optimization. I'm currently reading:
* https://github.com/pytorch/serve/blob/master/docs/performance_guide.md#torchserve-on-cpu-
* https://github.com/pytorch/serve/blob/master/docs/performance_guide.md#torchserve-on-gpu
* https://github.com/pytorch/serve/blob/master/docs/configuration.md
* https://huggingface.co/docs/transformers/en/perf_torch_compile
I'm currently running inference for a SetFit model (https://huggingface.co/blog/setfit) on an ml.g4dn.xlarge instance on AWS (vCPUs: 4, memory: 16 GiB, memory per vCPU: 4 GiB, physical processor: Intel Xeon family, GPUs: 1, GPU architecture: NVIDIA T4 Tensor Core, video memory: 16 GiB).
One thing that helped was using `torch.compile` with `mode="reduce-overhead"`.
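For context, the pattern looks roughly like this (a minimal, self-contained sketch; the small off-the-shelf classifier below is only a stand-in for the SetFit body, and the model name and warmup batch are placeholders):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Small classifier standing in for the SetFit body; the compile + warmup
# pattern is the relevant part, not this particular model.
name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).to("cuda").eval()

# reduce-overhead targets small-batch latency (uses CUDA graphs on GPU)
model = torch.compile(model, mode="reduce-overhead")

inputs = tokenizer(["warmup sentence"] * 4, return_tensors="pt", padding=True).to("cuda")
with torch.inference_mode():
    for _ in range(3):  # the first iterations pay the compilation cost
        model(**inputs)
```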
I'm not sure how to set all these parameters to tune for low latency and high throughput:
* `min_worker` - (optional) the minimum number of worker processes. TorchServe will try to maintain this minimum for the specified model. The default value is `1`.
* `max_worker` - (optional) the maximum number of worker processes. TorchServe will make no more than this number of workers for the specified model. The default is the same as the setting for `min_worker`.
I also saw other settings:
* `number_of_netty_threads`: defines the number of threads that accept incoming HTTP requests from your client container.
* `job_queue_size`: defines the size of a model's job queue, which stores incoming HTTP requests.
* `default_workers_per_model`: defines the number of workers that fetch HTTP requests from a model's job queue.
* `netty_client_threads`: defines the number of threads a model's worker uses to receive HTTP responses from the worker backend internally in TorchServe.
I measured that a single model inference takes about 20 ms. I want a max latency of around 50 ms, so I set `max_batch_delay` to 30 ms and `max_batch_size` to 100 (which seems a bit high at the moment).
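My understanding is that these can be set per model in `config.properties`, roughly like this (a sketch only; the model name, mar name, and values are placeholders, with key names taken from the configuration doc linked above):

```
models={\
  "setfit": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "setfit.mar",\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 100,\
        "maxBatchDelay": 30,\
        "responseTimeout": 120\
    }\
  }\
}

number_of_netty_threads=4
job_queue_size=100
default_workers_per_model=1
```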
How should I set `min_worker` and `max_worker` - should they be set to the number of CPU cores? Should I also increase `default_workers_per_model`? Also, does BetterTransformer work with SetFit models as well?
I have not used a profiler yet - I'm just trying to understand all these settings first.
Suggest a potential alternative/fix
No response
Hi @geraldstanje, you will have to use the benchmarking tool as shown in this example: https://github.com/pytorch/serve/tree/master/examples/benchmarking/resnet50 - you can refer to the YAML file there to see the various options it runs the experiments for.
@agunapal can you run `torch.compile` in the handler's `initialize` function for TorchServe - any problems with that? E.g. here: https://github.com/pytorch/serve/blob/master/examples/Huggingface_Transformers/Transformer_handler_generalized.py#L34
Do you have an example somewhere?
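I.e. roughly this pattern inside `initialize` (a rough, untested sketch; the handler class name is made up, and it assumes the base handler has already loaded the eager model into `self.model`):

```python
import torch
from ts.torch_handler.base_handler import BaseHandler


class CompiledHandler(BaseHandler):
    """Hypothetical handler that compiles the model once at worker startup."""

    def initialize(self, ctx):
        # BaseHandler.initialize loads the eager model into self.model
        super().initialize(ctx)
        self.model.eval()
        # Compile once here; the first few predictions after startup still
        # pay the compilation cost until the compiled graph is warm.
        self.model = torch.compile(self.model, mode="reduce-overhead")
```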
Hi @geraldstanje, you can download this mar file. Here we use `torch.compile` with BERT: https://github.com/pytorch/serve/blob/master/benchmarks/models_config/bert_torch_compile_gpu.yaml#L24
@agunapal ok - I want to look at what's inside the .mar file - will I need https://github.com/pytorch/serve/blob/master/model-archiver/README.md ?
You can `wget` the mar file and then unzip it.
@agunapal ok, that worked - what does the `self.model.eval()` before the `torch.compile` in `initialize` of `TransformersSeqClassifierHandler` do?
What could be the reason that `torch.compile` doesn't complete immediately?
It seems `torch.compile` requires some warmup requests to run (not sure if that's specific to `mode="reduce-overhead"` only) - can you run this in `initialize` as well? Do you see any problems if the entire warmup takes longer than 30 sec?
Eval may not be needed.
`torch.compile`'s first iteration can take time, so you usually need to send a few (3-4) requests to warm up (see the sketch below).
You can also check how we address this with AOT compile. You can find the example under the pt2 examples directory.
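As a minimal standalone illustration of the warmup effect (a toy module, runs on CPU): the first call through a compiled module pays the compilation cost, later calls do not.

```python
import time

import torch
import torch.nn as nn

# Toy module standing in for the real model; the point is only to show
# that the first call through torch.compile is slow and later calls are fast.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 2))
model = torch.compile(model, mode="reduce-overhead")

x = torch.randn(8, 512)
with torch.inference_mode():
    for i in range(5):
        start = time.perf_counter()
        model(x)
        print(f"call {i}: {time.perf_counter() - start:.3f}s")
```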
@agunapal the problem seems to be that it's doing some lazy execution - I call `torch.compile` and it seems to stop there; only when I send a predict request does it actually run `torch.compile` ... How can I disable lazy execution?
Or how can I check whether lazy execution is causing this behavior?