serve
Preprocess performance
📚 The doc issue
The default image processing handler processes images one by one (https://github.com/pytorch/serve/blob/a4d5090e114cdbeddf5077a817a8cd02d129159e/ts/torch_handler/vision_handler.py#L38), and it does so synchronously. What is the best way to optimize this? Should I use a Pool here, or is there a better way?
With batch size = 1, preprocessing is faster but the GPU is underutilized because the batch is so small. With BS=128, preprocessing (resize and other steps) of 128 images is too slow and the whole pipeline becomes 2 times slower, although GPU utilization sometimes (when a batch is ready) goes up to 90%. As far as I understand, min-workers and max-workers set the number of processes that handle separate batches, so I can't parallelize preprocessing within a batch in the default configuration.
Suggest a potential alternative/fix
No response
Thank you for your feedback @BraginIvan, this is something that we're working to improve. We had a couple of prototype PRs like #1641 and #1545 to improve our story here, but if you have any requirements or thoughts please let me know. At a high level my thinking is either
- Integrate more preprocessing libraries like DALI, ffcv, accimage
- Instead of iterating over rows of data, just instantiate a tensor directly on GPU
The decision we take will depend on benchmarks, and there are a few quirks with each of these:
- DALI: requires changes to our backend; it bundles decoding, preprocessing optimizations and a data loader, so it's hard to pick just one
- ffcv: requires an offline batch data transform, so to me it seemed better suited for training than inference
- accimage: was much faster (see benchmarks in #1545), but it wasn't clear what the long-term maintenance plan of the project is
- Leverage more optimizations in `torch/vision`, which had some known issues: https://github.com/pytorch/vision/issues/3848
- Integrate optimizations directly in TS, or at the very least don't do things known to be slow like `torch.stack`
@BraginIvan We are investigating potential solutions for preprocessing multiple images in one video frame in parallel.
- Solution 1: preprocess the images in parallel outside of TS by using a pipeline.
- Solution 2: optimize the handler by using multiprocessing to parallelize preprocessing.
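
For reference, here is a minimal sketch of what parallel preprocessing inside a handler might look like. It is not the TorchServe implementation. It uses a thread pool from the standard library rather than `multiprocessing.Pool`, on the assumption that the per-image work releases the GIL (many image libraries do during decode and resize, but this should be measured). `preprocess_one` and `preprocess_batch` are hypothetical names, and the body of `preprocess_one` is a stand-in for the real decode, resize and normalize steps.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess_one(raw: bytes) -> list:
    # Hypothetical stand-in for decode + resize + normalize of one image.
    return [b / 255.0 for b in raw]


def preprocess_batch(requests, executor):
    # executor.map preserves input order, so outputs stay aligned with requests.
    return list(executor.map(preprocess_one, requests))


# Assumption: the pool is created once (e.g. when the handler is initialized)
# and reused for every batch, instead of being created per request.
executor = ThreadPoolExecutor(max_workers=4)

# Fake "image" payloads standing in for the request bodies of one batch.
requests = [bytes([i, i + 1, i + 2]) for i in range(8)]
batch = preprocess_batch(requests, executor)

executor.shutdown()
```

If the per-image work is CPU-bound pure Python, swapping in `concurrent.futures.ProcessPoolExecutor` keeps the same `map` interface, at the cost of pickling inputs and outputs between processes.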