
Preprocess performance

BraginIvan opened this issue 3 years ago

📚 The doc issue

In the default image processing handler you process images one by one (https://github.com/pytorch/serve/blob/a4d5090e114cdbeddf5077a817a8cd02d129159e/ts/torch_handler/vision_handler.py#L38), i.e. synchronously. What is the best way to optimize this? Should I use a Pool here, or is there a better way?
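One way to sketch the Pool idea is to replace the sequential per-image loop with a thread pool. The names `transform_one` and `preprocess_parallel` below are hypothetical, not part of TorchServe, and the transform is a dummy standing in for decode/resize/normalize; threads can give real overlap here because Pillow and much of torchvision release the GIL during image work:

```python
from concurrent.futures import ThreadPoolExecutor

def transform_one(image_bytes):
    # Placeholder for the real per-image work (decode, resize, normalize).
    # Pillow/torchvision release the GIL for much of that work, so
    # threads overlap it; a process Pool would avoid the GIL entirely
    # at the cost of pickling each image.
    return bytes(b ^ 0xFF for b in image_bytes)  # dummy transform

def preprocess_parallel(batch, max_workers=8):
    # Parallel replacement for the sequential loop in
    # vision_handler.preprocess: map the transform over the batch.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transform_one, batch))
```

Whether threads or processes win depends on how much of the transform holds the GIL, so it is worth benchmarking both.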

With batch size = 1, preprocessing is fast but the GPU is underutilized because the batch is so small. If I set BS=128, preprocessing (resize and other things) of 128 images is too slow and the whole pipeline becomes 2 times slower, although GPU utilization sometimes reaches 90% when a batch is ready. As far as I understand, min-workers and max-workers set the number of processes that handle separate batches, so in the default configuration I can't parallelize preprocessing within a batch.
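For context on the worker/batch knobs being described: each TorchServe worker is a separate handler process that receives its own batch, and batching per worker is controlled by `batchSize`/`maxBatchDelay`. A hedged, illustrative `config.properties` sketch (the model name `resnet` and the file name are made up):

```properties
# Illustrative snippet; keys follow TorchServe's model config format.
# minWorkers/maxWorkers scale whole handler processes (one batch each);
# batchSize/maxBatchDelay control how requests are aggregated per worker.
load_models=resnet.mar
models={\
  "resnet": {\
    "1.0": {\
        "marName": "resnet.mar",\
        "minWorkers": 2,\
        "maxWorkers": 4,\
        "batchSize": 32,\
        "maxBatchDelay": 50\
    }\
  }\
}
```

More workers trade GPU memory for concurrency across batches, but, as the issue notes, they do not parallelize the preprocessing loop inside a single batch.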

Suggest a potential alternative/fix

No response

BraginIvan avatar Jun 22 '22 15:06 BraginIvan

Thank you for your feedback @BraginIvan, this is something that we're working to improve. We had a couple of prototype PRs, like #1641 and #1545, to improve our story here, but if you have any requirements or thoughts please let me know. At a high level, my thinking is to either:

  1. Integrate more preprocessing libraries like DALI, ffcv, accimage
  2. Instead of iterating over rows of data, just instantiate a tensor directly on GPU
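The second option above can be sketched as follows. This is a minimal sketch, not TorchServe code: it assumes the images are already decoded into equally sized NumPy HWC uint8 arrays, stacks them into one contiguous host buffer, and does a single (optionally pinned, asynchronous) host-to-device copy instead of creating per-image tensors and stacking them on the GPU. The function name `batch_to_gpu` is hypothetical:

```python
import numpy as np
import torch

def batch_to_gpu(images, device=None):
    # `images`: list of equally sized HWC uint8 arrays, already decoded.
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    host = torch.from_numpy(np.stack(images))  # one contiguous NHWC uint8 block
    if device != "cpu":
        host = host.pin_memory()  # pinned memory enables an async copy
    batch = host.to(device, non_blocking=True)
    # NHWC uint8 -> NCHW float in [0, 1]; normalization runs on the device.
    return batch.permute(0, 3, 1, 2).float().div_(255.0)
```

The point is one bulk transfer and device-side normalization rather than a Python loop of small copies.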

Which direction we take will depend on benchmarks; there are a few quirks with each of these:

  1. DALI: requires changes to our backend; it bundles decoding, preprocessing optimizations, and a data loader, so it's hard to pick just one
  2. ffcv: requires an offline batch data transform, so to me it seemed better suited to training than inference
  3. accimage: was much faster (see benchmarks in #1545), but it wasn't clear what the long-term maintenance plan for the project is
  4. Leverage more optimizations in torch/vision, which has had some known issues: https://github.com/pytorch/vision/issues/3848
  5. Integrate optimizations directly in TS, or at the very least avoid operations known to be slow, like torch.stack
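On the last point, one way to restructure a collate step without `torch.stack` is to preallocate the batch tensor once and copy each image into its slot; the same `copy_` pattern can also target a preallocated GPU tensor. This is a sketch, not the TorchServe implementation, and whether it actually beats `torch.stack` depends on allocator behavior, so it should be benchmarked:

```python
import torch

def collate_preallocated(tensors):
    # torch.stack allocates a fresh batch tensor and copies every
    # element into it; doing the same explicitly lets the destination
    # be preallocated (and reused, or placed on the GPU) up front.
    n = len(tensors)
    out = torch.empty((n, *tensors[0].shape), dtype=tensors[0].dtype)
    for i, t in enumerate(tensors):
        out[i].copy_(t)
    return out
```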

msaroufim avatar Jun 22 '22 17:06 msaroufim

@BraginIvan We are investigating potential solutions for preprocessing the multiple images of one video frame in parallel.

  • Solution 1: preprocess multiple images in parallel outside TS, using a pipeline.
  • Solution 2: optimize the handler by using multiprocessing to parallelize preprocessing.
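The pipeline idea in solution 1 can be sketched with standard-library queues: a pool of preprocessing workers feeds a single inference consumer, so preprocessing of later batches overlaps with inference on earlier ones. All names here (`run_pipeline`, `transform`, `infer`) are hypothetical, and real TorchServe integration would differ:

```python
import queue
import threading

_SENTINEL = object()

def run_pipeline(batches, transform, infer, workers=4):
    # Two-stage pipeline: `workers` threads run `transform` on batches
    # while the main thread runs `infer`, so the GPU is not idle while
    # the next batch is being preprocessed.
    in_q, out_q = queue.Queue(), queue.Queue()

    def worker():
        while True:
            item = in_q.get()
            if item is _SENTINEL:
                break
            idx, batch = item
            out_q.put((idx, transform(batch)))  # tag with index to restore order

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for idx, batch in enumerate(batches):
        in_q.put((idx, batch))
    for _ in threads:
        in_q.put(_SENTINEL)

    results = {}
    for _ in range(len(batches)):
        idx, pre = out_q.get()
        results[idx] = infer(pre)  # inference overlaps remaining preprocessing
    for t in threads:
        t.join()
    return [results[i] for i in range(len(batches))]
```

Swapping the threads for processes (solution 2) keeps the same shape but avoids the GIL at the cost of serializing each batch across the process boundary.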

lxning avatar Jul 08 '22 18:07 lxning