
Inconsistent Model deploy & run performance

Open davpapp opened this issue 3 years ago • 8 comments

Description

Hi! I'm deploying a YOLOv5 object detector model in Java via DJL. My code for deploying is very similar to the code provided by this guide: https://docs.djl.ai/jupyter/load_pytorch_model.html.

I'm deploying the same model via the same code across two different machines. The machines are spec-ed as follows:

  1. MacBook Pro 15" 2019: i9-9980HK, 32GB RAM, AMD 560 graphics card, plus integrated Intel graphics.
  2. Fedora PC: i5-12600K (roughly 30% faster than the i9), 16GB RAM, no discrete graphics card and no integrated graphics.

Running the model takes 25 ms on the MacBook Pro and 800 ms on the i5, roughly a 30x difference.

If I understand correctly, the AMD graphics card doesn't support CUDA, so DJL doesn't use it. I also confirmed that GPU usage doesn't spike when I run my script, so both machines are running on CPU. I don't understand how the faster i5 can be 30x slower. Is it possible that the lack of Intel integrated graphics is slowing down the i5?

Thank you!

Expected Behavior

Model run time should be largely consistent

Error Message

n/a

How to Reproduce?

// Preprocessing pipeline: resize the input frame and convert it to a tensor
this.pipeline.add(new Resize(OBJECT_DETECTION_FRAME_DIMENSION));
this.pipeline.add(new ToTensor());

// Translator that maps raw YOLOv5 output to DetectedObjects
this.translator = YoloV5Translator.builder()
        .setPipeline(this.pipeline)
        .optSynset(this.LABELS)
        .build();

// Criteria describing the local model to load
this.criteria = Criteria.builder()
        .setTypes(Image.class, DetectedObjects.class)
        .optModelPath(Paths.get(MODEL_DIRECTORY))
        .optProgress(new ProgressBar())
        .optTranslator(this.translator)
        .build();

this.model = criteria.loadModel();
this.predictor = this.model.newPredictor();
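
Detection itself is then a single predict call, roughly like this (a minimal sketch; the image path is a placeholder and exception handling is omitted):

// Load a test image (path is a placeholder) and run one detection
Image img = ImageFactory.getInstance().fromFile(Paths.get("test.jpg"));

long start = System.nanoTime();
DetectedObjects detections = this.predictor.predict(img);
System.out.println("Detection took " + (System.nanoTime() - start) / 1_000_000 + " ms");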

Steps to reproduce

n/a

What have you tried to solve it?

n/a

Environment Info

davpapp avatar Jan 12 '22 05:01 davpapp

@davpapp Can you try djl-bench to benchmark your model?

djl-bench -e PyTorch -p /home/ubuntu/models/pytorch/yolo5/yolo5.pt -s 1,3,224,224

frankfliu avatar Jan 12 '22 05:01 frankfliu

[INFO ] - Number of inter-op threads is 8
[INFO ] - Number of intra-op threads is 16
[INFO ] - Load PyTorch (1.9.1) in 0.003 ms.
[INFO ] - Running Benchmark on: cpu().
[WARN ] - Simple repository pointing to a non-archive file.
Loading: 100% |████████████████████████████████████████|
[INFO ] - Model mymodel loaded in: 276.097 ms.
[INFO ] - Inference result: [5.824645, 6.2446613, 14.122648 ...]
[INFO ] - Throughput: 1.66, completed 1 iteration in 602 ms.
[INFO ] - Model loading time: 276.097 ms.

For comparison, I also ran the benchmark on the YOLOv5 nano model, which is supposed to take <40ms on CPU. Even this performed way slower than expected:

[INFO ] - Number of inter-op threads is 8
[INFO ] - Number of intra-op threads is 16
[INFO ] - Load PyTorch (1.9.1) in 0.008 ms.
[INFO ] - Running Benchmark on: cpu().
[WARN ] - Simple repository pointing to a non-archive file.
Loading: 100% |████████████████████████████████████████|
[INFO ] - Model yolov5n_exported loaded in: 189.215 ms.
[INFO ] - Inference result: [4.559452, 5.84641, 8.824781 ...]
[INFO ] - Throughput: 8.55, completed 1 iteration in 117 ms.
[INFO ] - Model loading time: 189.215 ms.

davpapp avatar Jan 12 '22 05:01 davpapp

You might want to run more iterations:

djl-bench -e PyTorch -p /home/ubuntu/models/pytorch/yolo5/yolo5.pt -s 1,3,224,224 -c 500

But it looks like your model's latency is around 600 ms on CPU. PyTorch doesn't use MKL by default, which might be why it's slow on Linux. PyTorch on CUDA should be a lot faster. If you get this number with djl-bench, it's most likely what you can get with the PyTorch engine. You can try enabling mkldnn, but I'm not sure if that works for your model:

        System.setProperty("ai.djl.pytorch.use_mkldnn", "true");
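
A minimal sketch of where that property needs to go; the assumption is it must be set before the PyTorch engine is first touched, otherwise it has no effect:

public static void main(String[] args) throws Exception {
    // Set before any DJL/PyTorch call, otherwise the engine is already initialized
    System.setProperty("ai.djl.pytorch.use_mkldnn", "true");
    // ... then build the pipeline, translator, and criteria as above ...
}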

frankfliu avatar Jan 12 '22 06:01 frankfliu

Thanks for the help!

davpapp avatar Jan 12 '22 16:01 davpapp

I ended up installing updated media codecs from RPM Fusion (including x264) on my desktop running Fedora, and things are finally working at the speeds I was expecting:

[INFO ] - Inference result: [5.824645, 6.2446613, 14.122648 ...]
[INFO ] - Throughput: 195.62, completed 10000 iteration in 51120 ms.
[INFO ] - Model loading time: 72.124 ms.
[INFO ] - total P50: 4.772 ms, P90: 5.973 ms, P99: 7.846 ms
[INFO ] - inference P50: 4.669 ms, P90: 5.806 ms, P99: 7.726 ms
[INFO ] - preprocess P50: 0.025 ms, P90: 0.056 ms, P99: 0.078 ms
[INFO ] - postprocess P50: 0.071 ms, P90: 0.113 ms, P99: 0.169 ms

davpapp avatar Jan 13 '22 04:01 davpapp

Actually, this is even more interesting than I thought. Even though the djl-bench profile completes 10k iterations in ~50 seconds (about 5 ms per iteration), object detection still takes 800 ms in my code when I run the 224x224 detector. However, with the 320x320 variant of my model, djl-bench takes 80 seconds for 10k iterations, yet object detection via DJL in Java takes only 200 ms. Is it possible that image resizing is much more computationally expensive for the 224 model than for the 320 model?
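
One way I might check where the time goes is DJL's per-predictor metrics, which break each predict call into preprocess/inference/postprocess, matching what djl-bench prints above (a sketch; the img variable and iteration count are placeholders):

// Requires: import ai.djl.metric.Metrics;
Metrics metrics = new Metrics();
predictor.setMetrics(metrics);

predictor.predict(img); // warm-up
for (int i = 0; i < 100; i++) {
    predictor.predict(img);
}

// DJL records "Preprocess", "Inference", and "Postprocess" per call
System.out.println(metrics.percentile("Preprocess", 50));
System.out.println(metrics.percentile("Inference", 50));
System.out.println(metrics.percentile("Postprocess", 50));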

davpapp avatar Jan 13 '22 05:01 davpapp

@davpapp Regarding multi-threading, it's possible that single-threaded is more efficient than multi-threaded (it really depends on your model):

  1. PyTorch uses OMP to run inference (see the inter-op/intra-op thread counts in the log), meaning a single inference pass uses all CPU cores, so single-inference latency is low.
  2. When running multi-threaded, djl-bench disables OMP, so each thread uses only one CPU core (you can change the OMP settings manually; see the sketch after this list).
  3. If both cases can max out CPU usage, the total throughput would be about the same.
  4. The final throughput really depends on which threading model is more efficient.
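
As a sketch of adjusting those settings, assuming the ai.djl.pytorch.num_interop_threads / ai.djl.pytorch.num_threads system properties from the PyTorch engine docs (set before the engine loads):

// Assumed properties from the DJL PyTorch engine docs; set before engine init
System.setProperty("ai.djl.pytorch.num_interop_threads", "1");
System.setProperty("ai.djl.pytorch.num_threads", "4");
// Alternatively, the standard OMP_NUM_THREADS environment variable
// controls the intra-op thread count.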

frankfliu avatar Jan 13 '22 16:01 frankfliu

@davpapp For the image processing performance, you can use the DJL OpenCV extension. JPEG decoding with OpenCV is much faster than Java ImageIO. See: https://github.com/deepjavalibrary/djl/tree/master/extensions/opencv
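
If I recall correctly, the extension only needs to be on the classpath; ImageFactory should then pick up the OpenCV-backed implementation automatically (a sketch; the version is a placeholder):

// Gradle (version is a placeholder):
//   implementation "ai.djl.opencv:opencv:<version>"

// No code change needed: with the extension on the classpath,
// ImageFactory.getInstance() returns the OpenCV-backed factory
Image img = ImageFactory.getInstance().fromFile(Paths.get("test.jpg"));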

frankfliu avatar Jan 13 '22 16:01 frankfliu

Feel free to reopen this issue if you still have questions.

frankfliu avatar Dec 28 '22 17:12 frankfliu