
Check Mobilenet V4 Large on iPhones

Open freedomtan opened this issue 1 year ago • 56 comments

Currently, I got (QPS):

| device | Mobilenet V4 Large | Mobilenet EdgeTPU |
|---|---|---|
| iPhone 13 | 220.11 | 617.78 |
| iPhone 14 Pro | 300.06 | 970.95 |
| iPhone 15 Pro Max | 332.95 | 1145.05 |

Roughly, > 300 qps for iPhone 13 should be possible.

https://github.com/mlcommons/mobile_app_open/pull/821#issuecomment-1976609360

freedomtan avatar Mar 10 '24 01:03 freedomtan

@freedomtan please share info on how to check the model accuracy for Mobilenet V4: what dataset do I need to use, and are there specific steps to set up the accuracy test on an iOS device? Thanks

RSMNYS avatar Mar 10 '24 20:03 RSMNYS

> @freedomtan please share info on how to check the model accuracy for Mobilenet V4: what dataset do I need to use, and are there specific steps to set up the accuracy test on an iOS device?

To validate the accuracy of image classification models, we use the full ImageNet 2012 validation dataset (50,000 images) from https://www.image-net.org/index.php.
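A top-1 accuracy check over that validation set can be sketched like this (the label format and names are illustrative assumptions, not the app's actual output):

```python
# Hypothetical sketch: compute top-1 accuracy given predicted and ground-truth
# class indices in the same order. A real run would use all 50,000 validation
# labels; the toy data below is just for illustration.

def top1_accuracy(predictions, ground_truth):
    """predictions / ground_truth: lists of integer class indices."""
    assert len(predictions) == len(ground_truth)
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return 100.0 * correct / len(predictions)

preds = [1, 2, 3, 4, 5]
truth = [1, 2, 0, 4, 5]
print(top1_accuracy(preds, truth))  # → 80.0
```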

freedomtan avatar Mar 11 '24 05:03 freedomtan

@freedomtan I've tried the accuracy test with the CoreML backend and the TF backend for the Image Classification tasks v1 and v2. In each case it crashes after reaching 100%: EXC_BAD_ACCESS (code=1, address=0x27c8) in compute accuracy. I'm going to check what the problem is.

RSMNYS avatar Mar 22 '24 19:03 RSMNYS

> @freedomtan I've tried the accuracy test with the CoreML backend and the TF backend for the Image Classification tasks v1 and v2. In each case it crashes after reaching 100%: EXC_BAD_ACCESS (code=1, address=0x27c8) in compute accuracy. I'm going to check what the problem is.

I've found that the validation results were expected in a different format than the one I had (only the category number, without the image name). I can run the accuracy test now, but it gives 0.05% accuracy, so it might again be a dataset issue. When I tried the dataset from our tests it gives 100%, but we only have 10 images there.
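A format mismatch like that can be normalized before scoring. A hedged sketch (both line formats here are assumptions based on the description above, not the app's actual files):

```python
# Hypothetical sketch: accept either ground-truth line format and return a
# flat list of integer category labels:
#   "ILSVRC2012_val_00000001.JPEG 65"   (image name + category)
#   "65"                                (category only)

def parse_labels(lines):
    labels = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        labels.append(int(parts[-1]))  # last field is the category number
    return labels

print(parse_labels(["ILSVRC2012_val_00000001.JPEG 65", "970", ""]))  # → [65, 970]
```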

(screenshot: IMG_9972)

RSMNYS avatar Mar 25 '24 20:03 RSMNYS

@RSMNYS I don't get it.

Is this the original Mobilenet EdgeTPU model we had, or the new V4 one? As far as I can remember, we checked that we got the expected accuracy number for the original one.

Please check:

  1. Run all the benchmark items with the TFLite + CoreML Delegate backend as the baseline (optional).
  2. Mobilenet EdgeTPU's accuracy numbers (including both non-offline and offline ones).
  3. Accuracy numbers of the other models: as far as I can remember, all models except MobileBERT should have good enough accuracy.

freedomtan avatar Mar 26 '24 03:03 freedomtan

FYR, on an iPhone 13, for Mobilenet EdgeTPU I got 76.21% running a binary built from the latest master branch.

freedomtan avatar Mar 26 '24 05:03 freedomtan

Thanks, it all works now. On iPhone 14 Pro I get the same 76.21%. Will try with ImageNet V2 and different optimised models based on it.

RSMNYS avatar Mar 26 '24 06:03 RSMNYS

All tests were done on iPhone 14 Pro

| Model Name | Performance (QPS) | Accuracy (%) | Size (MB) |
|---|---|---|---|
| MobilenetV4_Large.mlmodel | 268.58 | 81.82 | 130.1 |
| MobilenetV4_Large.mlpackage | 251.36 | 82.73 | 65.5 |
| MobilenetV4_Large.mlpackage (8 bit quantization) | 299.25 | 82.7 | 33.3 |
| MobilenetV4_Large.mlpackage (20% sparsity) | 258.39 | 82.26 | 56.6 |
| MobilenetV4_Large.mlpackage (30% sparsity) | 244.7 | 80.83 | 50.1 |
| MobilenetV4_Large.mlpackage (40% sparsity) | 261.22 | 74.15 | 43.6 |
| MobilenetV4_Large.mlpackage (30% sparsity, 8 bit quantization) | 299.4 | 80.83 | 50.1 |

Also, during testing I noticed a performance drop when the device is warm (after several runs): sometimes it drops from 300 to 200 qps. Please also check the screenshot, which shows the runs for MobilenetV4_Large.mlpackage (8 bit quantization) only; you can see how much the performance can differ. @freedomtan

(screenshot: IMG_0003)

Here is the link to models: https://github.com/RSMNYS/mobile_models/tree/main/v4_0/CoreML

RSMNYS avatar Apr 02 '24 14:04 RSMNYS

@RSMNYS thermal throttling is a well-known issue on cell phones. A typical way to get the numbers we want is to cool down the device before running a new test :-)

freedomtan avatar Apr 05 '24 00:04 freedomtan

Please try to do the first 3 items and ensure that there is no thermal throttling, e.g., cold start, wait for 5 minutes, then measure the performance numbers.
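That measurement protocol can be sketched as follows (`run_benchmark` is a hypothetical stand-in for driving one full benchmark run; the cooldown and run counts are the values suggested above):

```python
import statistics
import time

# Sketch of the anti-throttling protocol: cold start, wait between runs so the
# device cools down, then report the median QPS across runs.

def measure(run_benchmark, runs=3, cooldown_s=300):
    qps_results = []
    for i in range(runs):
        if i > 0:
            time.sleep(cooldown_s)  # let the device cool down (e.g. 5 minutes)
        qps_results.append(run_benchmark())
    return statistics.median(qps_results)

# Usage with a dummy run (a real run would invoke the app/backend):
print(measure(lambda: 296.9, runs=3, cooldown_s=0))  # → 296.9
```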

Note that currently we don't allow model pruning (sparsity above) for submission. If we want to allow that, we need to change our rules.

freedomtan avatar Apr 09 '24 05:04 freedomtan

All tests were done on iPhone 14 Pro

| Model Name | Performance (QPS) | Accuracy (%) | Size (MB) |
|---|---|---|---|
| MobilenetV4_Large.mlmodel | 294.85 | 81.2 | 124 |
| MobilenetV4_Large.mlpackage | 296.93 | 82.73 | 65.5 |
| MobilenetV4_Large.mlpackage (8 bit quantization) | 295.11 | 82.7 | 33.3 |

RSMNYS avatar Apr 12 '24 05:04 RSMNYS

> All tests were done on iPhone 14 Pro
>
> | Model Name | Performance (QPS) | Accuracy (%) | Size (MB) |
> |---|---|---|---|
> | MobilenetV4_Large.mlmodel | 294.85 | 81.2 | 124 |
> | MobilenetV4_Large.mlpackage | 296.93 | 82.73 | 65.5 |
> | MobilenetV4_Large.mlpackage (8 bit quantization) | 295.11 | 82.7 | 33.3 |

These numbers look reasonable now. But let's see if we can improve them further.

Let's check if @colbybanbury can comment on this.

freedomtan avatar Apr 16 '24 05:04 freedomtan

MobilenetV4 was made public last week; see https://arxiv.org/abs/2404.10518 or https://arxiv.org/html/2404.10518v1. According to the numbers in the paper, it should be possible to get > 300 qps on an iPhone 13.

freedomtan avatar Apr 23 '24 05:04 freedomtan

The V4 paper results use an iPhone 13 and fp16 quantization. The model was also derived from a PyTorch equivalent in order to be in (batch, channel, height, width) tensor format, which I measured to be slightly faster.

I recommend using fp16 on iPhones older than the 15 Pro, which is where Apple added int8-int8 compute.

Happy to help if needed!

colbybanbury avatar Apr 23 '24 16:04 colbybanbury

@RSMNYS From the paper https://arxiv.org/abs/2404.10518

> for benchmarks on the Apple Neural Engine (conducted on an iPhone 13 with iOS 16.6.1, CoreMLTools 7.1, and Xcode 15.0.1 for profiling), PyTorch models were converted to CoreML's MLProgram format in Float16 precision, with float16 MultiArray inputs to minimize input copying

anhappdev avatar Apr 24 '24 07:04 anhappdev

@freedomtan can you please point us to where we can get the MobileNet V4 PyTorch model? Currently we have only the TFLite one.

RSMNYS avatar Apr 25 '24 09:04 RSMNYS

The PyTorch model has yet to be officially released. Sorry for the delay!

The TensorFlow model should still get similar latency results, but let me know if I can help with anything.

colbybanbury avatar Apr 25 '24 13:04 colbybanbury

@freedomtan to try it on iPhone 13 again.

freedomtan avatar Apr 30 '24 05:04 freedomtan

> @freedomtan to try it on iPhone 13 again.

As I measured before, on an iPhone 13 it's about 220 qps.

freedomtan avatar May 06 '24 03:05 freedomtan

Let's try to have a PyTorch model (with weights from the TensorFlow model).

freedomtan avatar May 07 '24 05:05 freedomtan

@colbybanbury can you please tell us whether you used mlmodel or mlpackage CoreML models in your tests?

RSMNYS avatar May 14 '24 15:05 RSMNYS

I used MLPackage

colbybanbury avatar May 14 '24 16:05 colbybanbury

@RSMNYS With Xcode 16.0 beta and iOS 18 + MLPackage targeting iOS 15 or later, it's possible to get per-op time. Please check https://developer.apple.com/videos/play/wwdc2024/10161/?time=927

freedomtan avatar Jun 12 '24 07:06 freedomtan

Per-op profiling is actually possible on iOS 17.4+ / macOS 14.4+. I wrote a little command-line program and tested it on my MacBook Pro M1; see https://github.com/freedomtan/coreml_modelc_profling

freedomtan avatar Jun 13 '24 13:06 freedomtan

FWIW, there are still no official weights from the paper authors, but I've trained a number of PyTorch-native MobileNetV4 models and made them available in timm. The conv-medium runs quite nicely on CPU without much extra optimization. https://github.com/huggingface/pytorch-image-models?tab=readme-ov-file#june-12-2024

rwightman avatar Jun 13 '24 16:06 rwightman

> FWIW, there are still no official weights from the paper authors, but I've trained a number of PyTorch-native MobileNetV4 models and made them available in timm. The conv-medium runs quite nicely on CPU without much extra optimization. https://github.com/huggingface/pytorch-image-models?tab=readme-ov-file#june-12-2024

@rwightman: FYI, thanks to @colbybanbury, one of the co-authors of the paper, we do have a MobileNetV4-Conv-Large saved_model and tflites; see https://github.com/mlcommons/mobile_open/tree/main/vision/mobilenetV4

freedomtan avatar Jun 14 '24 01:06 freedomtan

@RSMNYS `pip install git+https://github.com/huggingface/pytorch-image-models.git`, then:

```python
import timm
import torch
import coremltools as ct

torch_model = timm.create_model("hf-hub:timm/mobilenetv4_conv_large.e600_r384_in1k", pretrained=True)
torch_model.eval()

# Trace the model with random data.
example_input = torch.rand(1, 3, 384, 384)
traced_model = torch.jit.trace(torch_model, example_input)
out = traced_model(example_input)

# Convert the traced model to a Core ML ML Program package.
model = ct.convert(
    traced_model,
    convert_to="mlprogram",
    inputs=[ct.TensorType(shape=example_input.shape)],
)

model.save("mobilenetv4.mlpackage")
```

This model takes around 3.10 ms (> 300 qps) on my iPhone 13, and 2.29 ms (436 qps) on an iPhone 14 Pro.

These match what @colbybanbury and others said in the paper. Please try to see if we can get the same performance with the TF saved_model.
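The latency-to-throughput arithmetic here is just the reciprocal; a quick check:

```python
# Single-stream throughput: QPS = 1000 / latency_ms.
def qps(latency_ms: float) -> float:
    return 1000.0 / latency_ms

print(round(qps(3.10), 1))  # iPhone 13 → 322.6 (> 300 qps)
print(round(qps(2.29), 1))  # iPhone 14 Pro → 436.7
```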

Thanks @rwightman

freedomtan avatar Jun 14 '24 02:06 freedomtan

@RSMNYS and @anhappdev According to the coremltools 8.0b1 doc on quantization, it's possible to create a calibrated, quantized A8W8 PTQ model from an existing Core ML model.

I used random data as the calibration data. Then I got:

unit: ms

| device | fp32 | quantized A8W8 |
|---|---|---|
| iPhone 13 | 3.10 | 2.23 |
| iPhone 14 Pro | 2.29 | 1.83 |
| iPhone 15 Pro | 2.24 | 1.38 |

Maybe we can use "real" calibration data to check if quantized int8 models could meet accuracy thresholds.
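Why calibration data matters can be shown with a toy per-tensor quantizer (a NumPy sketch of the idea, not the coremltools API): a scale calibrated on data that matches the real activation distribution gives lower quantization error than one calibrated on a mismatched (e.g. random, wrong-range) distribution.

```python
import numpy as np

# Illustrative sketch: linear symmetric 8-bit quantization, with the scale
# chosen from calibration activations. A mismatched calibration range yields
# a wrong scale and larger reconstruction error.

def calibrate_scale(activations, n_bits=8):
    max_abs = np.abs(activations).max()
    return max_abs / (2 ** (n_bits - 1) - 1)  # map [-max, max] onto [-127, 127]

def quantize_dequantize(x, scale):
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 10_000)  # stand-in for real activations

good_scale = calibrate_scale(real)                          # calibrated on real data
bad_scale = calibrate_scale(rng.uniform(-20, 20, 10_000))   # mismatched range

err_good = np.abs(real - quantize_dequantize(real, good_scale)).mean()
err_bad = np.abs(real - quantize_dequantize(real, bad_scale)).mean()
print(err_good < err_bad)  # → True
```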

freedomtan avatar Jun 14 '24 07:06 freedomtan

> Maybe we can use "real" calibration data to check if quantized int8 models could meet accuracy thresholds.

I will try to do that.

anhappdev avatar Jun 14 '24 09:06 anhappdev

@freedomtan Good to hear. For quantization, some weights quantize 'better' (less performance drop) than others; the training hparams have an impact. I'd be curious to know how the timm weights I've trained so far fare in that regard.

rwightman avatar Jun 14 '24 15:06 rwightman