
Benchmark quantization

Open wconstab opened this issue 4 years ago • 18 comments

Add benchmarks for quantized models.

This might be implemented as a new 'flavor' of test_eval, where most models raise NotImplementedError and adding quantization is strictly opt-in for particular models.

@jamesr66a can you add any specifics around which models you'd like to quantize and what the minimal number of quantized models is that would be useful to enable?

wconstab avatar Feb 23 '21 22:02 wconstab

> @jamesr66a can you add any specifics around which models you'd like to quantize and what the minimal number of quantized models is that would be useful to enable?

As many as possible. Quantization is a canonical technique for performance optimization, and a large fraction of Facebook production runs quantized models.

jamesr66a avatar Feb 24 '21 00:02 jamesr66a

Is there a formulaic way to apply quantization to a bunch of models? Or do we need expert help to make sure it's done correctly per model?

wconstab avatar Feb 24 '21 18:02 wconstab

@wconstab since it changes the semantics of the model, it usually requires human intervention (though there are some tools that attempt to do auto-tuning)

I think it would make sense to have some canonical quantized TorchVision pre-trained models in the benchmark suite. I'll bet there are some of those lying around somewhere. Maybe @vkuzo @jerryzh168 could point to those

jamesr66a avatar Feb 24 '21 21:02 jamesr66a

What's the context here: which benchmark suite is this, and what is the goal? In general, more benchmarking of quantized models would be really valuable; let me know how our team can help make that happen.

For something easy, you can check out the pre-trained quantized models in TorchVision: https://github.com/pytorch/vision/tree/master/torchvision/models/quantization
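For reference, a minimal sketch of how one of those pretrained quantized models could be loaded and timed for CPU inference. The `pretrained=`/`quantize=` flags reflect the torchvision API of that era (newer releases use `weights=`), and the timing loop is purely illustrative:

```python
# Minimal sketch: load a pretrained, already-quantized TorchVision model and
# time eager CPU inference. Assumes a torchvision version that still accepts
# the pretrained=/quantize= flags (newer releases use weights= instead).
import time
import torch
import torchvision

model = torchvision.models.quantization.mobilenet_v2(pretrained=True, quantize=True)
model.eval()

example = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    for _ in range(10):            # warmup
        model(example)
    start = time.perf_counter()
    for _ in range(100):
        model(example)
    print(f"avg latency: {(time.perf_counter() - start) / 100 * 1e3:.2f} ms")
```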

vkuzo avatar Feb 25 '21 16:02 vkuzo

@vkuzo We currently run a bunch of open-source models in a harness that lets us track cpu/gpu train/eval performance and do several things with the data. One is to identify regressions via CI/bisect infrastructure. Another is that perf engineers collect various stats and come up with plans or validate prototypes. Here is a dashboard.

We'd like to include quantization. If you're willing to do some of the legwork, mainly making correct changes (or reviewing PRs) to quantize existing models in the suite or include new ones, we can take it the rest of the way (tying into CI infra/including in the 'score' we compute, dashboards, etc.) - and hopefully it'll also motivate folks on the perf teams to make quantized models faster and keep regressions out.

If you can at least identify a set of existing models in our suite that would be easy to quantize, or suggest why we shouldn't do that and instead should add new models, that would help. And then if you're willing to, I can show how to make the change in the suite.

wconstab avatar Feb 25 '21 17:02 wconstab

Sounds great. I'd recommend starting with MobileNetV2 as something easy and which we already benchmark internally, so we can compare data. We can use FX graph mode quant on that model since it should be symbolically traceable. Let me see if anyone on quantization side would like to help.
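For a rough picture of what FX graph mode post-training quantization on MobileNetV2 looks like, here is a hedged sketch. It uses the current `torch.ao.quantization.quantize_fx` entry points, whose exact signatures have shifted across releases, so treat it as illustrative rather than the code that eventually landed:

```python
# Sketch of FX graph mode post-training static quantization on MobileNetV2.
# Assumes a PyTorch version where prepare_fx/convert_fx take example_inputs,
# and a torchvision version that accepts pretrained=.
import torch
import torchvision
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

float_model = torchvision.models.mobilenet_v2(pretrained=True).eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

qconfig_mapping = get_default_qconfig_mapping("fbgemm")  # x86 server backend
prepared = prepare_fx(float_model, qconfig_mapping, example_inputs)

# Calibration: run representative data through the observed model.
with torch.no_grad():
    for _ in range(32):
        prepared(torch.randn(1, 3, 224, 224))

quantized = convert_fx(prepared)  # int8 model, ready to benchmark for inference
```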

vkuzo avatar Feb 26 '21 15:02 vkuzo

cc @HDCharles who is interested in testing this out for mobilenetv2

vkuzo avatar Feb 26 '21 21:02 vkuzo

Yeah, I will get started on this

HDCharles avatar Mar 01 '21 16:03 HDCharles

@wconstab for the quantization benchmarking I'm wondering what a desirable 'scope' would be. The most natural type of quantization to benchmark is QAT, which has both training and evaluation. Then there's static quantization, which doesn't have training, though it does have a post-training calibration process that serves a similar purpose. i.e. if you have a pretrained model, you could either fine-tune it with QAT or run calibration, depending on the context. So I'm not sure how rigid the definition of training should be and whether the calibration benchmark makes sense to add here.
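To make the two workflows being contrasted concrete, here is a rough FX-mode sketch. The API names assume a recent `torch.ao.quantization` (which postdates parts of this thread), and `train_step`/`train_loader` are hypothetical placeholders for the caller's training loop:

```python
# Rough sketch contrasting the two preparation paths discussed above.
# Signatures vary by PyTorch release; train_step/train_loader are placeholders.
import torch
from torch.ao.quantization import (get_default_qat_qconfig_mapping,
                                   get_default_qconfig_mapping)
from torch.ao.quantization.quantize_fx import prepare_fx, prepare_qat_fx, convert_fx

def ptq_path(model, calibration_loader, example_inputs):
    # Static quantization: observe activations on calibration data, then convert.
    prepared = prepare_fx(model.eval(), get_default_qconfig_mapping("fbgemm"), example_inputs)
    with torch.no_grad():
        for x, _ in calibration_loader:
            prepared(x)
    return convert_fx(prepared)

def qat_path(model, train_step, train_loader, example_inputs):
    # QAT: insert fake-quant modules, train (or fine-tune) through them, then convert.
    prepared = prepare_qat_fx(model.train(), get_default_qat_qconfig_mapping("fbgemm"), example_inputs)
    for batch in train_loader:
        train_step(prepared, batch)   # ordinary training loop; gradients flow through fake-quant
    return convert_fx(prepared.eval())
```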

HDCharles avatar Mar 09 '21 23:03 HDCharles

@HDCharles , just so there is nothing blocked, I'd recommend sticking to training and inference only (no calibration) in the first PR. Then, if we decide that calibration is OK to add, we can add it in a future PR.

vkuzo avatar Mar 10 '21 16:03 vkuzo

I see - so the calibration step is a post-processing alternative to QAT, and QAT is full training (or is it sometimes just fine-tuning of a model that was pre-trained without QAT)?

Note that in our suite we focus on breadth of models and don't measure convergence properties, so the focus is purely on model runtime.

As far as the design for integrating these new features into torchbench, i can think of 2 ways to approach it.

  1. treat quantized models as separate entities from the current models. You could define a new top-level class TestBenchNetwork which implements whatever you want for this, and then have that test class parameterized only to run over the newly added models. You'd have to do more work here to modify the way models are iterated over, though, and I'd want to have a couple others review that since it impacts various workflows besides test_bench.py.

  2. mix the quantized models in with the rest of the models and make quantization a new 'mode'. You'd have to modify the test parametrize function to emit a new combination of quant/no-quant modes, and every model would have to accept that new kwarg. The most annoying part of this is that you'd have to add a 'raise NotImplementedError()' to all the non-quantized models when they are invoked in quant mode. Raising this error causes pytest to skip that test and not include it in the output.

I think I am leaning towards (2), but let me know what you think. (Also @xuzhao9 and @jamesr66a)
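A hypothetical sketch of what option (2) could look like from a model's point of view. The class and argument names are illustrative only, not the actual torchbench API:

```python
# Hypothetical sketch of option (2): quantization as an extra benchmark mode.
# Class/argument names are illustrative, not the actual torchbench API.
import torch

class Model:
    """A torchbench-style model wrapper without quantization support."""
    def __init__(self, device="cpu", jit=False, quant=False):
        if quant:
            # The harness treats NotImplementedError as "skip this combination",
            # so non-quantized models simply drop out of the quant mode.
            raise NotImplementedError("quantization is not implemented for this model")
        self.module = torch.nn.Linear(16, 16).to(device)
        self.example = torch.randn(4, 16, device=device)

    def eval(self):
        with torch.no_grad():
            return self.module(self.example)
```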

wconstab avatar Mar 10 '21 17:03 wconstab

To clarify, I think there's actually three execution modes being discussed here:

a) Training with QAT quant-dequant nodes inserted at various program points
b) The calibration step of post-training quantization, running the model in inference with instrumentation to collect quantization statistics
c) Inference runtime with quantized operators (post-calibration and after a conversion step)

In all cases I think the goal is to collect statistics about the model runtime rather than convergence properties. When I proposed adding quantization to torchbench I was referring mostly to (c). IIUC (a) and (b) would basically be sanity checks that make sure the quant/dequant nodes or observer nodes don't slow down execution too much from baseline (@HDCharles @vkuzo correct me if i'm wrong).

Given that (a) is coupled with training and (b) and (c) are coupled with inference, I'm not sure how to fit this into the parametrize interface. Is it true that parametrize exposes the cross product of all parameters? Is there a way to basically say "QAT on the inference benchmark is not valid, skip"?
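For what it's worth, stacked `pytest.mark.parametrize` decorators do yield the cross product of their parameters; how torchbench actually parametrizes may differ. A sketch of skipping invalid combinations, with hypothetical parameter names:

```python
# Sketch: stacked parametrize decorators yield the cross product of parameters;
# invalid (quant, mode) combinations are skipped explicitly. Parameter names
# are hypothetical, not torchbench's actual fixtures.
import pytest

@pytest.mark.parametrize("mode", ["train", "eval"])
@pytest.mark.parametrize("quant", ["none", "qat", "ptq"])
def test_bench(mode, quant):
    if quant == "qat" and mode != "train":
        pytest.skip("QAT applies to the training benchmark only")
    if quant == "ptq" and mode != "eval":
        pytest.skip("quantized inference applies to the eval benchmark only")
    ...  # run the (mode, quant) benchmark here
```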

jamesr66a avatar Mar 10 '21 18:03 jamesr66a

I'm a little worried about scope creep here before we actually know that this data is reliable. Would it make sense to just do the simplest possible thing first (just land a PR measuring inference time for a single model with quantized kernels), look at the data coming out, and reevaluate based on what we see?

vkuzo avatar Mar 10 '21 18:03 vkuzo

@vkuzo I agree with that, iiuc you're referring to (c) in my list?

jamesr66a avatar Mar 10 '21 18:03 jamesr66a

> @vkuzo I agree with that, iiuc you're referring to (c) in my list?

yeah, that sounds great to me. If I had to rank a, b and c by priority, it would be

  1. c (inference)
  2. a (QAT training time)
  3. b (PTQ calibration time)

vkuzo avatar Mar 10 '21 18:03 vkuzo

Well, here is an initial PR: https://github.com/pytorch/benchmark/pull/323

This one covers (c) and (a).

HDCharles avatar Mar 12 '21 08:03 HDCharles

> i.e. if you have a pretrained model, you could either fine-tune it with QAT or run calibration, depending on the context.

@HDCharles, please confirm whether pretrained quantized models have to be calibrated for benchmarking, or whether something like #417 would suffice for benchmarking. Thank you!

imaginary-person avatar Jun 30 '21 05:06 imaginary-person

> @HDCharles, please confirm whether pretrained quantized models have to be calibrated for benchmarking, or whether something like #417 would suffice for benchmarking. Thank you!

As long as the model does not have data-dependent control flow, there should be no need to do calibration before benchmarking.

vkuzo avatar Jun 30 '21 22:06 vkuzo