Comparisons for Recognizers
Description
MMAction2 provides a large number of recognizers. To choose the right model for an application, it would help to compare all the models in one table, but I am not sure how best to do this.
I'm open to all suggestions.
Inference time statistics
- Inference time is my priority, so here is a table for this.
- The related code can be found here.
| model_name | Tesla V100-PCIE (32f / 16f / 8f) | GTX 1080ti (32f / 16f / 8f) | Jetson AGX Xavier (32f / 16f / 8f) |
|---|---|---|---|
| TSN_r50 | 31/17/10 | 52/26/14 | 258/134/80 |
| TSM_r50 | 34/19/13 | 59/30/16 | 278/145/86 |
| TSM_MobileNetV2 | 10/10/10 | 23/12/7 | 81/41/23 |
| TIN_r50 | 72/30/30 | 141/52/24 | 561/218/104 |
| TANet | 37/21/19 | 64/33/19 | 429/165/100 |
| I3D_r50 | 27/23/21 | 21/14/11 | 128/68/37 |
| 2Plus1d_r34 | 61/41/32 | 77/41/27 | 539/278/146 |
| CSN_r152 | 172/169/169 | 115/92/88 | 584/303/163 |
| SlowFast_r50 | 34/28/28 | 28/18/14 | 150/81/45 |
| SlowOnly_r50 | 58/38/28 | 79/41/23 | 576/301/160 |
| X3D | 95/93/92 | 84/52/49 | 415/212/112 |
Notes:
- The unit of inference time is milliseconds (ms).
- `32f, 16f, 8f` means the number of frames for model inputs.
- Default input shape for 2D recognizers is `(1, num_frames, 3, 224, 224)`.
- Default input shape for 3D recognizers is `(1, 1, 3, num_frames, 224, 224)`.
- TPN models and C3D models are not included yet.
- TPN models are not valid for 32 frames.
- C3D models only support input shape `(1, 1, 3, 16, 112, 112)`.
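Numbers like the ones above are typically produced with a warmup-then-average timing loop. A minimal sketch of such a helper (the function name is my own, not from the MMAction2 benchmark script):

```python
import time

def measure_latency_ms(forward, warmup=10, runs=100):
    """Average wall-clock latency of `forward()` in milliseconds.

    Warmup iterations are discarded so one-time costs (memory allocation,
    cudnn autotuning, etc.) do not skew the average. For GPU models,
    torch.cuda.synchronize() should additionally be called before each
    timestamp so queued kernels are actually finished.
    """
    for _ in range(warmup):
        forward()
    start = time.perf_counter()
    for _ in range(runs):
        forward()
    return (time.perf_counter() - start) * 1000.0 / runs
```

Here `forward` would be a closure that runs one model forward pass on a fixed dummy input of the default shape listed above.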
TODO
- [x] Inference time for PyTorch models with default config.
- [ ] Inference time for PyTorch/ONNX/TensorRT models with various configs
  - PyTorch models support fp16, fuse_conv_bn, cudnn, etc.
  - TensorRT models support fp16/int8.
- [ ] Detailed information for each model, such as FLOPs, gpu memory, training/test results, etc.
- [ ] Inference time for input preprocessing.
Thanks, this is great!
- For TPN and C3D, is there a way to put up a setting that is as fair as possible? For example, if it supports 8 or 16 frames only, then you can forward 32 frames in one batch with batch size 4 and 2, respectively.
- Generally there is a speed/accuracy trade-off. Reporting their accuracies on a common test set (e.g. pick 4000 vids from the test set of K400) would be helpful to evaluate any performance degradations for different precisions.
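The batching idea above can be sketched as simple shape arithmetic (a hypothetical helper, not part of MMAction2):

```python
def clips_as_batch(total_frames, frames_per_clip):
    """Map a fixed total frame budget onto a model with a fixed clip length.

    A model that only accepts `frames_per_clip` frames can still consume
    `total_frames` frames in one forward pass by stacking clips along the
    batch dimension, e.g. 32 frames -> a batch of 4 eight-frame clips.
    Returns an assumed 2D-recognizer-style input shape.
    """
    if total_frames % frames_per_clip != 0:
        raise ValueError("total_frames must be divisible by frames_per_clip")
    batch_size = total_frames // frames_per_clip
    return (batch_size, frames_per_clip, 3, 224, 224)
```

For a 16-frame model the same 32-frame budget gives batch size 2, so the latency columns stay roughly comparable across models.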
I went through the code of C3D. It turns out that we cannot modify the C3D config to support other input shapes. I haven't studied the code of TPN yet; I may take a look in April.
Maybe a table like this
| model type | model name | sampling strategy | v100/1080ti/agx latency (ms) | kinetics400 accuracy | sthv2 accuracy | comments |
|---|---|---|---|---|---|---|
| PyTorch | TSM-R50 | 1x1x8 | 13/16/86 | 70.24 / 89.56 | 57.86 / 61.12 | / |
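If the table grows to many models, rows like the one above could be generated programmatically. A sketch, where the column order and "top1 / top5" formatting are assumptions based on the example row:

```python
def table_row(model_type, name, sampling, latencies, k400_acc, sthv2_acc,
              comment="/"):
    """Format one markdown row of the proposed comparison table.

    `latencies` is (v100, 1080ti, agx) in ms; each accuracy argument is a
    (top1, top5) pair in percent.
    """
    lat = "/".join(str(x) for x in latencies)
    k400 = " / ".join(f"{a:.2f}" for a in k400_acc)
    sthv2 = " / ".join(f"{a:.2f}" for a in sthv2_acc)
    return f"| {model_type} | {name} | {sampling} | {lat} | {k400} | {sthv2} | {comment} |"
```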
Notes: We will support auto_fp16 using torch.cuda.amp in the future: https://github.com/open-mmlab/mmcv/pull/791
Not really. Most of the info in the above table is already present in the model zoo, or can be added to its tables (e.g. the v100/1080ti/agx latency column).
I think the most valuable part is the speed/accuracy benchmark for different precisions.
Something like that in GluonCV?
