efficientvit
EfficientViT is a new family of vision models for efficient high-resolution dense prediction.
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction (paper, poster)
News
If you are interested in getting updates, please join our mailing list here.
- [2024/04/23] We released the training code of EfficientViT-SAM.
- [2024/04/06] EfficientViT-SAM is accepted by eLVM@CVPR'24.
- [2024/03/19] Online demo of EfficientViT-SAM is available: https://evitsam.hanlab.ai/.
- [2024/02/07] We released EfficientViT-SAM, the first accelerated SAM model that matches/outperforms SAM-ViT-H's zero-shot performance, delivering a state-of-the-art performance-efficiency trade-off.
- [2023/11/20] EfficientViT is available in the NVIDIA Jetson Generative AI Lab.
- [2023/09/12] EfficientViT is highlighted on the MIT home page and in MIT News.
- [2023/07/18] EfficientViT is accepted by ICCV 2023.
About EfficientViT Models
EfficientViT is a new family of ViT models for efficient high-resolution dense prediction vision tasks. The core building block of EfficientViT is a lightweight, multi-scale linear attention module that achieves a global receptive field and multi-scale learning with only hardware-efficient operations, making EfficientViT TensorRT-friendly and suitable for GPU deployment.
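To make the core idea concrete, here is a minimal, self-contained sketch of ReLU linear attention, the mechanism underlying the multi-scale linear attention module. It is simplified to a single scale, and the tensor shapes and module layout are illustrative; treat it as a sketch of the technique, not the repository's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLULinearAttention(nn.Module):
    """Simplified sketch of ReLU linear attention.

    Replacing softmax with ReLU feature maps lets attention be computed as
    Q' @ (K'^T @ V), which is linear in the number of tokens N instead of
    quadratic -- the property that makes high-resolution inputs affordable.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) tokens, e.g. a flattened H*W feature map.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.heads, C // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, d)

        # ReLU feature map instead of softmax: only hardware-efficient ops.
        q, k = F.relu(q), F.relu(k)

        # Aggregate K^T V once (d x d per head), then apply Q: O(N * d^2).
        kv = torch.einsum("bhnd,bhne->bhde", k, v)
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

if __name__ == "__main__":
    x = torch.randn(2, 64 * 64, 256)  # a high-resolution token grid
    print(ReLULinearAttention(dim=256)(x).shape)  # torch.Size([2, 4096, 256])
```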
Third-Party Implementation/Integration
Getting Started
conda create -n efficientvit python=3.10
conda activate efficientvit
conda install -c conda-forge mpi4py openmpi
pip install -r requirements.txt
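After installation, a quick sanity check confirms that the environment resolves PyTorch and, if requirements.txt installed a CUDA-enabled build, that a GPU is visible:

```python
# Run inside the `efficientvit` conda environment.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```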
EfficientViT Applications
Segment Anything
- Datasets
- Pretrained Models
- Use in PyTorch (a minimal sketch follows Table 1 below)
- Evaluation
- Visualization
- Web Demo
- Deployment using ONNX and TensorRT
- Training
| Model | Resolution | COCO mAP | LVIS mAP | Params | MACs | Jetson Orin Latency (bs1) | A100 Throughput (bs16) | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| EfficientViT-SAM-L0 | 512x512 | 45.7 | 41.8 | 34.8M | 35G | 8.2ms | 762 images/s | link |
| EfficientViT-SAM-L1 | 512x512 | 46.2 | 42.1 | 47.7M | 49G | 10.2ms | 638 images/s | link |
| EfficientViT-SAM-L2 | 512x512 | 46.6 | 42.7 | 61.3M | 69G | 12.9ms | 538 images/s | link |
| EfficientViT-SAM-XL0 | 1024x1024 | 47.5 | 43.9 | 117.0M | 185G | 22.5ms | 278 images/s | link |
| EfficientViT-SAM-XL1 | 1024x1024 | 47.8 | 44.4 | 203.3M | 322G | 37.2ms | 182 images/s | link |
Table 1: Summary of all EfficientViT-SAM variants. COCO mAP and LVIS mAP are measured using ViTDet's predicted bounding boxes as prompts. End-to-end Jetson Orin latency (batch size 1) and A100 throughput (batch size 16) are measured with TensorRT and fp16.
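As referenced in the list above, here is a hypothetical sketch of box-prompted inference with EfficientViT-SAM. The model-zoo helper and predictor class names below are assumptions modeled on the original segment-anything predictor API; consult the repository's Segment Anything docs for the exact entry points.

```python
# Hypothetical sketch: box-prompted EfficientViT-SAM inference.
# Import paths and helper names are assumptions, not confirmed API.
import numpy as np

from efficientvit.sam_model_zoo import create_efficientvit_sam_model        # assumed
from efficientvit.models.efficientvit.sam import EfficientViTSamPredictor   # assumed

sam = create_efficientvit_sam_model("efficientvit-sam-l0", pretrained=True).cuda().eval()
predictor = EfficientViTSamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in for a real HWC RGB image
predictor.set_image(image)

# One box prompt in (x1, y1, x2, y2) pixel coordinates, as in the original SAM API.
masks, scores, _ = predictor.predict(box=np.array([100, 100, 400, 400]))
print(masks.shape, scores)
```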
Image Classification
- Datasets
- Pretrained Models
- Use in PyTorch (see the sketch after this list)
- Evaluation
- Deployment
- Training
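For the "Use in PyTorch" item above, here is a hypothetical sketch of single-image inference with a pretrained EfficientViT classifier, using standard torchvision ImageNet preprocessing. The create_cls_model entry point and the "l1" model name are assumptions; check the repository's classification docs for the exact API.

```python
# Hypothetical sketch: ImageNet classification with a pretrained EfficientViT model.
import torch
from PIL import Image
from torchvision import transforms

from efficientvit.cls_model_zoo import create_cls_model  # assumed entry point

model = create_cls_model("l1", pretrained=True).eval()   # assumed model name

# Standard ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")
with torch.inference_mode():
    logits = model(preprocess(image).unsqueeze(0))
print("predicted class index:", logits.argmax(dim=1).item())
```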
Semantic Segmentation
- Datasets
- Pretrained Models
- Use in PyTorch
- Evaluation
- Visualization
- Deployment

Demo
- GazeSAM: Combining EfficientViT-SAM with Gaze Estimation

Contact
Han Cai: [email protected]
TODO
- [x] ImageNet Pretrained models
- [x] Segmentation Pretrained models
- [x] ImageNet training code
- [x] EfficientViT L series, designed for cloud
- [x] EfficientViT for segment anything
- [ ] EfficientViT for image generation
- [ ] EfficientViT for CLIP
- [ ] EfficientViT for super-resolution
- [ ] Segmentation training code
Citation
If EfficientViT is useful or relevant to your research, please cite our paper:
@article{cai2022efficientvit,
  title={Efficientvit: Enhanced linear attention for high-resolution low-computation visual recognition},
  author={Cai, Han and Gan, Chuang and Han, Song},
  journal={arXiv preprint arXiv:2205.14756},
  year={2022}
}