efficientvit
EfficientViT is a new family of vision models for efficient high-resolution dense prediction.
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction (paper, poster)
News
If you are interested in getting updates, please join our mailing list here.
- [2024/04/23] We released the training code of EfficientViT-SAM.
- [2024/04/06] EfficientViT-SAM is accepted by eLVM@CVPR'24.
- [2024/03/19] Online demo of EfficientViT-SAM is available: https://evitsam.hanlab.ai/.
- [2024/02/07] We released EfficientViT-SAM, the first accelerated SAM model that matches/outperforms SAM-ViT-H's zero-shot performance, delivering a state-of-the-art performance-efficiency trade-off.
- [2023/11/20] EfficientViT is available in the NVIDIA Jetson Generative AI Lab.
- [2023/09/12] EfficientViT is highlighted on the MIT home page and in MIT News.
- [2023/07/18] EfficientViT is accepted by ICCV 2023.
About EfficientViT Models
EfficientViT is a new family of ViT models for efficient high-resolution dense prediction vision tasks. The core building block of EfficientViT is a lightweight, multi-scale linear attention module that achieves a global receptive field and multi-scale learning with only hardware-efficient operations, making EfficientViT TensorRT-friendly and suitable for GPU deployment.
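To make the core idea concrete, here is a minimal, self-contained sketch of ReLU linear attention, the mechanism underlying the multi-scale linear attention module. It is simplified to a single scale, and the tensor shapes and module layout are illustrative; treat it as a sketch of the technique, not the repository's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLULinearAttention(nn.Module):
    """Simplified sketch of ReLU linear attention.

    Replacing softmax with ReLU feature maps lets attention be computed as
    Q' @ (K'^T @ V), which is linear in the number of tokens N instead of
    quadratic -- the property that makes high-resolution inputs affordable.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) tokens, e.g. a flattened H*W feature map.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.heads, C // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, d)

        # ReLU feature map instead of softmax: only hardware-efficient ops.
        q, k = F.relu(q), F.relu(k)

        # Aggregate K^T V once (d x d per head), then apply Q: O(N * d^2).
        kv = torch.einsum("bhnd,bhne->bhde", k, v)
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

if __name__ == "__main__":
    x = torch.randn(2, 64 * 64, 256)  # a high-resolution token grid
    print(ReLULinearAttention(dim=256)(x).shape)  # torch.Size([2, 4096, 256])
```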
Third-Party Implementation/Integration
Getting Started
conda create -n efficientvit python=3.10
conda activate efficientvit
conda install -c conda-forge mpi4py openmpi
pip install -r requirements.txt
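After installation, a quick sanity check confirms that the environment resolves PyTorch and, if requirements.txt installed a CUDA-enabled build, that a GPU is visible:

```python
# Run inside the `efficientvit` conda environment.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```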
EfficientViT Applications
Segment Anything
- Datasets
- Pretrained Models
- Use in PyTorch (a minimal sketch follows Table 1 below)
- Evaluation
- Visualization
- Web Demo
- Deployment using ONNX and TensorRT
- Training
| Model | Resolution | COCO mAP | LVIS mAP | Params | MACs | Jetson Orin Latency (bs1) | A100 Throughput (bs16) | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| EfficientViT-SAM-L0 | 512x512 | 45.7 | 41.8 | 34.8M | 35G | 8.2ms | 762 images/s | link |
| EfficientViT-SAM-L1 | 512x512 | 46.2 | 42.1 | 47.7M | 49G | 10.2ms | 638 images/s | link |
| EfficientViT-SAM-L2 | 512x512 | 46.6 | 42.7 | 61.3M | 69G | 12.9ms | 538 images/s | link |
| EfficientViT-SAM-XL0 | 1024x1024 | 47.5 | 43.9 | 117.0M | 185G | 22.5ms | 278 images/s | link |
| EfficientViT-SAM-XL1 | 1024x1024 | 47.8 | 44.4 | 203.3M | 322G | 37.2ms | 182 images/s | link |
Table 1: Summary of all EfficientViT-SAM variants. COCO mAP and LVIS mAP are measured using ViTDet's predicted bounding boxes as prompts. End-to-end Jetson Orin latency (batch size 1) and A100 throughput (batch size 16) are measured with TensorRT and fp16.
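As referenced in the list above, here is a hypothetical sketch of box-prompted inference with EfficientViT-SAM. The model-zoo helper and predictor class names below are assumptions modeled on the original segment-anything predictor API; consult the repository's Segment Anything docs for the exact entry points.

```python
# Hypothetical sketch: box-prompted EfficientViT-SAM inference.
# Import paths and helper names are assumptions, not confirmed API.
import numpy as np

from efficientvit.sam_model_zoo import create_efficientvit_sam_model        # assumed
from efficientvit.models.efficientvit.sam import EfficientViTSamPredictor   # assumed

sam = create_efficientvit_sam_model("efficientvit-sam-l0", pretrained=True).cuda().eval()
predictor = EfficientViTSamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in for a real HWC RGB image
predictor.set_image(image)

# One box prompt in (x1, y1, x2, y2) pixel coordinates, as in the original SAM API.
masks, scores, _ = predictor.predict(box=np.array([100, 100, 400, 400]))
print(masks.shape, scores)
```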
Image Classification
- Datasets
- Pretrained Models
- Use in PyTorch (see the sketch after this list)
- Evaluation
- Deployment
- Training
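For the "Use in PyTorch" item above, here is a hypothetical sketch of single-image inference with a pretrained EfficientViT classifier, using standard torchvision ImageNet preprocessing. The create_cls_model entry point and the "l1" model name are assumptions; check the repository's classification docs for the exact API.

```python
# Hypothetical sketch: ImageNet classification with a pretrained EfficientViT model.
import torch
from PIL import Image
from torchvision import transforms

from efficientvit.cls_model_zoo import create_cls_model  # assumed entry point

model = create_cls_model("l1", pretrained=True).eval()   # assumed model name

# Standard ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")
with torch.inference_mode():
    logits = model(preprocess(image).unsqueeze(0))
print("predicted class index:", logits.argmax(dim=1).item())
```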
Semantic Segmentation
- Datasets
- Pretrained Models
- Use in PyTorch
- Evaluation
- Visualization
- Deployment

Demo
- GazeSAM: Combining EfficientViT-SAM with Gaze Estimation

Contact
Han Cai: [email protected]
TODO
- [x] ImageNet Pretrained models
- [x] Segmentation Pretrained models
- [x] ImageNet training code
- [x] EfficientViT L series, designed for cloud
- [x] EfficientViT for segment anything
- [ ] EfficientViT for image generation
- [ ] EfficientViT for CLIP
- [ ] EfficientViT for super-resolution
- [ ] Segmentation training code
Citation
If EfficientViT is useful or relevant to your research, please cite our paper:
@article{cai2022efficientvit,
  title={Efficientvit: Enhanced linear attention for high-resolution low-computation visual recognition},
  author={Cai, Han and Gan, Chuang and Han, Song},
  journal={arXiv preprint arXiv:2205.14756},
  year={2022}
}