EET
Easy and Efficient Transformer: Scalable Inference Solution for Large NLP Models

EET (Easy and Efficient Transformer) is a friendly PyTorch inference plugin focused on Transformer-based models, built to make mega-size models affordable.
Features
- Supports mega-size models on a single GPU. EET can serve an 11B model online within a 24GB Ampere GPU.
- Expertise in inference for multi-modal and NLP tasks (CLIP/GPT-3/Bert/Seq2seq, etc.).
- High performance. Makes transformer-based models faster and faster through CUDA kernel optimization and quantization/sparsity algorithms.
- Out-of-the-box support for Transformers and Fairseq. Saves you the pain of trivial configuration and makes your model work within a few lines.
- Model Matrix
- Quick Start
  - Environment
  - Installation
    - From Source
    - From Docker
  - Run
    - Operators APIs
    - Model APIs
    - Application APIs
- Performance
- Cite Us
- Contact us
Model Matrix
Model type | Transformers | Fairseq | Quantization | SpeedUp | Since version |
---|---|---|---|---|---|
GPT-3 | ✅ | ✅ | ✅ | 2~8x | 0.0.1 beta |
Bert | ✅ | ✅ | ❌ | 1~5x | 0.0.1 beta |
ALBert | ✅ | ✅ | ❌ | 1~5x | 0.0.1 beta |
Roberta | ✅ | ❌ | ❌ | 1~5x | 0.0.1 beta |
T5 | ✅ | ❌ | ❌ | 4~8x | 1.0 |
ViT | ✅ | ❌ | ❌ | 1~5x | 1.0 |
CLIP(GPT+ViT) | ✅ | ❌ | ❌ | 2~4x | 1.0 |
DistilBert | ✅ | ❌ | ❌ | 1~2x | 1.0 |
Quick Start
Environment
- cuda:>=10.1
- python:>=3.7
- gcc:>= 7.4.0
- torch:>=1.5.0
- numpy:>=1.19.1
- fairseq
- transformers
The versions above are the minimum requirements; newer versions are recommended.
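A quick way to sanity-check these requirements (a small illustrative script, not part of EET):
import torch
import numpy
import transformers

print("torch:", torch.__version__)                # needs >= 1.5.0
print("cuda (torch build):", torch.version.cuda)  # needs >= 10.1
print("numpy:", numpy.__version__)                # needs >= 1.19.1
print("transformers:", transformers.__version__)
print("gpu available:", torch.cuda.is_available())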
Installation
We recommend using Docker images.
From Source
If you are installing from source, you will need to set up the required environment first. Then proceed as follows:
$ git clone https://github.com/NetEase-FuXi/EET.git
$ cd EET
$ pip install .
We recommend building on the nvcr.io/nvidia/pytorch:21.12-py3 image (or others in that series); you can also use the provided Dockerfile.
From Docker
$ git clone https://github.com/NetEase-FuXi/EET.git
$ docker build -t eet_docker:0.1 .
$ nvidia-docker run -it --net=host -v /your/project/directory/:/root/workspace eet_docker:0.1 bash
EET and its required environment are pre-installed in the Docker image.
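To verify the installation inside the container, a minimal import check (assumes only that the eet package is importable):
import eet
print("EET is importable from:", eet.__file__)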
Run
We provide three types of APIs:
- Operators APIs, such as embedding, masked multi-head attention, and FFN, enable you to define your own custom models.
- Model APIs, such as TransformerDecoder and BertEncoder, enable you to integrate EET into your PyTorch project.
- Application APIs, such as the Transformers pipeline, enable you to run your model in a few lines.
Operators APIs
Operators APIs are the intermediate layer between C++/CUDA and Python. We provide almost all the operators required for Transformer models; you can combine different OPs to build other model structures.
Operators API table:
operators | python API | Remarks |
---|---|---|
multi_head_attention | EETSelfAttention | self attention |
masked_multi_head_attention | EETSelfMaskedAttention | causal attention |
cross_multi_head_attention | EETCrossAttention | cross attention |
ffn | EETFeedforward | feed forward network |
embedding | EETBertEmbedding | corresponds to Fairseq and Transformers |
layernorm | EETLayerNorm | same as nn.LayerNorm |
How to use
These OPs are defined in EET/csrc/py11/eet2py.cpp, and usage examples can be found in the files under python/eet, which show how to combine these OPs into classic models.
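To illustrate the composition pattern without guessing EET's exact constructor signatures, here is a pure-PyTorch sketch of a decoder layer built from the same kinds of operators; each torch.nn stand-in is annotated with the EET operator it correspondsds to in the table above (this is an analogy, not EET code):
import torch
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ffn=3072):
        super().__init__()
        # ~ masked_multi_head_attention (EETSelfMaskedAttention)
        self.attn = nn.MultiheadAttention(d_model, n_heads)
        # ~ ffn (EETFeedforward): two linear layers with an activation
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
        # ~ layernorm (EETLayerNorm)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (seq_len, batch, hidden)
        seq_len = x.size(0)
        # the upper-triangular mask is what makes the attention causal ("masked")
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal)
        x = self.ln1(x + attn_out)
        return self.ln2(x + self.ffn(x))

out = DecoderLayerSketch()(torch.randn(16, 1, 768))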
Model APIs
As a plugin, EET provides friendly model APIs (python/eet) that integrate into Fairseq and Transformers.
All you need to do is find the corresponding class in the tables below (usually prefixed with 'EET') and initialize an object with the from_torch or from_pretrained function.
Note: We currently only support pre-padding for GPT-3.
EET and fairseq class comparison table:
EET | fairseq | Remarks |
---|---|---|
EETTransformerDecoder | TransformerDecoder | |
EETTransformerDecoderLayer | TransformerDecoderLayer | |
EETTransformerAttention | MultiheadAttention | |
EETTransformerFeedforward | TransformerDecoderLayer | fusion of multiple small operators |
EETTransformerEmbedding | Embedding + PositionalEmbedding | |
EETTransformerLayerNorm | nn.LayerNorm | |
EET and Transformers class comparison table:
EET | transformers | Remarks |
---|---|---|
EETBertModel | BertModel | |
EETBertEmbedding | BertEmbeddings | |
EETGPT2Model | GPT2Model | |
EETGPT2Decoder | GPT2Model | Transformers has no GPT2Decoder |
EETGPT2DecoderLayer | Block | |
EETGPT2Attention | Attention | |
EETGPT2Feedforward | MLP | |
EETGPT2Embedding | nn.Embedding | |
EETLayerNorm | nn.LayerNorm | |
In addition to the basic model types above, we have extended some task-specific APIs to support different tasks. The table below shows part of our task-specific model APIs:
EET | transformers | Remarks |
---|---|---|
EETBertForPreTraining | BertForPreTraining | |
EETBertLMHeadModel | BertLMHeadModel | |
EETBertForMaskedLM | BertForMaskedLM | |
EETBertForNextSentencePrediction | BertForNextSentencePrediction | |
EETBertForSequenceClassification | BertForSequenceClassification | |
EETBertForMultipleChoice | BertForMultipleChoice | |
EETBertForTokenClassification | BertForTokenClassification | |
EETBertForQuestionAnswering | BertForQuestionAnswering | |
How to use
This is a code snippet showing how to use the model APIs:
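The original snippet here was an image; in its place is a minimal sketch modeled on the fill-mask example further below. The model name and the max_batch/data_type values are illustrative assumptions:
import torch
from eet import EETBertModel
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# from_pretrained mirrors the Transformers API, plus EET's max_batch and data_type arguments
eet_model = EETBertModel.from_pretrained('bert-base-uncased', max_batch=1, data_type=torch.float16)
model_inputs = tokenizer(["hello EET"], return_tensors='pt')
out = eet_model(model_inputs['input_ids'].cuda(), attention_mask=model_inputs['attention_mask'])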
You can also build your application directly with the task-specific APIs. Here is a fill-mask example:
import torch
from eet import EETRobertaForMaskedLM
from transformers import RobertaTokenizer

max_batch_size = 1
data_type = torch.float16
input = ["My <mask> is Sarah and I live in London"]
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
eet_roberta_model = EETRobertaForMaskedLM.from_pretrained('roberta-base', max_batch=max_batch_size, data_type=data_type)
# first step: tokenize
model_inputs = tokenizer(input, return_tensors='pt')
masked_index = torch.nonzero(model_inputs['input_ids'][0] == tokenizer.mask_token_id, as_tuple=False).squeeze(-1)
# second step: predict
prediction_scores = eet_roberta_model(model_inputs['input_ids'].cuda(), attention_mask=model_inputs['attention_mask'])
# third step: argmax
predicted_index = torch.argmax(prediction_scores.logits[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens(predicted_index)
For more examples, please refer to example/python/models.
Application APIs
EET provides ready-made pipelines that simplify building applications for different tasks, without using the model APIs above.
Here is an example:
import torch
from eet import pipeline
max_batch_size = 1
model_path = 'roberta-base'
data_type = torch.float16
input = ["My <mask> is Sarah and I live in London"]
nlp = pipeline("fill-mask", model=model_path, data_type=data_type, max_batch_size=max_batch_size)
out = nlp(input)
Now we support these tasks:
Task | Since version |
---|---|
text-classification | 1.0 |
token-classification | 1.0 |
question-answering | 1.0 |
fill-mask | 1.0 |
text-generation | 1.0 |
image-classification | 1.0 |
zero_shot_image_classification | 1.0 |
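Other tasks follow the same pattern. For example, a text-generation pipeline might look like this (the model name here is an illustrative placeholder):
import torch
from eet import pipeline

nlp = pipeline("text-generation", model='gpt2', data_type=torch.float16, max_batch_size=1)
out = nlp(["My name is Sarah and I live in"])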
For more examples, please refer to example/python/pipelines.
Performance
Detailed performance data for GPT-3 and Bert model inference can be viewed at the link.
- GPT-3 on A100 (performance chart)
- Bert on 2080ti (performance chart)
Cite Us
If you use EET in your research, please cite the following paper.
@misc{https://doi.org/10.48550/arxiv.2104.12470,
  doi = {10.48550/ARXIV.2104.12470},
  url = {https://arxiv.org/abs/2104.12470},
  author = {Li, Gongzheng and Xi, Yadong and Ding, Jingzhen and Wang, Duan and Liu, Bai and Fan, Changjie and Mao, Xiaoxi and Zhao, Zeng},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Easy and Efficient Transformer : Scalable Inference Solution For large NLP model},
  publisher = {arXiv},
  year = {2021}
}
Video
We gave a talk on ZhiYuan LIVE: https://event.baai.ac.cn/activities/325.
Contact us
You can post your questions via GitHub issues.
You can also contact us by email: