DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.


Model Implementations for Inference (MII) is a library from DeepSpeed, designed to make low-latency, low-cost inference of powerful transformer models not only feasible but also easily accessible. It does so by offering access to highly optimized implementations of thousands of widely used DL models. In fact, straight out-of-the-box, MII-supported models can be deployed on-premise with just a few lines of code.

Note: MII is currently in a pre-release phase; this repo will be actively updated over the next several weeks with additional features, performance breakdowns, comparisons to other frameworks, etc.

How does MII work?

Under the hood, MII is powered by DeepSpeed-Inference. Based on the model type, model size, batch size, and available hardware resources, MII automatically applies the appropriate set of system optimizations from DeepSpeed-Inference to minimize latency and maximize throughput. It does so using one of many pre-specified model injection policies that allow DeepSpeed-Inference to identify the underlying PyTorch model architecture and replace it with an optimized implementation. This injection can replace single-GPU modules with multi-GPU variants, enabling models to run on a single GPU device or to seamlessly scale to tens of GPUs for dense models and hundreds of GPUs for sparse models, for lower latency and higher throughput.

MII makes the expansive set of optimizations in DeepSpeed-Inference easily accessible to its users by automatically integrating them with thousands of popular transformer models. For the full set of optimizations in DeepSpeed-Inference, please see our paper: DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.
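
MII drives this injection automatically, but the underlying mechanism resembles DeepSpeed-Inference's module-injection entry point. Below is a minimal, illustrative sketch using the public deepspeed.init_inference API; it is not MII's internal code path, and the model and parameter values are only examples:

import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Load a Hugging Face model, then let DeepSpeed-Inference swap its modules
# for optimized kernels via an injection policy (MII automates this step).
model = AutoModelForCausalLM.from_pretrained("gpt2")
engine = deepspeed.init_inference(model,
                                  mp_size=1,                        # tensor-parallel degree
                                  dtype=torch.float16,              # run inference in fp16
                                  replace_with_kernel_inject=True)  # apply kernel injection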

Supported Models and Tasks

MII currently supports over 20,000 models across a range of tasks such as text generation, question answering, and text classification. The models accelerated by MII are available through multiple open-source model repositories such as Hugging Face, FairSeq, and EleutherAI. We support dense models based on BERT, RoBERTa, or GPT architectures, ranging from a few hundred million parameters to tens of billions of parameters in size. We continue to expand the list, with support for massive hundred-billion-plus-parameter dense and sparse models coming soon.

MII model support will continue to grow over time, check back for updates! Currently we support the following Hugging Face Transformers model families:

model family    size range      ~model count
bloom           0.3B - 176B     40
gptj            1.4B - 6B       80
gpt_neo         0.1B - 2.7B     240
gpt2            0.3B - 1.5B     6,500
roberta         0.1B - 0.3B     3,200
bert            0.1B - 0.3B     10,000

Getting Started with MII

Installation

pip install . will install all dependencies required for deployment. A PyPI release of MII is coming soon.

Deploying with MII

MII allows supported models to be deployed with just a few lines of code on-premise.

Several deployment and query examples can be found here: examples/local

As an example here is a deployment of the bigscience/bloom-350m model from Hugging Face:

Deployment

import mii

# Deploy bigscience/bloom-350m on a single GPU in fp16.
mii_configs = {"tensor_parallel": 1, "dtype": "fp16"}
mii.deploy(task="text-generation",
           model="bigscience/bloom-350m",
           deployment_name="bloom350m_deployment",
           mii_config=mii_configs)

This will deploy the model onto a single GPU and start the gRPC server that can later be queried.
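
To scale a deployment across multiple GPUs, increase tensor_parallel in the MII config. A minimal sketch, assuming two local GPUs and using EleutherAI/gpt-j-6B from the gptj family above (the model choice and parallelism degree are illustrative, and whether a given model fits depends on available GPU memory):

import mii

# Shard the model across 2 GPUs with tensor parallelism.
mii_configs = {"tensor_parallel": 2, "dtype": "fp16"}
mii.deploy(task="text-generation",
           model="EleutherAI/gpt-j-6B",
           deployment_name="gptj6b_deployment",
           mii_config=mii_configs)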

Query

import mii

# Connect to the running deployment and send two prompts.
generator = mii.mii_query_handle("bloom350m_deployment")
result = generator.query({"query": ["DeepSpeed is", "Seattle is"]}, do_sample=True, max_new_tokens=30)
print(result)

The only required key is "query"; any additional keyword arguments passed outside the dictionary are forwarded to generate as kwargs. For Hugging Face provided models, you can find all possible arguments in their documentation for generate.
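
For example, standard generate sampling parameters such as temperature and top_p can be passed the same way (a sketch; the particular values here are only illustrative):

# Any generate kwargs can be supplied alongside the query dictionary.
result = generator.query({"query": ["DeepSpeed is"]},
                         do_sample=True,
                         temperature=0.7,
                         top_p=0.9,
                         max_new_tokens=50)
print(result)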

Shutdown Deployment

mii.terminate("bloom350m_deployment")

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.