Add DeepSpeed-Inference to the list of supported backends
Description
Currently we don't support any runtime specific to transformer models. DeepSpeed has implemented a runtime we could use for accelerating Transformer models at inference time.
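For reference, here is a minimal sketch of what the DeepSpeed-Inference runtime looks like on a Hugging Face model. The exact `init_inference` kwargs vary across DeepSpeed versions, and the GPT2 checkpoint, FP16 setting and CUDA device here are illustrative assumptions, not part of the issue:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Replace the transformer blocks with DeepSpeed's fused inference kernels.
ds_engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,             # FP16 kernels; FP32/INT8 are also supported
    replace_with_kernel_inject=True,
)
model = ds_engine.module

inputs = tokenizer("Hello, my dog is", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```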
Integration
The DeepSpeed-Inference module will be added as a Compiler in `speedster`. This means that we need to implement both the runtime in full precision and the quantization techniques supported by DeepSpeed-Inference.
The compiler will have a PyTorch interface, since the DeepSpeed library is fully implemented in PyTorch, and thus it will be conceptually similar to other compilers with a PyTorch interface (see Torch-TensorRT as an example).
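As a rough sketch of the integration, the compiler could wrap `deepspeed.init_inference` and the learner could simply forward calls to the compiled module. The class names and signatures below are assumptions for illustration only; the real base classes in nebullvm/speedster will differ:

```python
import torch
import deepspeed


class DeepSpeedInferenceCompiler:
    """Compile a PyTorch model with the DeepSpeed-Inference runtime (hypothetical shape)."""

    def __init__(self, dtype: torch.dtype = torch.float16):
        self.dtype = dtype

    def compile(self, model: torch.nn.Module) -> torch.nn.Module:
        # Inject DeepSpeed's fused kernels into the transformer layers.
        engine = deepspeed.init_inference(
            model,
            dtype=self.dtype,
            replace_with_kernel_inject=True,
        )
        return engine.module


class DeepSpeedInferenceLearner(torch.nn.Module):
    """Wrap the compiled model so it exposes the same interface as the original model."""

    def __init__(self, compiled_model: torch.nn.Module):
        super().__init__()
        self.compiled_model = compiled_model

    @torch.no_grad()
    def forward(self, *args, **kwargs):
        return self.compiled_model(*args, **kwargs)
```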
TODO list
- [ ] Build a PoC of the feature, using DeepSpeed-Inference to accelerate the most used models on HF, such as
- GPT2
- BERT
- [ ] Compare the final latency with the results obtained with Speedster (see the benchmark sketch after this list)
- [ ] If the feature shows a positive impact, implement it as a Compiler in Speedster. Note that when implementing a new Compiler we also need to implement its `InferenceLearner`: `InferenceLearner`s are the Python objects we use to wrap the compiled model and expose an interface similar to the original model's.
- [ ] Fork the nebullvm repo https://github.com/nebuly-ai/nebullvm
- [ ] Read the Contribution Guidelines
- [ ] Create a PR to main explaining your changes and showing the improvements obtained with the new Compiler with respect to the previous version
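For the latency-comparison item above, a possible PoC script could look like the following. The model choice, prompt, and timing loop are illustrative assumptions; it assumes a CUDA GPU and an installed `deepspeed`:

```python
import time
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer


def benchmark(model, inputs, n_iters=50, warmup=10):
    """Return the mean forward-pass latency in seconds."""
    with torch.no_grad():
        for _ in range(warmup):
            model(**inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(**inputs)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters


tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Benchmarking DeepSpeed-Inference", return_tensors="pt").to("cuda")

# Baseline: plain FP16 PyTorch model.
baseline = AutoModelForCausalLM.from_pretrained("gpt2").half().to("cuda").eval()
baseline_latency = benchmark(baseline, inputs)

# DeepSpeed-Inference with kernel injection.
ds_model = deepspeed.init_inference(
    AutoModelForCausalLM.from_pretrained("gpt2"),
    dtype=torch.float16,
    replace_with_kernel_inject=True,
).module
ds_latency = benchmark(ds_model, inputs)

print(f"baseline: {baseline_latency * 1e3:.2f} ms, deepspeed: {ds_latency * 1e3:.2f} ms")
```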
Is anyone working on this? I'd like to give this a shot 😃
Hello @BrianPulfer! Thank you very much! I assigned the issue to you. Feel free to ping me if you have any questions about the issue or the code.