Add DeepSpeed-Inference to the list of supported backends
Description
Currently we don't support any runtime specific to transformer models. DeepSpeed has implemented a runtime we could use for accelerating Transformer models at inference time.
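For reference, here is a minimal sketch of what the DeepSpeed-Inference runtime looks like on a Hugging Face model. The exact `init_inference` kwargs vary across DeepSpeed versions, and the GPT2 checkpoint, FP16 setting and CUDA device here are illustrative assumptions, not part of the issue:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Replace the transformer blocks with DeepSpeed's fused inference kernels.
ds_engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,             # FP16 kernels; FP32/INT8 are also supported
    replace_with_kernel_inject=True,
)
model = ds_engine.module

inputs = tokenizer("Hello, my dog is", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```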
Integration
The DeepSpeed-Inference module will be added as a Compiler in `speedster`. This means that we need to implement both the runtime in full precision and the quantization techniques supported by DeepSpeed-Inference.
The compiler will have a PyTorch interface, since the DeepSpeed library is fully implemented in PyTorch, and thus it will be conceptually similar to other compilers with a PyTorch interface (see Torch-TensorRT as an example).
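As a rough sketch of the integration, the compiler could wrap `deepspeed.init_inference` and the learner could simply forward calls to the compiled module. The class names and signatures below are assumptions for illustration only; the real base classes in nebullvm/speedster will differ:

```python
import torch
import deepspeed


class DeepSpeedInferenceCompiler:
    """Compile a PyTorch model with the DeepSpeed-Inference runtime (hypothetical shape)."""

    def __init__(self, dtype: torch.dtype = torch.float16):
        self.dtype = dtype

    def compile(self, model: torch.nn.Module) -> torch.nn.Module:
        # Inject DeepSpeed's fused kernels into the transformer layers.
        engine = deepspeed.init_inference(
            model,
            dtype=self.dtype,
            replace_with_kernel_inject=True,
        )
        return engine.module


class DeepSpeedInferenceLearner(torch.nn.Module):
    """Wrap the compiled model so it exposes the same interface as the original model."""

    def __init__(self, compiled_model: torch.nn.Module):
        super().__init__()
        self.compiled_model = compiled_model

    @torch.no_grad()
    def forward(self, *args, **kwargs):
        return self.compiled_model(*args, **kwargs)
```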
TODO list
- [ ] Build a PoC of the feature, using DeepSpeed-Inference to accelerate the most used models on HF, such as
- GPT2
- BERT
- [ ] Compare the final latency with the results obtained with Speedster (see the benchmark sketch after this list)
- [ ] If the feature shows a positive impact, implement it as a Compiler in Speedster. Note that when implementing a new Compiler we also need to implement its `InferenceLearner`: `InferenceLearner`s are the Python objects we use to wrap the compiled model and expose an interface similar to the original model's.
- [ ] Fork the nebullvm repo https://github.com/nebuly-ai/nebullvm
- [ ] Read the Contribution Guidelines
- [ ] Create a PR to main explaining your changes and showing the improvements obtained with the new Compiler with respect to the previous version
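For the latency-comparison item above, a possible PoC script could look like the following. The model choice, prompt, and timing loop are illustrative assumptions; it assumes a CUDA GPU and an installed `deepspeed`:

```python
import time
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer


def benchmark(model, inputs, n_iters=50, warmup=10):
    """Return the mean forward-pass latency in seconds."""
    with torch.no_grad():
        for _ in range(warmup):
            model(**inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(**inputs)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters


tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Benchmarking DeepSpeed-Inference", return_tensors="pt").to("cuda")

# Baseline: plain FP16 PyTorch model.
baseline = AutoModelForCausalLM.from_pretrained("gpt2").half().to("cuda").eval()
baseline_latency = benchmark(baseline, inputs)

# DeepSpeed-Inference with kernel injection.
ds_model = deepspeed.init_inference(
    AutoModelForCausalLM.from_pretrained("gpt2"),
    dtype=torch.float16,
    replace_with_kernel_inject=True,
).module
ds_latency = benchmark(ds_model, inputs)

print(f"baseline: {baseline_latency * 1e3:.2f} ms, deepspeed: {ds_latency * 1e3:.2f} ms")
```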
Is anyone working on this? I'd like to give this a shot 😃
Hello @BrianPulfer! Thank you very much! I assigned the issue to you. Feel free to ping me if you have any questions about the issue or the code.