[REQUEST] Inference-Optimized Pipeline Parallelism
As described in the paper https://arxiv.org/abs/2207.00032, DeepSpeed inference supports pipeline parallelism, including hybrid scheduling, activation offloading, and communication optimizations, which led to significant performance improvements. However, does the current DeepSpeed release actually support these features? If not, is there a timeline for when they will be supported?
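For context on what pipeline-parallel inference does: the model is split into consecutive stages, and the input batch is split into micro-batches that are streamed through the stages so that different stages can work on different micro-batches at the same time. Below is a minimal toy sketch of that schedule in plain Python, purely illustrative; it is not DeepSpeed's API, and all function names here (`make_stages`, `pipelined_inference`) are hypothetical:

```python
# Toy sketch of pipeline-parallel inference with micro-batching.
# NOTE: illustrative only -- not DeepSpeed's actual API or scheduler.

def make_stages(num_stages):
    # Each stage applies a simple transform standing in for a model partition;
    # stage s adds s to every element.
    return [lambda x, s=s: [v + s for v in x] for s in range(num_stages)]

def pipelined_inference(batch, num_stages=2, num_micro_batches=4):
    stages = make_stages(num_stages)
    # Split the batch into micro-batches so stages can overlap work.
    size = len(batch) // num_micro_batches
    micro = [batch[i * size:(i + 1) * size] for i in range(num_micro_batches)]
    outputs = [None] * num_micro_batches
    # Simulate the schedule: at each tick, stage s processes micro-batch
    # (tick - s), so in a real system all stages would run concurrently.
    # Iterating stages in reverse keeps each micro-batch from advancing
    # more than one stage per tick.
    for tick in range(num_micro_batches + num_stages - 1):
        for s in reversed(range(num_stages)):
            m = tick - s
            if 0 <= m < num_micro_batches:
                data = micro[m] if s == 0 else outputs[m]
                outputs[m] = stages[s](data)
    return [v for mb in outputs for v in mb]
```

With 2 stages, 4 micro-batches, and a batch of 8 items, the full batch takes 5 ticks instead of the 8 stage-passes a purely sequential schedule would need, which is the latency/throughput win the paper's hybrid scheduling builds on.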
I'm also very interested in this issue. It would be great to get a clear response from the community on this matter. Thanks!
@champson, @yefanhust are there specific models/scenarios you are looking to apply pipeline parallelism to? The scenarios in which PP helps inference are very narrow and currently applicable in just a handful of cases, so we have de-prioritized releasing these features. But we can revisit this if there is strong interest in the community.
Hi! I am also interested in the feature described in this paper. Is there any demo or tutorial for the 'hybrid pipeline inference schedule'?