pentium3
# summary

## key problem

### workload

Efficient generative inference for **Transformer models** (while #256 can be generally applied to all DNN models): large deep models, with tight latency targets...
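The latency pressure comes from autoregressive decoding: each output token requires a full forward pass that depends on all previously generated tokens, so generation is inherently sequential. A minimal sketch below, with a hypothetical `toy_forward` standing in for a real Transformer forward pass (the next-token rule is invented purely for illustration):

```python
def toy_forward(tokens):
    # Stand-in for a Transformer forward pass; hypothetical next-token rule.
    # In a real model, cost grows with context length (hence KV caching).
    return (sum(tokens) + len(tokens)) % 100

def generate(prompt, n_new):
    # Autoregressive loop: one sequential forward pass per generated token,
    # so per-token latency adds up and cannot be parallelized across steps.
    tokens = list(prompt)
    for _ in range(n_new):
        nxt = toy_forward(tokens)
        tokens.append(nxt)
    return tokens[len(prompt):]

print(generate([1, 2, 3], 4))  # → [9, 19, 39, 79]
```

The serial dependence between steps is what systems like the ones linked below (batching schedulers, paged KV caches, offloading) try to amortize.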
- Orca (OSDI '22): https://www.usenix.org/conference/osdi22/presentation/yu
- FlexGen (ICML '23): https://proceedings.mlr.press/v202/sheng23a.html
- vLLM repo: https://github.com/vllm-project/vllm
- Zhihu post (Chinese): https://zhuanlan.zhihu.com/p/660282411
- Shepherd: https://www.anuragkhandelwal.com/papers/shepherd.pdf
- https://dl.acm.org/doi/pdf/10.1145/3600006.3613175
- https://dl.acm.org/doi/10.1145/3575693.3575721
- Zhihu post (Chinese): https://zhuanlan.zhihu.com/p/648759542
- DiffusionPipe: https://assets.amazon.science/4b/ee/9fa14afa47d3bcaa9c54b904daa5/diffusionpipe-training-large-diffusion-models-with-efficient-pipelines.pdf