pentium3
# summary

## key problem

### workload

Efficient generative inference for **Transformer models** (while #256 can be generally applied to all DNN models): large deep models, with tight latency targets...
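The latency pressure comes from autoregressive decoding: each output token requires a full forward pass that depends on all previously generated tokens, so generation is inherently sequential. A minimal sketch below, with a hypothetical `toy_forward` standing in for a real Transformer forward pass (the next-token rule is invented purely for illustration):

```python
def toy_forward(tokens):
    # Stand-in for a Transformer forward pass; hypothetical next-token rule.
    # In a real model, cost grows with context length (hence KV caching).
    return (sum(tokens) + len(tokens)) % 100

def generate(prompt, n_new):
    # Autoregressive loop: one sequential forward pass per generated token,
    # so per-token latency adds up and cannot be parallelized across steps.
    tokens = list(prompt)
    for _ in range(n_new):
        nxt = toy_forward(tokens)
        tokens.append(nxt)
    return tokens[len(prompt):]

print(generate([1, 2, 3], 4))  # → [9, 19, 39, 79]
```

The serial dependence between steps is what systems like the ones linked below (batching schedulers, paged KV caches, offloading) try to amortize.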
- Orca (OSDI '22): https://www.usenix.org/conference/osdi22/presentation/yu
- FlexGen (ICML '23): https://proceedings.mlr.press/v202/sheng23a.html
- vLLM repo: https://github.com/vllm-project/vllm
- Zhihu post (Chinese): https://zhuanlan.zhihu.com/p/660282411
- Shepherd: https://www.anuragkhandelwal.com/papers/shepherd.pdf
- https://dl.acm.org/doi/pdf/10.1145/3600006.3613175
- https://dl.acm.org/doi/10.1145/3575693.3575721
- Zhihu post (Chinese): https://zhuanlan.zhihu.com/p/648759542
- DiffusionPipe: https://assets.amazon.science/4b/ee/9fa14afa47d3bcaa9c54b904daa5/diffusionpipe-training-large-diffusion-models-with-efficient-pipelines.pdf