
Question about attention computation

yuzhenmao opened this issue 11 months ago

Hi, thank you for the amazing demo and doc! I have a question regarding this section in zero-inference. It is mentioned that "Thus, our current implementation computes attention scores on CPU". May I ask if there is a detailed comparison of the latency or throughput between GPU-attention and CPU-attention to support this decision? I am also curious about the implementation details of the CPU-attention computation. Thank you!
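For context on what "computing attention scores on CPU" involves, here is a minimal NumPy sketch of single-head attention over a CPU-resident KV cache. This is an illustration only, not ZeRO-Inference's actual implementation; the function name and shapes are my own:

```python
import numpy as np

def cpu_attention(q, k, v):
    """Single-head attention computed entirely on CPU with NumPy.

    q: (n_q, d) query for the current token(s);
    k, v: (n_kv, d) keys/values read from a CPU-resident KV cache,
    as in the offloading scenario the docs describe.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (n_q, n_kv) attention scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key dimension
    return weights @ v                            # (n_q, d) attention output

# Toy sizes: one query token attending to a 1024-token KV cache.
rng = np.random.default_rng(0)
q = rng.standard_normal((1, 64))
k = rng.standard_normal((1024, 64))
v = rng.standard_normal((1024, 64))
out = cpu_attention(q, k, v)
print(out.shape)  # (1, 64)
```

The trade-off the question is about: doing this on CPU avoids transferring the (potentially very large) K and V tensors to the GPU each step, at the cost of slower matrix multiplies on the CPU.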

yuzhenmao commented Dec 15 '24