Benchmarking different RoPE impls

Open saurabh111233212 opened this issue 2 years ago • 1 comments

Trying torch scripting and applying the rotations in the complex plane instead of R²

Oct 11 '23 15:10 saurabh111233212

Benchmarked different implementations of RoPE:

Base: what we have now; rotates in R².
TorchScript: same as Base, but with torch scripting applied to the apply_rotary_pos_emb() function.
Complex: rotates in the complex plane instead of R².
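
To make the difference between rotating in R² and rotating in the complex plane concrete, here is a minimal pure-Python sketch (not the benchmarked PyTorch code). It assumes the interleaved pairing (x[2i], x[2i+1]) and the usual RoPE angle schedule with base 10000; the function names are hypothetical, and the implementations in this issue may pair dimensions differently.

```python
import cmath
import math


def rope_angles(position, dim, base=10000.0):
    """Standard RoPE angle per pair: position * base^(-2i/dim). (Assumed schedule.)"""
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]


def rotate_r2(x, theta):
    """Rotate each pair (x[2i], x[2i+1]) by theta[i] with an explicit 2x2 rotation."""
    out = []
    for i, t in enumerate(theta):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(t), math.sin(t)
        out.extend([a * c - b * s, a * s + b * c])
    return out


def rotate_complex(x, theta):
    """Treat each pair as a + bi and multiply by e^{i*theta[i]}."""
    out = []
    for i, t in enumerate(theta):
        z = complex(x[2 * i], x[2 * i + 1]) * cmath.exp(1j * t)
        out.extend([z.real, z.imag])
    return out


x = [1.0, 0.0, 0.5, -2.0]
theta = rope_angles(position=3, dim=len(x))
r2 = rotate_r2(x, theta)
cx = rotate_complex(x, theta)

# Both formulations produce the same rotated vector (up to float rounding).
assert all(math.isclose(p, q, abs_tol=1e-12) for p, q in zip(r2, cx))
```

The two functions compute the same thing; the complex form folds the four multiplies and two adds of the 2x2 rotation into a single complex multiply per pair.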

Results are below. Key takeaway: TorchScript makes the forward pass about 2x as fast as Base, and Complex makes it about 4x as fast. The backward pass is unaffected by the choice of implementation and takes up the vast majority of the time, so once it is included the overall difference is much less stark.
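
A rough way to see why the combined numbers are less stark is an Amdahl's-law style estimate. The sketch below is only illustrative: the function name is hypothetical, and the 20% forward-pass share is an assumed placeholder, not a measured value from this benchmark.

```python
def overall_speedup(forward_fraction, forward_speedup):
    """Speedup of forward+backward when only the forward pass gets faster.

    forward_fraction: share of total time spent in the forward pass (0..1).
    forward_speedup:  factor by which the forward pass is accelerated.
    """
    if not 0.0 <= forward_fraction <= 1.0:
        raise ValueError("forward_fraction must be between 0 and 1")
    if forward_speedup <= 0:
        raise ValueError("forward_speedup must be positive")
    return 1.0 / ((1.0 - forward_fraction) + forward_fraction / forward_speedup)


# ASSUMPTION: the forward pass is 20% of total time (placeholder, not measured).
assumed_forward_fraction = 0.2
for label, speedup in [("TorchScript", 2.0), ("Complex", 4.0)]:
    print(label, round(overall_speedup(assumed_forward_fraction, speedup), 3))
```

The smaller the forward pass's share of total time, the closer the overall speedup gets to 1x, regardless of how much faster the forward pass becomes.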

Here is the script for running the benchmark, and here are the various RoPE implementations.

Oct 11 '23 15:10 saurabh111233212

I apologize for our delay in response. In order to help surface current, unresolved issues, we are closing tickets prior to February 29. Please reopen your ticket if you are continuing to experience this issue. Thank you!

Apr 30 '24 18:04 dumitrac