Benchmarking different RoPE impls
Trying torch scripting and applying the rotations in the complex plane instead of R²
Benchmarked different implementations of RoPE:
- Base: what we have now, rotates in R²
- TorchScript: Same as base but adds torch scripting to the apply_rotary_pos_emb() function
- Complex: Instead of rotating in R² we rotate in the complex plane.
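To make the difference between the R² and complex variants concrete, here is a dependency-free sketch in plain Python (not the PyTorch code that was benchmarked). It assumes consecutive features are paired as (x[2i], x[2i+1]) and uses a frequency base of 10000; both are assumptions for illustration, and the helper names rope_angles, rotate_r2 and rotate_complex are hypothetical. The point is only that an explicit 2x2 rotation and a multiplication by exp(iθ) give the same result.

```python
import cmath
import math


def rope_angles(pos, dim, base=10000.0):
    """Rotation angle for each of the dim // 2 feature pairs at position pos.

    base=10000.0 is an assumed default, not taken from the benchmark.
    """
    return [pos * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def rotate_r2(x, pos):
    """Rotate each pair (x[2i], x[2i+1]) with an explicit 2x2 rotation in R²."""
    out = []
    for i, theta in enumerate(rope_angles(pos, len(x))):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [a * c - b * s, a * s + b * c]
    return out


def rotate_complex(x, pos):
    """Treat each pair as a + bi and multiply it by exp(i * theta)."""
    out = []
    for i, theta in enumerate(rope_angles(pos, len(x))):
        z = complex(x[2 * i], x[2 * i + 1]) * cmath.exp(1j * theta)
        out += [z.real, z.imag]
    return out


x = [0.5, -1.0, 2.0, 0.25]
same = all(
    math.isclose(u, v, abs_tol=1e-12)
    for u, v in zip(rotate_r2(x, 3), rotate_complex(x, 3))
)
print(same)
```

The two functions compute the same numbers; the complex form just folds the four multiplies and two adds of the rotation into a single complex multiply, which is the kind of change that could plausibly speed up a tensor implementation.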
Results are below. Key takeaway: TorchScript makes the forward pass ~2x as fast as base, and complex is about 4x as fast as base for the forward pass. The backward pass is unaffected by the choice of impl and takes up the vast majority of the time, so once it is included the overall differences are much less stark.
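As a rough illustration of why the combined numbers look less dramatic, the arithmetic can be sketched as an Amdahl-style calculation. The forward-pass share of 10% used here is a hypothetical placeholder, not a measured value from the benchmark, and overall_speedup is a name made up for this example.

```python
def overall_speedup(forward_share, forward_speedup):
    """Speedup of forward + backward when only the forward pass gets faster.

    forward_share: fraction of total time spent in the forward pass (0..1).
    forward_speedup: how many times faster the forward pass becomes.
    """
    return 1.0 / ((1.0 - forward_share) + forward_share / forward_speedup)


# HYPOTHETICAL share of time spent in the forward pass; not a benchmark result.
FORWARD_SHARE = 0.10

for name, speedup in [("TorchScript", 2.0), ("Complex", 4.0)]:
    print(f"{name}: {overall_speedup(FORWARD_SHARE, speedup):.2f}x overall")
```

Under that assumed 10% share, a 2x forward speedup works out to roughly 1.05x overall and a 4x forward speedup to roughly 1.08x, which matches the qualitative picture of a large forward-pass win mostly disappearing once the backward pass is counted.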
Here is the script for running the benchmark, and here are the various RoPE implementations.
I apologize for our delay in response. In order to help surface current, unresolved issues, we are closing tickets prior to February 29. Please reopen your ticket if you are continuing to experience this issue. Thank you!