
Fine-tuning Axial RoPE with frequency scaling?

Open tasansal opened this issue 1 year ago • 0 comments

Hi @lucidrains

We have trained a 3D ViT masked autoencoder using axial RoPE at an image size of 512x512x512 (3D scientific images, sampled from much larger volumes). Now I want to try fine-tuning the pre-trained model for a larger context size (i.e. 1024x1024x1024). However, it isn't obvious to me how to do this. I am especially unclear on how to calculate the scale for axial RoPE correctly. Important note: we are not resizing the images; we tile the larger image with these "mini-cubes", so going up in size means we have more context.

I would love to hear your feedback on how to do this properly. Below is my thought process (and please correct me where I am wrong!).

Normally, with 1D RoPE, we have the theta_rescale_factor, which changes freqs in RoPE directly. However, when freqs_for is set to pixel, the theta parameter isn't used to build freqs, which is probably fine since we don't have a single sequence and instead reuse the [-1, 1] range for axial.
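To make the difference concrete, here is a minimal pure-Python sketch of how I understand the two frequency constructions. It is an approximation under assumptions, not a copy of the library code: the function names lang_freqs and pixel_freqs are hypothetical, and the NTK-style exponent dim / (dim - 2) and the default max_freq of 10 are my reading of the source and may not match it exactly.

```python
import math

def lang_freqs(dim, theta=10000.0, theta_rescale_factor=1.0):
    """Sketch of the 1D ("lang") frequencies, where theta matters.

    Assumption: the rescale factor is folded into theta with an
    NTK-style exponent before the frequencies are built.
    """
    theta = theta * theta_rescale_factor ** (dim / (dim - 2))
    return [1.0 / theta ** (i / dim) for i in range(0, dim, 2)]

def pixel_freqs(dim, max_freq=10.0):
    """Sketch of the "pixel" frequencies.

    Assumption: evenly spaced values from 1 to max_freq / 2, times pi.
    Note that there is no theta argument here at all.
    """
    n = dim // 2
    if n == 1:
        return [math.pi]
    step = (max_freq / 2 - 1.0) / (n - 1)
    return [(1.0 + k * step) * math.pi for k in range(n)]

# Rescaling theta lowers every non-constant "lang" frequency.
base = lang_freqs(8)
rescaled = lang_freqs(8, theta_rescale_factor=2.0)
lowered = all(r < b for r, b in zip(rescaled[1:], base[1:]))
```

If this reading is right, theta_rescale_factor has nothing to act on in the pixel case, so any change in effective frequency would have to come from max_freq or from the coordinates the frequencies are multiplied with.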

Anyhow, assuming the above is fine, we apply axial RoPE with apply_rotary_emb instead of rotate_queries_and_keys. It seems like rotate_queries_and_keys uses get_scale to calculate the scale and applies it to q and k separately. But if caching is disabled, is the scale hard-coded to be 1?
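For context, my understanding is that this scale comes from xPos-style length extrapolation. The sketch below is a deliberately simplified illustration, assuming a single scalar decay zeta per position (the real scale is, as far as I can tell, computed per feature dimension); the function name scaled_dot and the value of zeta are hypothetical. It shows that scaling q by zeta**m and k by zeta**(-n) leaves a factor that depends only on the relative offset m - n.

```python
def scaled_dot(q, k, m, n, zeta=0.98):
    """Dot product of q at position m and k at position n after
    xPos-style scaling (simplified to one scalar zeta).

    q is multiplied by zeta**m and k by zeta**(-n), so the product
    picks up zeta**(m - n): a function of the offset only.
    """
    q_scaled = [x * zeta ** m for x in q]
    k_scaled = [x * zeta ** (-n) for x in k]
    return sum(a * b for a, b in zip(q_scaled, k_scaled))

q = [0.5, -1.0, 2.0]
k = [1.5, 0.25, -0.75]
plain = sum(a * b for a, b in zip(q, k))

# Same offset (2) at different absolute positions gives the same result.
near_start = scaled_dot(q, k, m=5, n=3)
further_on = scaled_dot(q, k, m=12, n=10)
```

With a scale of exactly 1 both factors collapse to 1, which is consistent with plain RoPE when no xPos scaling is in use.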

Q1: Would it make sense to implement the same logic as in rotate_queries_and_keys for the axial variant?

Q2: Maybe an ignorant question, but why scale q with scale and k with scale**-1?

Q3: Is it OK to apply the scale directly using apply_rotary_emb and then fine-tune the model?

Q4: Is the scaling linear in the size of the dimension change? I.e., if I double the resolution, should the scale be 2.0 in that direction? Or do we need to account for diagonal distances etc. in the N-D case?

Q5: Is there any writeup (paper, pre-print etc) about axial-RoPE?
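To make Q4 concrete, here is a small sketch of what happens to per-axis coordinates when the token count along one axis doubles. Everything here is an assumption for illustration: the token counts (32 and 64 per axis, i.e. a hypothetical patch size of 16), the idea that axial positions are evenly spaced over [-1, 1], and the helper names. It compares two options, squeezing more tokens into [-1, 1] versus keeping the pre-training spacing and extending the range, without claiming either is the right one.

```python
def linspace(start, stop, n):
    """Evenly spaced values from start to stop inclusive (pure Python)."""
    if n == 1:
        return [float(start)]
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]

def spacing(coords):
    """Distance between neighbouring coordinates."""
    return coords[1] - coords[0]

n_pre, n_ft = 32, 64  # hypothetical tokens per axis: 512 / 16 and 1024 / 16

# Pre-training coordinates along one axis.
pre = linspace(-1.0, 1.0, n_pre)

# Option A: keep the [-1, 1] range, so neighbours move closer together.
squeezed = linspace(-1.0, 1.0, n_ft)

# Option B: keep the pre-training spacing, so the range grows past 1.
step = spacing(pre)
extended = [-1.0 + i * step for i in range(n_ft)]

# Ratio is (n_ft - 1) / (n_pre - 1), i.e. close to but not exactly 2.
ratio = spacing(pre) / spacing(squeezed)
```

Under option A, neighbouring tokens see a relative rotation roughly half of what they saw in pre-training, so a per-axis factor near 2 (exactly 63/31 in this example) would restore the original spacing. Since each axis is rotated independently in the axial scheme, this sketch only covers the per-axis part of the question.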

I may be completely off and need to understand the logic better. If that's the case, I would appreciate any help!

tasansal, Aug 26 '24 19:08