torchscale
Question regarding the configuration of decoder_retention_heads
Thank you for your great work!
I've noticed that decoder_retention_heads is set to 3 by default, and the mask is expanded to three dimensions to match. Have you experimented with how performance differs across different numbers of heads? Is this configuration sufficient in terms of retention performance? Since your model is primarily used for sequence modeling in language processing, I am looking to extend its application to image processing, and I'm unsure whether I should modify this aspect.
Thank you in advance for your response.
When I was adjusting the configuration of RetNet I also ran into this issue. Could you add an assertion that decoder_embed_dim and decoder_value_embed_dim must be multiples of decoder_retention_heads?
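For illustration, a minimal sketch of the requested check (the function name is made up; the parameter names follow the config fields discussed in this thread):

```python
def validate_retnet_config(decoder_embed_dim: int,
                           decoder_value_embed_dim: int,
                           decoder_retention_heads: int) -> None:
    """Assert that both embedding widths divide evenly across retention heads."""
    assert decoder_embed_dim % decoder_retention_heads == 0, (
        f"decoder_embed_dim ({decoder_embed_dim}) must be a multiple of "
        f"decoder_retention_heads ({decoder_retention_heads})"
    )
    assert decoder_value_embed_dim % decoder_retention_heads == 0, (
        f"decoder_value_embed_dim ({decoder_value_embed_dim}) must be a multiple "
        f"of decoder_retention_heads ({decoder_retention_heads})"
    )

# Valid split: 768 / 4 = 192 per head, 1280 / 4 = 320 per head.
validate_retnet_config(768, 1280, 4)
```

Without such a check, an indivisible configuration only surfaces later as a shape mismatch inside the retention layers, which is much harder to trace back to the config.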
@Kratos-Wen decoder_retention_heads affects key_dim, which is recommended to be set to 256.
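In other words, the per-head key dimension is the embedding width divided by the head count, so changing decoder_retention_heads changes the head size. A short sketch of that relationship (the helper name is made up; the 256 target is the recommendation from the comment above):

```python
def per_head_key_dim(decoder_embed_dim: int, decoder_retention_heads: int) -> int:
    """Each retention head operates on an equal slice of the embedding."""
    assert decoder_embed_dim % decoder_retention_heads == 0, (
        "embedding width must divide evenly across heads"
    )
    return decoder_embed_dim // decoder_retention_heads

# A 768-dim embedding with the default 3 heads gives 256 per head,
# matching the recommended head size mentioned above.
print(per_head_key_dim(768, 3))  # 256
```

So when adapting the model (e.g. to image inputs with a different embedding width), pick decoder_retention_heads so that this quotient stays near 256.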