torchscale
Question regarding the configuration of decoder_retention_heads
Thank you for your great work!
I've noticed that decoder_retention_heads is set to 3 by default, and the mask is expanded to three dimensions to match. Have you experimented with how performance differs across different numbers of heads? Is this configuration sufficient in terms of retention performance? Since your model is primarily used for sequence modeling in language processing, I am looking to extend its application to image processing, and I'm unsure whether I should modify this aspect.
Thank you in advance for your response.
When I was adjusting the configuration of RetNet I also ran into this issue. Could you add an assertion that decoder_embed_dim and decoder_value_embed_dim must be multiples of decoder_retention_heads?
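For illustration, a minimal sketch of the requested check (the function name is made up; the parameter names follow the config fields discussed in this thread):

```python
def validate_retnet_config(decoder_embed_dim: int,
                           decoder_value_embed_dim: int,
                           decoder_retention_heads: int) -> None:
    """Assert that both embedding widths divide evenly across retention heads."""
    assert decoder_embed_dim % decoder_retention_heads == 0, (
        f"decoder_embed_dim ({decoder_embed_dim}) must be a multiple of "
        f"decoder_retention_heads ({decoder_retention_heads})"
    )
    assert decoder_value_embed_dim % decoder_retention_heads == 0, (
        f"decoder_value_embed_dim ({decoder_value_embed_dim}) must be a multiple "
        f"of decoder_retention_heads ({decoder_retention_heads})"
    )

# Valid split: 768 / 4 = 192 per head, 1280 / 4 = 320 per head.
validate_retnet_config(768, 1280, 4)
```

Without such a check, an indivisible configuration only surfaces later as a shape mismatch inside the retention layers, which is much harder to trace back to the config.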
@Kratos-Wen decoder_retention_heads affects key_dim, which is recommended to be set to 256.
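In other words, the per-head key dimension is the embedding width divided by the head count, so changing decoder_retention_heads changes the head size. A short sketch of that relationship (the helper name is made up; the 256 target is the recommendation from the comment above):

```python
def per_head_key_dim(decoder_embed_dim: int, decoder_retention_heads: int) -> int:
    """Each retention head operates on an equal slice of the embedding."""
    assert decoder_embed_dim % decoder_retention_heads == 0, (
        "embedding width must divide evenly across heads"
    )
    return decoder_embed_dim // decoder_retention_heads

# A 768-dim embedding with the default 3 heads gives 256 per head,
# matching the recommended head size mentioned above.
print(per_head_key_dim(768, 3))  # 256
```

So when adapting the model (e.g. to image inputs with a different embedding width), pick decoder_retention_heads so that this quotient stays near 256.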