
Mutable state in the `MultiHeadAttention` structure and its impact on concurrency

Open WenqingZong opened this issue 2 years ago • 3 comments

Hello Candle team,

We found that `&mut self` is used in the `MultiHeadAttention::forward()` method because it needs to update the `kv_cache`. As a result, everything built on top of the `MultiHeadAttention` structure also has to take `&mut self` for inference. There is also no function to reset the cache.

We question this design decision, as it makes running multiple concurrent inferences somewhat risky. Are there any plans to address this in the near future? A minimal sketch of how the `&mut` requirement propagates is shown below.
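To illustrate the propagation, here is a minimal sketch with hypothetical stand-in types (`Tensor`, `TextDecoder`), not the actual candle definitions:

```rust
// Hypothetical stand-in types, not the actual candle definitions.
struct Tensor;

struct MultiHeadAttention {
    kv_cache: Option<(Tensor, Tensor)>,
}

impl MultiHeadAttention {
    // forward needs &mut self because it writes to kv_cache.
    fn forward(&mut self, x: Tensor) -> Tensor {
        self.kv_cache = Some((Tensor, Tensor));
        x
    }
}

struct TextDecoder {
    attn: MultiHeadAttention,
}

impl TextDecoder {
    // ...so anything containing MultiHeadAttention is forced to take &mut self too.
    fn forward(&mut self, x: Tensor) -> Tensor {
        self.attn.forward(x)
    }
}
```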

Regards, Emotech Engineers

WenqingZong avatar Oct 27 '23 15:10 WenqingZong

Hello, so indeed we have to use `&mut` here because of the kv cache. The usual way around this when running multiple inferences with the same model is to call `clone` on the model itself. This does not duplicate the weights, as they are shared among the tensors, so the memory overhead should be small, and each cloned model can then be used independently. It turns out that whisper in particular did not have `Clone` implemented, but I just added it in #1200. Do you think this would work for your use cases?
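Roughly, the pattern looks like this. This is a simplified sketch with a stand-in `Model` type, not the actual whisper code; it only assumes that the weights sit behind shared storage (modeled here with `Arc`) while the kv cache is owned per instance:

```rust
use std::sync::Arc;
use std::thread;

// Simplified stand-in: weights are behind an Arc (candle tensors share their
// underlying storage similarly), so Clone copies the pointer, not the data;
// the kv cache is per-instance mutable state.
#[derive(Clone)]
struct Model {
    weights: Arc<Vec<f32>>,     // shared, cheap to clone
    kv_cache: Option<Vec<f32>>, // owned by each clone
}

impl Model {
    fn forward(&mut self, input: f32) -> f32 {
        // Update this clone's cache; other clones are unaffected.
        self.kv_cache.get_or_insert_with(Vec::new).push(input);
        self.weights.iter().sum::<f32>() * input
    }
}

fn main() {
    let model = Model {
        weights: Arc::new(vec![0.1, 0.2, 0.3]),
        kv_cache: None,
    };

    // One clone per concurrent request: weights stay shared, caches do not.
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let mut m = model.clone();
            thread::spawn(move || m.forward(i as f32))
        })
        .collect();

    for h in handles {
        println!("{}", h.join().unwrap());
    }
}
```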

LaurentMazare avatar Oct 27 '23 15:10 LaurentMazare

That's awesome @LaurentMazare. I'll try it and see if it works.

andrenatal avatar May 09 '24 06:05 andrenatal

@LaurentMazare, in the case of whisper on CUDA, cloning the model won't allocate the weights again in GPU memory, will it?

andrenatal avatar May 09 '24 06:05 andrenatal