
[in progress] fp16 memory optimizations

Open nwatx opened this issue 1 year ago • 5 comments

  • still need to bench performance accurately (will add a bench suite soon)

  • [x] working torch.half() / floating-point casting (see the sketch after this comment)

  • [ ] model memory optimization

  • [ ] kv cache memory optimization

  • [ ] clean up code

nwatx avatar Apr 17 '24 21:04 nwatx
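
For reference, here is a minimal sketch of the two fp16 routes the checklist refers to: casting the weights with `torch.half()` versus running activations under `torch.autocast`. The toy model and shapes below are illustrative only, not VoiceCraft's architecture:

```python
import torch

# Toy model standing in for the real one (not VoiceCraft's architecture).
model = torch.nn.TransformerEncoderLayer(
    d_model=512, nhead=8, batch_first=True
).cuda().eval()
x = torch.randn(1, 100, 512, device="cuda")

# Option A: keep fp32 weights and let autocast run matmuls in fp16.
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    out_autocast = model(x)

# Option B: cast the weights themselves to fp16, halving weight memory.
model.half()
with torch.no_grad():
    out_half = model(x.half())
```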

@nwatx Hi, does fp16 give good output and a speed-up compared to normal?

rishikksh20 avatar Apr 19 '24 09:04 rishikksh20

I haven't measured the speed-up, but from observation it seems to reduce memory consumption.

nwatx avatar Apr 19 '24 12:04 nwatx

The output seems to be of similar quality.

nwatx avatar Apr 19 '24 12:04 nwatx

I tested this and found there is no difference beyond just changing the KV_CACHE to fp16. Autocasting and the like give no benefit that I can see. I was hoping this did something I hadn't already tried, but no such luck.

On a side note, I can generate on my 2080 22 GB using the fp16 cache; previously it would OOM, but so far it has not.

Ph0rk0z avatar Apr 27 '24 12:04 Ph0rk0z
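
For concreteness, a rough sketch of what "changing the KV_CACHE to fp16" amounts to: allocate the cache buffers in `torch.float16` and cast keys/values as they are written. The shapes and layout below are assumptions, not VoiceCraft's actual cache:

```python
import torch

# Assumed cache layout; the real shapes and bookkeeping will differ.
n_layers, batch, n_heads, max_len, head_dim = 16, 1, 16, 2048, 64

# fp32 needs 4 bytes per element; fp16 halves the cache footprint.
kv_cache = torch.zeros(
    n_layers, 2, batch, n_heads, max_len, head_dim,
    dtype=torch.float16, device="cuda",
)

def write_step(layer, pos, k, v):
    # Cast incoming keys/values to the cache dtype before storing them.
    kv_cache[layer, 0, :, :, pos] = k.to(kv_cache.dtype)
    kv_cache[layer, 1, :, :, pos] = v.to(kv_cache.dtype)
```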

FlashAttention might help.

jasonppy avatar Apr 27 '24 21:04 jasonppy
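
One hedged way to try this without custom kernels: PyTorch's `scaled_dot_product_attention` can dispatch to a FlashAttention kernel when the inputs are fp16/bf16 on a supported GPU. The tensor shapes below are illustrative, not VoiceCraft's:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) in fp16, as the flash kernel expects.
q = torch.randn(1, 16, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 16, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 16, 128, 64, device="cuda", dtype=torch.float16)

# Restrict dispatch to the flash backend so the call fails loudly if that
# kernel cannot handle these inputs on this GPU.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```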

There's a vLLM Triton implementation that should work on all tensor-core cards: https://github.com/vllm-project/vllm/blob/main/vllm/attention/ops/triton_flash_attention.py

The current FlashAttention kernel only supports Ampere and newer.

I'm not sure how to wrap it around your forward passes.

Ph0rk0z avatar Apr 28 '24 11:04 Ph0rk0z
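
A small sketch of the compatibility split being described: the stock FlashAttention kernels require compute capability 8.0+ (Ampere and newer), while a Triton implementation like the vLLM one linked above can also target older tensor-core cards such as the RTX 2080 (Turing, 7.5). How either gets wrapped into the forward passes is still the open question:

```python
import torch

major, minor = torch.cuda.get_device_capability()

if (major, minor) >= (8, 0):
    print("Ampere or newer: stock FlashAttention kernels should be usable.")
elif (major, minor) >= (7, 0):
    print("Volta/Turing tensor cores: would need a Triton kernel like vLLM's.")
else:
    print("No tensor cores suitable for fp16 flash attention.")
```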

@nwatx I am getting an error while running; specifically, text_tokens.half() is causing the issue.

aashay-sarvam avatar Jul 22 '24 08:07 aashay-sarvam
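
A guess at the failure mode (not a confirmed fix): `text_tokens` holds integer ids that index an embedding table, so casting them with `.half()` turns them into floats, which `nn.Embedding` rejects. The usual pattern is to keep the ids as `torch.long` and cast only the weights and activations:

```python
import torch

emb = torch.nn.Embedding(1000, 512).cuda().half()
text_tokens = torch.randint(0, 1000, (1, 50), device="cuda")  # dtype=torch.long

# emb(text_tokens.half())   # fails: embedding indices must be an integer tensor
hidden = emb(text_tokens)    # ids stay long; output comes out in fp16
```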