
[in progress] fp16 memory optimizations

Open nwatx opened this issue 1 year ago • 5 comments

  • still need to bench performance accurately (will add a bench suite soon)

  • [x] working torch.half() / floating-point casting (see the sketch after this comment)

  • [ ] model memory optimization

  • [ ] kv cache memory optimization

  • [ ] clean up code

nwatx avatar Apr 17 '24 21:04 nwatx
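
For reference, here is a minimal sketch of the two fp16 routes the checklist refers to: casting the weights with `torch.half()` versus running activations under `torch.autocast`. The toy model and shapes below are illustrative only, not VoiceCraft's architecture:

```python
import torch

# Toy model standing in for the real one (not VoiceCraft's architecture).
model = torch.nn.TransformerEncoderLayer(
    d_model=512, nhead=8, batch_first=True
).cuda().eval()
x = torch.randn(1, 100, 512, device="cuda")

# Option A: keep fp32 weights and let autocast run matmuls in fp16.
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    out_autocast = model(x)

# Option B: cast the weights themselves to fp16, halving weight memory.
model.half()
with torch.no_grad():
    out_half = model(x.half())
```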

@nwatx Hi, does fp16 give good output and a speed-up compared to normal?

rishikksh20 avatar Apr 19 '24 09:04 rishikksh20

I haven't measured the speed-up, but from observation it seems to reduce memory consumption.

nwatx avatar Apr 19 '24 12:04 nwatx

The output seems to be of similar quality.

nwatx avatar Apr 19 '24 12:04 nwatx

I tested this and found there is no difference beyond just changing the KV_CACHE to fp16. Autocasting and the like give no benefit that I can see. I was hoping this did something I hadn't already tried, but no such luck.

On a side note, I can generate on my 2080 22 GB using the fp16 cache; previously it would OOM, but so far it has not.

Ph0rk0z avatar Apr 27 '24 12:04 Ph0rk0z
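
For concreteness, a rough sketch of what "changing the KV_CACHE to fp16" amounts to: allocate the cache buffers in `torch.float16` and cast keys/values as they are written. The shapes and layout below are assumptions, not VoiceCraft's actual cache:

```python
import torch

# Assumed cache layout; the real shapes and bookkeeping will differ.
n_layers, batch, n_heads, max_len, head_dim = 16, 1, 16, 2048, 64

# fp32 needs 4 bytes per element; fp16 halves the cache footprint.
kv_cache = torch.zeros(
    n_layers, 2, batch, n_heads, max_len, head_dim,
    dtype=torch.float16, device="cuda",
)

def write_step(layer, pos, k, v):
    # Cast incoming keys/values to the cache dtype before storing them.
    kv_cache[layer, 0, :, :, pos] = k.to(kv_cache.dtype)
    kv_cache[layer, 1, :, :, pos] = v.to(kv_cache.dtype)
```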

FlashAttention might help.

jasonppy avatar Apr 27 '24 21:04 jasonppy
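
One hedged way to try this without custom kernels: PyTorch's `scaled_dot_product_attention` can dispatch to a FlashAttention kernel when the inputs are fp16/bf16 on a supported GPU. The tensor shapes below are illustrative, not VoiceCraft's:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) in fp16, as the flash kernel expects.
q = torch.randn(1, 16, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 16, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 16, 128, 64, device="cuda", dtype=torch.float16)

# Restrict dispatch to the flash backend so the call fails loudly if that
# kernel cannot handle these inputs on this GPU.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```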

There's a vLLM Triton implementation that should work on all tensor-core cards: https://github.com/vllm-project/vllm/blob/main/vllm/attention/ops/triton_flash_attention.py

The current FlashAttention kernel only supports Ampere and newer.

I'm not sure how to wrap it around your forward passes.

Ph0rk0z avatar Apr 28 '24 11:04 Ph0rk0z
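
A small sketch of the compatibility split being described: the stock FlashAttention kernels require compute capability 8.0+ (Ampere and newer), while a Triton implementation like the vLLM one linked above can also target older tensor-core cards such as the RTX 2080 (Turing, 7.5). How either gets wrapped into the forward passes is still the open question:

```python
import torch

major, minor = torch.cuda.get_device_capability()

if (major, minor) >= (8, 0):
    print("Ampere or newer: stock FlashAttention kernels should be usable.")
elif (major, minor) >= (7, 0):
    print("Volta/Turing tensor cores: would need a Triton kernel like vLLM's.")
else:
    print("No tensor cores suitable for fp16 flash attention.")
```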

@nwatx I am getting an error while running; specifically, text_tokens.half() is causing the issue.

aashay-sarvam avatar Jul 22 '24 08:07 aashay-sarvam
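
A guess at the failure mode (not a confirmed fix): `text_tokens` holds integer ids that index an embedding table, so casting them with `.half()` turns them into floats, which `nn.Embedding` rejects. The usual pattern is to keep the ids as `torch.long` and cast only the weights and activations:

```python
import torch

emb = torch.nn.Embedding(1000, 512).cuda().half()
text_tokens = torch.randint(0, 1000, (1, 50), device="cuda")  # dtype=torch.long

# emb(text_tokens.half())   # fails: embedding indices must be an integer tensor
hidden = emb(text_tokens)    # ids stay long; output comes out in fp16
```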