VoiceCraft
[in progress] fp16 memory optimizations
- still need to bench performance accurately (will add a bench suite soon)
- [x] working torch.half() / floating point (minimal sketch below)
- [ ] model memory optimization
- [ ] kv cache memory optimization
- [ ] clean up code
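For reference, a minimal sketch of the kind of torch.half() conversion being tested here; the model below is a stand-in, not VoiceCraft's actual module:

```python
import torch
import torch.nn as nn

# Stand-in model; VoiceCraft's real architecture differs.
model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 64)).cuda()

# .half() casts float parameters and buffers to fp16;
# integer buffers and inputs are left untouched.
model = model.half().eval()

# Token IDs must stay torch.long for the embedding lookup.
text_tokens = torch.randint(0, 1000, (1, 16), device="cuda")

with torch.no_grad():
    out = model(text_tokens)  # runs in fp16
print(out.dtype)              # torch.float16
```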
@nwatx Hi, does fp16 give good output and a speedup over normal?
I haven't measured the speedup, but from observation it seems to reduce memory consumption.
The output seems to be of similar quality.
Basically I tested this and found that there is no difference beyond just changing the KV_CACHE to fp16. Using autocasting and the like gives no benefit that I can see. I was sort of hopeful this did something I had missed, but no such luck.
On a side note, with the fp16 cache I can generate on my 2080 (22 GB); previously it would OOM, but so far it has not.
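A minimal sketch of the fp16 KV-cache change described above; `store_kv` / `load_kv` are hypothetical helpers standing in for the real cache code:

```python
import torch

# Hypothetical cache helpers -- names and layout are illustrative,
# not VoiceCraft's actual code.
def store_kv(cache: dict, layer: int, k: torch.Tensor, v: torch.Tensor):
    # Storing keys/values in fp16 halves the cache's memory footprint
    # versus fp32 (2 bytes per element instead of 4).
    cache[layer] = (k.half(), v.half())

def load_kv(cache: dict, layer: int, dtype=torch.float16):
    k, v = cache[layer]
    # Cast back to the attention compute dtype if it differs.
    return k.to(dtype), v.to(dtype)

cache = {}
k = torch.randn(1, 8, 128, 64)  # (batch, heads, seq_len, head_dim)
v = torch.randn(1, 8, 128, 64)
store_kv(cache, 0, k, v)
k16, v16 = load_kv(cache, 0)
print(k16.dtype)  # torch.float16
```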
FlashAttention might help.
There's a vLLM one that should work on all tensor-core cards: https://github.com/vllm-project/vllm/blob/main/vllm/attention/ops/triton_flash_attention.py
The current FlashAttention only supports Ampere and newer.
Not sure how to wrap it around your forward passes.
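One hedged alternative to wiring in the Triton kernel by hand: PyTorch 2.x's `scaled_dot_product_attention` can dispatch to a fused FlashAttention kernel on supported GPUs (Ampere and newer), which may be easier to drop into the forwards. A sketch:

```python
import torch
import torch.nn.functional as F

# Dummy fp16 tensors shaped (batch, heads, seq_len, head_dim).
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# Force the flash backend to confirm it is actually being used;
# this context manager raises if flash cannot run on the hardware.
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 128, 64])
```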
@nwatx I am getting an error while running; specifically, text_tokens.half() is causing an issue.
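If this is the usual embedding-index failure, a likely cause, assuming `text_tokens` holds integer token IDs:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(1000, 64).half()            # fp16 embedding table
text_tokens = torch.randint(0, 1000, (1, 16))  # integer token IDs

# Wrong: .half() turns the IDs into fp16 floats, and the embedding
# lookup then fails because indices must be an integer dtype.
# emb(text_tokens.half())  # RuntimeError: expected Long indices

# Right: keep token IDs as torch.long and cast only float tensors.
out = emb(text_tokens)
print(out.dtype)  # torch.float16
```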