
Update on model development

Open ClarissaGazalaEvanthe opened this issue 2 years ago • 11 comments

Please make a simple model for this test program that can be used immediately. I'm not very good at Python; sorry to bother you.

ClarissaGazalaEvanthe avatar Jul 31 '23 03:07 ClarissaGazalaEvanthe

Hi @gwcangtip ! The first model supported is the basic 24KHz model presented by Suno in their demo. It should be available by the end of this week.

PABannier avatar Jul 31 '23 07:07 PABannier

This will be really awesome. Can't wait to use it!

planatscher avatar Aug 01 '23 12:08 planatscher

> Hi @gwcangtip ! The first model supported is the basic 24KHz model presented by Suno in their demo. It should be available by the end of this week.

When will ready-to-use models be available?

ClarissaGazalaEvanthe avatar Aug 07 '23 11:08 ClarissaGazalaEvanthe

Hi @gwcangtip ! Thanks for the interest in the repo. I'm making a quick update for anyone interested in bark.cpp. I've spent the past week cleaning the repo and making sure the implementations of the 3 encoders were right. I have yet to integrate encodec.cpp (already implemented here) into bark.cpp. I'm in the final stretch of work this week.

PABannier avatar Aug 07 '23 11:08 PABannier

Hello everyone! Quick update on the recent progress made in the last week.

All the components (the 3 encoders and the Codec model) are now implemented and working. The end-to-end pipeline works, and I obtain high-quality audio output. However, I have spotted 2 remaining bugs (one in the tokenizer, one in the fine encoder) which make the model produce nonsense for some inputs. After fixing these two bugs, we should have a first working version of bark.

Regarding the performance, the model takes 17 seconds on my MacBook Pro M2 to generate a 2-second audio clip. There are still a lot of improvements to be made in the codebase (unnecessary memory copies, for instance). Furthermore, I expect a significant improvement in speed once we support mixed precision and quantization. We have a dedicated issue (#46) to perform benchmarks, and I'll publish them in the README once the aforementioned bugs are fixed.
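As a rough illustration of why quantization helps speed: storing weights as int8 with a per-block scale (in the spirit of GGML's Q8_0 format) quarters memory traffic versus f32 at a small accuracy cost. This is a minimal sketch of the idea, not bark.cpp's actual quantization code:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// One quantized block: a single f32 scale plus int8 values,
// roughly how GGML-style block quantization stores weights.
struct QBlock {
    float scale;
    std::vector<int8_t> q;
};

QBlock quantize(const std::vector<float>& x) {
    // Pick the scale so the largest magnitude maps to +/-127.
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    QBlock b;
    b.scale = amax / 127.0f;
    for (float v : x)
        b.q.push_back(b.scale > 0 ? (int8_t)std::round(v / b.scale) : 0);
    return b;
}

std::vector<float> dequantize(const QBlock& b) {
    std::vector<float> out;
    for (int8_t q : b.q) out.push_back(q * b.scale);
    return out;
}
```

The round trip loses at most half a quantization step per value, which is usually acceptable for inference weights.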

PABannier avatar Aug 11 '23 08:08 PABannier

Thanks for all the work you've put into this, @PABannier ! I can't wait to see this evolve as it gets more efficient.

kskelm avatar Aug 11 '23 12:08 kskelm

> Regarding the performance, the model takes 17 seconds on my MacBook Pro M2 to generate a 2-second audio clip.

On a Ryzen 3600 using 6 threads, I see about 2 minutes for the "this is an audio" prompt. That's with AVX2 enabled for GGML. I tried with OpenBLAS, but that was even slower. I'm not sure why it's so slow.

Also, what needs to be done to be able to reuse the model for subsequent calls to bark_generate_audio? I can put the calls to bark_generate_audio in the loop with the already loaded model, but after 5 or so calls it crashes because it can't allocate any more memory. I'm not sure what needs to be cleaned up between calls.
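One way to avoid accumulating allocations across calls is to give each generation its own context and free it at the end of the iteration. Here is a minimal sketch of that pattern using a RAII wrapper; the `bark_context`, `bark_ctx_new`, and `bark_ctx_free` names are hypothetical stand-ins, not the actual bark.cpp API:

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for the bark.cpp calls discussed above;
// the real types and signatures may differ.
struct bark_context { /* model weights + scratch buffers */ };

bark_context* bark_ctx_new()           { return new bark_context(); }
void          bark_ctx_free(bark_context* ctx) { delete ctx; }

// unique_ptr with a custom deleter: the context is released
// automatically when it goes out of scope.
using bark_ctx_ptr = std::unique_ptr<bark_context, decltype(&bark_ctx_free)>;

int generate_n(int n) {
    int done = 0;
    for (int i = 0; i < n; ++i) {
        bark_ctx_ptr ctx(bark_ctx_new(), bark_ctx_free);
        // bark_generate_audio(ctx.get(), ...) would run here.
        ++done;
    } // scratch memory is freed at the end of every iteration
    return done;
}
```

Reloading the weights every call is wasteful, but it sidesteps the leak until the per-call scratch allocations can be reset in place.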

I've also tried with different seed values, and most of them sound terrible, or are not spoken audio at all.

jzeiber avatar Aug 12 '23 20:08 jzeiber

Hi @jzeiber ! Thanks for the info. As for the nonsense output, I have yet to fix a bug in the fine encoder. This is why we have poor output for most of the prompts.

As for memory allocation, have you tried re-creating a GGML context for each model every time you generate a prompt?

As for speed, I'm sure there are some memory leaks or unnecessary copies that I'll need to track down. But first I'm focusing on fixing the aforementioned bug in the fine encoder.

PABannier avatar Aug 12 '23 20:08 PABannier

> Hi @jzeiber ! Thanks for the info. As for the nonsense output, I have yet to fix a bug in the fine encoder. This is why we have poor output for most of the prompts.

Alright, that makes sense. It was just quite curious that seed value 0 seems to be the best across different prompts. I'm not sure what's special about that seed.

> As for memory allocation, have you tried re-creating a GGML context for each model every time you generate a prompt?

I haven't tried. I was trying to avoid having to reload the entire model each time, but if I can just recreate the model ctxs each time that should work.

> As for speed, I'm sure there are some memory leaks or unnecessary copies that I'll need to track down. But first I'm focusing on fixing the aforementioned bug in the fine encoder.

Yes, that sounds good. Get the basics done first to get good output, then improve what's there. Great work so far!

jzeiber avatar Aug 12 '23 20:08 jzeiber

Quick update, I wrote 3 unit tests comparing the output of the fine encoder against the original bark implementation.

```
./data/fine/test_fine_1.bin
run_test_on_codes : failed test
       abs_tol=0.0100, rel_tol=0.0100, abs max viol=0.0917, viol=80.0%
   TEST 1 FAILED.
./data/fine/test_fine_2.bin
run_test_on_codes : failed test
       abs_tol=0.0100, rel_tol=0.0100, abs max viol=89.0242, viol=100.0%
   TEST 2 FAILED.
./data/fine/test_fine_3.bin
run_test_on_codes : failed test
       abs_tol=0.0100, rel_tol=0.0100, abs max viol=0.1022, viol=89.4%
   TEST 3 FAILED.
```

All tests are currently failing, meaning that the fine encoder is not correctly implemented. More interestingly, the absolute difference in the logits of the fine encoder is small for some token sequences (e.g. test 1, with an abs max viol of only 0.0917). In practice, this gives noisy outputs or missing words in the generated audio. However, when the difference is large (e.g. test 2, with an abs max viol of 89.0242), the model spews out nonsense.
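The pass/fail criterion above can be sketched as follows: an element counts as a violation when it exceeds both the absolute and the relative tolerance, and the test reports the violating fraction. A minimal reimplementation under that assumption (the real test harness may differ):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Fraction of logits where the implementation disagrees with the
// reference beyond both abs_tol and rel_tol.
double violation_rate(const std::vector<float>& got,
                      const std::vector<float>& ref,
                      float abs_tol, float rel_tol) {
    int viol = 0;
    for (size_t i = 0; i < got.size(); ++i) {
        float diff = std::fabs(got[i] - ref[i]);
        // A value passes if it is within either tolerance.
        if (diff > abs_tol && diff > rel_tol * std::fabs(ref[i]))
            ++viol;
    }
    return got.empty() ? 0.0 : (double)viol / got.size();
}
```

A test then fails when the rate is nonzero, and the reported `viol=…%` is this fraction.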

After investigation, the bug is in the non-causal self-attention block. Although the queries and keys are identical, KQ is completely different from `q @ k.transpose(-2, -1)` and full of near-zero values. This is strange: I've checked the dimensions, the strides (making the key and query tensors contiguous did not change anything) and, obviously, the coefficients, as stated previously.

Pinging @jzeiber @Green-Sky @kskelm @jmtatsch as they are following the updates on the model development.

PABannier avatar Aug 14 '23 18:08 PABannier

For those interested in Bark,

We now have a first working stable version of bark.cpp that supports quantization with #139! Make sure to pull the latest versions of Encodec and Bark by following the instructions.

Feel free to send me any feedback :)

PABannier avatar Apr 10 '24 13:04 PABannier