13 second bypass limit demo

Open 7gxycn08 opened this issue 1 year ago • 27 comments

Added Demo to README on how to use large prompts and generate > 13 second audio files.

7gxycn08 avatar Apr 24 '23 06:04 7gxycn08

You can also split the input string on the full stop (.) instead of by word count. This makes the generated audio clearer and should remove the audio cutouts in the final output.

You can achieve that by setting words = long_string.split(".") and removing the for loop underneath:

for i in range(0, len(words), 10):
    text_prompt = " ".join(words[i:i+10])
    text_prompts.append(text_prompt)

and replacing it with this for loop instead:

for sentence in words:
    # Append the sentence to the text_prompts list if it's not empty
    if sentence:
        text_prompts.append(sentence + ".")
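
Put together, the full-stop split might look like the sketch below. The function name split_on_full_stop and the sample sentences are placeholders and not part of the demo script. The sketch also strips stray whitespace, so the empty fragment after the final full stop is not passed on as a prompt.

```python
def split_on_full_stop(long_string):
    """Split text into sentence-like prompts at each full stop."""
    text_prompts = []
    for sentence in long_string.split("."):
        sentence = sentence.strip()
        # Skip empty fragments, e.g. the one after the final full stop
        if sentence:
            text_prompts.append(sentence + ".")
    return text_prompts


prompts = split_on_full_stop("Bark is a text-to-audio model. It can generate speech. ")
print(prompts)
```

Note that a plain split on "." also cuts after abbreviations such as "Mr.", so this is only a rough first pass.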

7gxycn08 avatar Apr 24 '23 09:04 7gxycn08

If you don't mind the extra dependency, I'd recommend using NLTK to split the input. Using just a period will cause issues with abbreviations and such.

wsippel avatar Apr 24 '23 09:04 wsippel

Just tried NLTK out. It does split the string beautifully with words = nltk.sent_tokenize(long_string)

7gxycn08 avatar Apr 24 '23 10:04 7gxycn08

if i remember correctly nltk makes mistakes around things like "I spoke with Mr. Smith, and he...". Thanks for the PR, super open to incorporating longer predictions. atm would love to first learn how to best do this. i believe there are two main things that make it non-trivial beyond the sentence splitting:

  1. speaker/audio coherence. it might either be best to keep the same original history_prompt such as en_speaker_1, or it might be best to use the previously generated output at each step. or maybe there is a magic third option of some clever mixing/concatting of the 2

  2. potentially doing multiple generations per step (maybe even with auto selection of the best one). sometimes a generation goes off the rails or the model invents a new speaker even with the same prompt. especially when you generate many minutes of audio it would be a shame if some of the clips in between are not great. so we should find a good way to keep track of generations

would love thoughts toward those 2 goals
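
One possible shape for the second point, as a rough sketch only: generate several takes per step, score each one, and keep every take so that nothing is lost. The names generate_best_of, generate_fn and score_fn are hypothetical, and how to score a clip (for example speaker similarity or a clipping check) is left open here.

```python
def generate_best_of(generate_fn, score_fn, prompt, n_candidates=3):
    """Generate several candidates for one prompt and pick the highest-scoring one.

    generate_fn and score_fn are placeholders supplied by the caller:
    generate_fn(prompt) returns one generation, score_fn(generation) returns a number.
    """
    candidates = [generate_fn(prompt) for _ in range(n_candidates)]
    scores = [score_fn(candidate) for candidate in candidates]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    # Return all takes and scores too, so rejected clips can still be inspected
    return candidates[best_index], candidates, scores


# Hypothetical usage with Bark (not run here; my_score_fn is an assumption):
# from bark import generate_audio
# best, takes, scores = generate_best_of(
#     lambda p: generate_audio(p, history_prompt="en_speaker_1"),
#     my_score_fn,
#     "Hello there.",
# )
```

Passing the generator and the scorer in as functions keeps the selection logic independent of how the audio is produced.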

gkucsko avatar Apr 24 '23 12:04 gkucsko

especially when you generate many minutes of audio it would be a shame if some of the clips in between are not great. so we should find a good way to keep track of generations

For this part of the problem you can use a for loop at the end of the script. This might not be ideal though.

# Write each audio array to its own file; this produces one WAV file per split prompt in the list.
for i, audio_array in enumerate(audio_arrays):
    write_wav(f"audio_{i}.wav", SAMPLE_RATE, audio_array)

7gxycn08 avatar Apr 24 '23 17:04 7gxycn08

I've been tinkering with this and have unlimited tokens now. Coherency is lost between sentences, but I was able to fix this by feeding the full output back into generate_audio. Here is my code if it's helpful:

import os

import nltk
import numpy as np
from bark.api import generate_audio, save_as_prompt
from bark.generation import SAMPLE_RATE

cwd = os.getcwd()


def text_to_audio(text, text_temp, waveform_temp, history_prompt):
    if history_prompt == "Unconditional":
        history_prompt = None

    # segment the text into sentences
    text_prompts_list = nltk.sent_tokenize(text)

    # generate audio sentence by sentence
    audio_arrays = np.array([])
    for i, prompt in enumerate(text_prompts_list):
        full_generation, audio_array = generate_audio(prompt,
                                     history_prompt,
                                     text_temp,
                                     waveform_temp,
                                     output_full=True)
        audio_arrays = np.concatenate((audio_arrays, audio_array))

        # save the full generation and use it as the history prompt for the next sentence
        history_prompt = os.path.join(cwd, f"bark/assets/userprompts/{i}.npz")
        save_as_prompt(history_prompt, full_generation)

    # return the sample rate and the combined audio array
    return SAMPLE_RATE, audio_arrays

fidofetch avatar Apr 25 '23 02:04 fidofetch

You can detect changes in the speaker's voice using this for loop, then make adjustments to the model or audio settings. For now it just warns the user that there are voice variations. Hopefully this is useful.

history_prompt = "en_speaker_1"
audio_arrays = []
prev_audio = None  # there is no previous clip before the first prompt

for prompt in text_prompts_list:
    audio_array = generate_audio(prompt, history_prompt=history_prompt)
    if prev_audio is not None:
        # compare the overlapping part of this clip with the previous one
        min_length = min(len(audio_array), len(prev_audio))
        audio_diff = np.abs(audio_array[:min_length] - prev_audio[:min_length])
        if np.max(audio_diff) > 0.1:
            diff_mean = np.mean(audio_diff)
            diff_std = np.std(audio_diff)
            print(f"WARNING: Abrupt change in speaker's voice detected. Mean diff = {diff_mean:.4f}, Std diff = {diff_std:.4f}")
    audio_arrays.append(audio_array)
    prev_audio = audio_array

7gxycn08 avatar Apr 25 '23 08:04 7gxycn08

I've experimented a bit. Since I'm using Bark as an output for LLMs, I often run into very short sentences. Not great for stability if we tokenize by sentence using NLTK. I implemented a (probably godawful) counter that joins sentences up to a maximum token length (300 can be too much, 250 seems relatively safe), and got pretty decent results:

sentences = nltk.sent_tokenize(string)
chunks = ['']
token_counter = 0
for sentence in sentences:
    # note: nltk.Text() on a raw string wraps its characters, so this length is a character count
    current_tokens = len(nltk.Text(sentence))
    if token_counter + current_tokens <= 250:
        token_counter = token_counter + current_tokens
        # strip() avoids a leading space on the first chunk
        chunks[-1] = (chunks[-1] + " " + sentence).strip()
    else:
        chunks.append(sentence)
        token_counter = current_tokens
wsippel avatar Apr 25 '23 13:04 wsippel

Why does Bark have this 13-second limit?

felipelalli avatar Apr 26 '23 06:04 felipelalli

Oh, neat! I have been doing essentially the same thing as wsippel's sentence-joining counter above, except that I count syllables. About 40 syllables seems to be a good spot, I've found.

The steps that I take are:

  1. split the text into 'phrases' that end in pause characters. For example "Hello! I'm tired, I need to sleep - uh - soon." will turn into ["Hello! ", "I'm tired, ", "I need to sleep - ", "uh - ", "soon." ]
  2. Estimate the number of syllables in each part. [2, 4, 6, 1, 2]
  3. Recombine the parts into 'sentences' that do not exceed some maximum number of syllables. ["Hello! I'm tired,", "I need to sleep - ", "uh - soon."]
  4. Pass those parts to bark.

The 'magic trick' in my poking about seems to be segmenting the text so that long-pause characters (e.g. . , - ...) fall at the end of segments. This is useful because if they get cut off or run too long, it's not too noticeable.
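
The four steps above can be sketched roughly as follows. This is not the exact code I use: the function names, the set of pause characters and the vowel-group syllable estimate are all assumptions for illustration, and the crude estimate gives slightly different counts than the ones listed in step 2.

```python
import re

# Pause characters that may end a phrase (an assumption; adjust to taste)
PHRASE_PATTERN = re.compile(r"\S.*?(?:[.,!?;:-]+(?:\s+|$)|$)", re.DOTALL)


def split_phrases(text):
    """Split text into phrases that end in pause characters plus whitespace."""
    return PHRASE_PATTERN.findall(text)


def estimate_syllables(phrase):
    """Rough estimate: count vowel groups per word, at least one per word."""
    words = re.findall(r"[a-z']+", phrase.lower())
    return sum(max(1, len(re.findall(r"[aeiouy]+", word))) for word in words)


def combine_phrases(phrases, max_syllables=40):
    """Greedily join phrases into segments of at most max_syllables each."""
    segments = []
    current, count = "", 0
    for phrase in phrases:
        n = estimate_syllables(phrase)
        # Start a new segment when adding this phrase would exceed the limit
        if current and count + n > max_syllables:
            segments.append(current.strip())
            current, count = "", 0
        current += phrase
        count += n
    if current:
        segments.append(current.strip())
    return segments


text = "Hello! I'm tired, I need to sleep - uh - soon."
segments = combine_phrases(split_phrases(text), max_syllables=6)
print(segments)
```

A single phrase that is longer than the limit is kept as its own segment rather than being cut mid-phrase, so every segment still ends on a pause character.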

tarpeyd12 avatar Apr 26 '23 08:04 tarpeyd12

Instead of this:

text_prompts_list = nltk.sent_tokenize(long_string)

Can I use this? Will it work the same?

textwrap.wrap( long_string, width=300, replace_whitespace=False, break_long_words=False, break_on_hyphens=False)
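
For comparison, here is a small sketch of what textwrap.wrap does with these arguments. The sample text and the width of 40 are made up to keep the output short. textwrap.wrap breaks at whitespace once the character width is reached and has no notion of sentences, so chunks can end mid-sentence.

```python
import textwrap

text = (
    "Bark can generate speech. It can also produce music and sound effects. "
    "Long prompts need to be split."
)

chunks = textwrap.wrap(
    text,
    width=40,  # small width so the effect is visible on a short text
    replace_whitespace=False,
    break_long_words=False,
    break_on_hyphens=False,
)

# Each chunk is at most 40 characters long, regardless of sentence boundaries
for chunk in chunks:
    print(repr(chunk))
```

So the call runs, but the split points are driven by length alone, unlike nltk.sent_tokenize, which splits at sentence boundaries.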

kun-create avatar Apr 26 '23 09:04 kun-create

I stitched wsippel's method together and, surprisingly, got decent results.

You can test it out as is, or use different methods and compare them. Here is the complete demo code:

from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
import numpy as np
import nltk

nltk.download('punkt')
preload_models()

long_string = """
Bark is a transformer-based text-to-audio model created by [Suno](https://suno.ai/). Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying. To support the research community, we are providing access to pretrained model checkpoints ready for inference.
"""
sentences = nltk.sent_tokenize(long_string)

# SAMPLE_RATE is imported from bark so the WAV file matches the generated audio
HISTORY_PROMPT = "en_speaker_6"

chunks = ['']
token_counter = 0

for sentence in sentences:
    # note: nltk.Text() on a raw string wraps its characters, so this length is a character count
    current_tokens = len(nltk.Text(sentence))
    if token_counter + current_tokens <= 250:
        token_counter = token_counter + current_tokens
        # strip() avoids a leading space on the first chunk
        chunks[-1] = (chunks[-1] + " " + sentence).strip()
    else:
        chunks.append(sentence)
        token_counter = current_tokens

# Generate audio for each prompt
audio_arrays = []
for prompt in chunks:
    audio_array = generate_audio(prompt, history_prompt=HISTORY_PROMPT)
    audio_arrays.append(audio_array)

# Combine the audio arrays
combined_audio = np.concatenate(audio_arrays)

# Write the combined audio to a file
write_wav("combined_audio.wav", SAMPLE_RATE, combined_audio)

7gxycn08 avatar Apr 26 '23 13:04 7gxycn08

Let me ask you guys: does nltk.sent_tokenize work only for English text, or does it also work fine for Portuguese, for example?

felipelalli avatar Apr 27 '23 01:04 felipelalli

@7gxycn08 Are you aware of this fork: https://github.com/JonathanFly/bark ? It would be nice to insert your impl into that fork.

felipelalli avatar Apr 27 '23 17:04 felipelalli

Why re-invent and add extra dependencies if it is something which is already used successfully somewhere else? For example here is a snippet from TortoiseTTS doing this: https://github.com/neonbjb/tortoise-tts/blob/main/tortoise/utils/text.py

This snippet is already being used in 2 forks, one of them is mine so I might be biased 😉 : https://github.com/C0untFloyd/bark-gui https://github.com/serp-ai/bark-with-voice-clone

C0untFloyd avatar Apr 29 '23 10:04 C0untFloyd

We should incorporate the best stuff into the main repo or we will end up with 15 good but separate things. @gkucsko

[image: standards_2x]

felipelalli avatar Apr 29 '23 18:04 felipelalli

Why re-invent and add extra dependencies if it is something which is already used successfully somewhere else? For example here is a snippet from TortoiseTTS doing this: https://github.com/neonbjb/tortoise-tts/blob/main/tortoise/utils/text.py

This snippet is already being used in 2 forks, one of them is mine so I might be biased 😉 : https://github.com/C0untFloyd/bark-gui https://github.com/serp-ai/bark-with-voice-clone

Works great. It runs very nicely on a GTX 970 and takes about 2 min to generate 30 sec of audio on default settings. I will add it to my installer: https://github.com/diStyApps/seait

diStyApps avatar Apr 29 '23 21:04 diStyApps

What is this "Bark GUI" thing? Is it possible to generate the "final command line" after playing with the GUI, to use in batch mode later?

felipelalli avatar Apr 29 '23 22:04 felipelalli

It is Bark - with a Web GUI & some additions wrapped around it 😃

You don't need to type commandline args, just use the GUI. Audio parts of long sentences will be joined together at the end and are saveable/playable from the GUI as well.

C0untFloyd avatar Apr 30 '23 09:04 C0untFloyd

Very nice work putting this together! I just really love this open source community & wanted to say Thank you for sharing :)

appatalks avatar Apr 30 '23 14:04 appatalks

You don't need to type commandline args, just use the GUI. Audio parts of long sentences will be joined together at the end and are saveable/playable from the GUI as well.

But I need to run Bark in batch mode (using a script). It would be great if I could set everything up in the GUI and then have the GUI provide me with the corresponding command line. I hope that makes sense now?

felipelalli avatar May 01 '23 02:05 felipelalli

This is on the list of things I want to add in my fork. It's a bit of a mess so probably not super soon, but I find myself wanting to do this too quite often.

JonathanFly avatar May 09 '23 13:05 JonathanFly

Regarding 7gxycn08's complete demo code above: I believe that this breaks the non-speech cues you add to a text, such as [laughs].

Ishino avatar May 14 '23 19:05 Ishino

You need to add the following code snippet:

nltk.download('punkt')

qyqcswill avatar Aug 16 '23 13:08 qyqcswill

Added Demo to README on how to use large prompts and generate > 13 second audio files.

I tested your version of the bypass and found that the prompts such as [happy] and [sad] cause an interpretation issue where the API no longer parses them correctly or flat out makes a big noise in place of the whole sentence.

Alnounou-UHCL avatar Sep 21 '23 15:09 Alnounou-UHCL

Added Demo to README on how to use large prompts and generate > 13 second audio files.

I tested your version of the bypass and found that the prompts such as [happy] and [sad] cause an interpretation issue where the API no longer parses them correctly or flat out makes a big noise in place of the whole sentence.

That may not be specific to that code, but to Bark in general.

Usually a generic tag like [laugh] works on a decent set of voices, but things like [happy] are much less likely to work and can wreck the output of the entire clip. If you need to make use of text tags that specific, your best bet is to generate random voices using the tags and sift through the outputs to find the few rare ones that work.

JonathanFly avatar Sep 21 '23 16:09 JonathanFly

@7gxycn08 Regarding your loop above for detecting changes in the speaker's voice: I'm implementing Bark with Vocos (see: https://github.com/gemelo-ai/vocos/blob/main/notebooks%2FBark%2BVocos.ipynb)

which means that I'm using the semantic_to_audio_tokens function instead of generate_audio. Can you recommend a way to implement your coherency-validating pattern, using the coarse_prompt or fine_prompt instead of audio_array?

Here's the code for that function.

from typing import Optional, Union, Dict

import numpy as np
from bark.generation import generate_coarse, generate_fine


def semantic_to_audio_tokens(
    semantic_tokens: np.ndarray,
    history_prompt: Optional[Union[Dict, str]] = None,
    temp: float = 0.7,
    silent: bool = False,
    output_full: bool = False,
):
    coarse_tokens = generate_coarse(
        semantic_tokens, history_prompt=history_prompt, temp=temp, silent=silent, use_kv_caching=True
    )
    fine_tokens = generate_fine(coarse_tokens, history_prompt=history_prompt, temp=0.5)

    if output_full:
        full_generation = {
            "semantic_prompt": semantic_tokens,
            "coarse_prompt": coarse_tokens,
            "fine_prompt": fine_tokens,
        }
        return full_generation
    return fine_tokens

platform-kit avatar Dec 05 '23 21:12 platform-kit