bark
13-second limit bypass demo
Added a demo to the README on how to use large prompts and generate >13-second audio files.
You can also split the input string on the full stop (.) instead of splitting by word count. This makes the generated audio clearer and should remove the audio cutouts in the final output.
You can achieve that by setting words = long_string.split(".") and removing the for loop underneath:
for i in range(0, len(words), 10):
    text_prompt = " ".join(words[i:i+10])
    text_prompts.append(text_prompt)
and replacing it with this for loop instead:
for sentence in words:
    # Append the sentence to the text_prompts list if it's not empty
    if sentence:
        text_prompts.append(sentence + ".")
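For illustration, a minimal run of that loop (the sample long_string here is made up):

long_string = "Bark can generate speech. It can also produce music. And simple sound effects."

words = long_string.split(".")
text_prompts = []
for sentence in words:
    if sentence:
        text_prompts.append(sentence + ".")

print(text_prompts)
# ['Bark can generate speech.', ' It can also produce music.', ' And simple sound effects.']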
If you don't mind the extra dependency, I'd recommend using NLTK to split the input. Using just a period will cause issues with abbreviations and such.
Just tried it out. It does split the string beautifully with words = nltk.sent_tokenize(long_string).
If I remember correctly, NLTK makes mistakes around things like "I spoke with Mr. Smith, and he..."
Thanks for the PR, super open to incorporating longer predictions. Atm would love to first learn how to best do this. I believe there are two main things that make it non-trivial beyond the sentence splitting:
- speaker/audio coherence. It might either be best to keep the same original history_prompt such as en_speaker_1, or it might be best to use the previously generated output at each step. Or maybe there is a magic third option of some clever mixing/concatting of the two.
- potentially doing multiple generations per step (maybe even with auto-selection of the best one). Sometimes a generation goes off the rails or the model invents a new speaker even with the same prompt. Especially when you generate many minutes of audio it would be a shame if some of the clips in between are not great, so we should find a good way to keep track of generations.
would love thoughts toward those 2 goals
especially when you generate many minutes of audio it would be a shame if some of the clips in between are not great. so we should find a good way to keep track of generations
For this part of the problem you can use a for loop at the end of the script. This might not be ideal though.
# Write each audio array to its own file; this generates one wav file per split prompt in the list.
for i, audio_array in enumerate(audio_arrays):
    write_wav(f"audio_{i}.wav", SAMPLE_RATE, audio_array)
I've been tinkering with this and have unlimited tokens now. Coherency is lost between sentences; I was able to fix this by feeding the full output back into generate_audio. Here is my code, if it's helpful:
import os

import nltk
import numpy as np
from bark import SAMPLE_RATE, generate_audio, save_as_prompt

cwd = os.getcwd()  # prompts are saved under bark/assets/userprompts/, which must exist

def text_to_audio(text, text_temp, waveform_temp, history_prompt):
    if history_prompt == "Unconditional":
        history_prompt = None
    # Segment the text into sentences
    text_prompts_list = nltk.sent_tokenize(text)
    # Generate audio sentence by sentence, feeding each full generation back in as the next history prompt
    audio_arrays = np.array([])
    for i, prompt in enumerate(text_prompts_list):
        full_generation, audio_array = generate_audio(prompt,
                                                      history_prompt,
                                                      text_temp,
                                                      waveform_temp,
                                                      output_full=True)
        audio_arrays = np.concatenate((audio_arrays, audio_array))
        save_as_prompt(os.path.join(cwd, f"bark/assets/userprompts/{i}.npz"), full_generation)
        history_prompt = os.path.join(cwd, f"bark/assets/userprompts/{i}.npz")
    # Return the sample rate and the concatenated audio
    return SAMPLE_RATE, audio_arrays
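For context, a minimal usage sketch of the function above (assumes long_string holds your input text; the speaker name and output filename are just examples):

from scipy.io.wavfile import write as write_wav

sample_rate, audio = text_to_audio(long_string,
                                   text_temp=0.7,
                                   waveform_temp=0.7,
                                   history_prompt="en_speaker_1")
write_wav("long_output.wav", sample_rate, audio)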
You can detect changes in the speaker's voice using this for loop, then make adjustments to the model or audio settings. For now it just warns the user that there are voice variations; hopefully this is useful.
history_prompt = "en_speaker_1"
prev_audio = None
audio_arrays = []
for prompt in text_prompts_list:
    audio_array = generate_audio(prompt, history_prompt=history_prompt)
    if prev_audio is not None:
        min_length = min(len(audio_array), len(prev_audio))
        audio_diff = np.abs(audio_array[:min_length] - prev_audio[:min_length])
        if np.max(audio_diff) > 0.1:
            diff_mean = np.mean(audio_diff)
            diff_std = np.std(audio_diff)
            print(f"WARNING: Abrupt change in speaker's voice detected. Mean diff = {diff_mean:.4f}, Std diff = {diff_std:.4f}")
    audio_arrays.append(audio_array)
    prev_audio = audio_array
I've experimented a bit. Since I'm using Bark as an output for LLMs, I often run into very short sentences. Not great for stability if we tokenize by sentence using NLTK. I implemented a (probably godawful) counter that joins sentences up to a maximum token length (300 can be too much, 250 seems relatively safe), and got pretty decent results:
sentences = nltk.sent_tokenize(string)
chunks = ['']
token_counter = 0
for sentence in sentences:
    # Note: len(nltk.Text(sentence)) on a raw string counts characters, so this is effectively a character budget
    current_tokens = len(nltk.Text(sentence))
    if token_counter + current_tokens <= 250:
        token_counter = token_counter + current_tokens
        chunks[-1] = chunks[-1] + " " + sentence
    else:
        chunks.append(sentence)
        token_counter = current_tokens
Why does Bark have this 13-second limit?
I've experimented a bit. Since I'm using Bark as an output for LLMs, I often run into very short sentences. Not great for stability if we tokenize by sentence using NLTK. I implemented a (probably godawful) counter that joins sentences up to a maximum token length (300 can be too much, 250 seems relatively safe), and got pretty decent results:
Oh, neat! I have been doing the same thing essentially, except I have been using syllables. About 40 syllables seems to be a good spot, I've found.
The steps that I take are:
- Split the text into 'phrases' that end in pause characters. For example, "Hello! I'm tired, I need to sleep - uh - soon." will turn into ["Hello! ", "I'm tired, ", "I need to sleep - ", "uh - ", "soon."].
- Estimate the number of syllables in each part: [2, 4, 6, 1, 2].
- Recombine the parts into 'sentences' that do not exceed some maximum number of syllables: ["Hello! I'm tired,", "I need to sleep - ", "uh - soon."].
- Pass those parts to bark.
The 'magic trick' in my poking about seems to be segmenting sentences in such a way that long pause characters (e.g. ".", ",", "-", "...", etc.) land at the end of segments. This is useful since if they get cut off or are too long it's not too noticeable.
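A rough sketch of that idea, assuming a regex-based phrase split and a crude vowel-group syllable estimate (the function name, regex, and syllable budget are placeholders for what's described above, not the commenter's actual code):

import re

def split_on_pauses(text, max_syllables=40):
    # 1. Split into phrases that end in pause characters (., !, ?, ,, -)
    phrases = re.findall(r"[^.!?,\-]+[.!?,\-]+\s*|[^.!?,\-]+$", text)
    # 2. Crude syllable estimate: count vowel groups in each phrase
    def syllables(phrase):
        return max(1, len(re.findall(r"[aeiouy]+", phrase.lower())))
    # 3. Recombine phrases into segments that stay under the syllable budget,
    #    so each segment still ends on a pause character
    segments, current, count = [], "", 0
    for phrase in phrases:
        s = syllables(phrase)
        if current and count + s > max_syllables:
            segments.append(current.strip())
            current, count = "", 0
        current += phrase
        count += s
    if current.strip():
        segments.append(current.strip())
    return segments

print(split_on_pauses("Hello! I'm tired, I need to sleep - uh - soon.", max_syllables=10))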
Instead of this:
text_prompts_list = nltk.sent_tokenize(long_string)
can I use this? Will it work the same?
textwrap.wrap(long_string, width=300, replace_whitespace=False, break_long_words=False, break_on_hyphens=False)
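For what it's worth, the two behave differently: textwrap.wrap only packs whole words up to a character width and ignores sentence boundaries, while sent_tokenize splits on sentences. A small comparison sketch (the sample text is made up; assumes punkt is already downloaded):

import textwrap
import nltk

text = "Dr. Smith arrived late. He was tired. We talked about the project for quite a while."

# textwrap.wrap can end a chunk mid-sentence, since it only respects the width limit
print(textwrap.wrap(text, width=40, replace_whitespace=False,
                    break_long_words=False, break_on_hyphens=False))

# nltk.sent_tokenize keeps sentences intact (and handles abbreviations like "Dr.")
print(nltk.sent_tokenize(text))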
I stitched wsippel's method together and surprisingly got decent results.
You can test it out as is, or use different methods and compare the two. Here is the complete demo code.
from bark import generate_audio, preload_models, SAMPLE_RATE  # Bark generates audio at 24 kHz
from scipy.io.wavfile import write as write_wav
import numpy as np
import nltk

nltk.download('punkt')
preload_models()

long_string = """
Bark is a transformer-based text-to-audio model created by [Suno](https://suno.ai/). Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying. To support the research community, we are providing access to pretrained model checkpoints ready for inference.
"""

sentences = nltk.sent_tokenize(long_string)

HISTORY_PROMPT = "en_speaker_6"

# Join sentences into chunks that stay under the length budget
chunks = ['']
token_counter = 0
for sentence in sentences:
    current_tokens = len(nltk.Text(sentence))
    if token_counter + current_tokens <= 250:
        token_counter = token_counter + current_tokens
        chunks[-1] = chunks[-1] + " " + sentence
    else:
        chunks.append(sentence)
        token_counter = current_tokens

# Generate audio for each chunk
audio_arrays = []
for prompt in chunks:
    audio_array = generate_audio(prompt, history_prompt=HISTORY_PROMPT)
    audio_arrays.append(audio_array)

# Combine the audio arrays
combined_audio = np.concatenate(audio_arrays)

# Write the combined audio to a file
write_wav("combined_audio.wav", SAMPLE_RATE, combined_audio)
Let me ask you guys: does nltk.sent_tokenize work only for English, or does it also work fine for Portuguese, for example?
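For what it's worth, the punkt models ship for several languages, including Portuguese, and sent_tokenize takes a language argument. A tiny sketch (the sample sentence is made up):

import nltk
nltk.download('punkt')

texto = "Olá! Tudo bem? O Bark gera áudio a partir de texto."
print(nltk.sent_tokenize(texto, language='portuguese'))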
@7gxycn08 Are you aware of this fork: https://github.com/JonathanFly/bark ? It would be nice to insert your impl into that fork.
Why re-invent and add extra dependencies if it is something that is already used successfully somewhere else? For example, here is a snippet from TortoiseTTS doing this: https://github.com/neonbjb/tortoise-tts/blob/main/tortoise/utils/text.py
This snippet is already being used in 2 forks, one of them is mine so I might be biased 😉 : https://github.com/C0untFloyd/bark-gui https://github.com/serp-ai/bark-with-voice-clone
We should incorporate the best stuff into the main repo or we will end up with 15 good but separate things. @gkucsko
Why re-invent and add extra dependencies if it is something that is already used successfully somewhere else? For example, here is a snippet from TortoiseTTS doing this: https://github.com/neonbjb/tortoise-tts/blob/main/tortoise/utils/text.py
This snippet is already being used in 2 forks, one of them is mine so I might be biased 😉 : https://github.com/C0untFloyd/bark-gui https://github.com/serp-ai/bark-with-voice-clone
Works great, runs very nicely on a GTX 970; takes about 2 min to generate 30 sec of audio on default settings.
I will add it to my installer.
https://github.com/diStyApps/seait
What is this "Bark GUI" thing? Is it possible to generate the "final command line" after play on the GUI to use in batch mode later?
What is this "Bark GUI" thing? Is it possible to generate the "final command line" after play on the GUI to use in batch mode later?
It is Bark - with a Web GUI & some additions wrapped around it 😃
You don't need to type commandline args, just use the GUI. Audio parts of long sentences will be joined together at the end and are saveable/playable from the GUI as well.
It is Bark - with a Web GUI & some additions wrapped around it 😃
You don't need to type commandline args, just use the GUI. Audio parts of long sentences will be joined together at the end and are saveable/playable from the GUI as well.
Very nice work putting this together! I just really love this open source community & wanted to say Thank you for sharing :)
You don't need to type commandline args, just use the GUI. Audio parts of long sentences will be joined together at the end and are saveable/playable from the GUI as well.
But I need to run Bark in batch mode (using a script). It would be great if I could set everything up in the GUI and then have the GUI provide me with the corresponding command line. I hope that makes sense now?
This is on the list of things I want to add to the fork. It's a bit of a mess so probably not super soon, but I find myself wanting to do this too quite often.
I stitched wsippel's method together and surprisingly got decent results.
You can test it out as is, or use different methods and compare the two. Here is the complete demo code.
I believe that this breaks the non-speech cues you add to the text, like [laughs].
You need to add the following code snippet:
nltk.download('punkt')
Added a demo to the README on how to use large prompts and generate >13-second audio files.
I tested your version of the bypass and found that prompts such as [happy] and [sad] cause an interpretation issue where the API no longer parses them correctly, or flat out produces a loud noise in place of the whole sentence.
I tested your version of the bypass and found that prompts such as [happy] and [sad] cause an interpretation issue where the API no longer parses them correctly, or flat out produces a loud noise in place of the whole sentence.
That may not be specific to that code, but to Bark in general.
Usually a generic tag like [laugh] works on a decent set of voices, but things like [happy] are much less likely to work and can wreck the output of the entire clip. If you need to make use of text tags that specific, your best bet is to generate random voices using the tags and sift through the outputs to find the few rare ones that handle them.
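A rough sketch of that sifting workflow (the probe text and file names here are just examples):

from bark import generate_audio, save_as_prompt, SAMPLE_RATE
from scipy.io.wavfile import write as write_wav

probe_text = "[laughs] Well, that is certainly one way to do it."
for i in range(10):
    # history_prompt=None samples a new random voice each time
    full_generation, audio_array = generate_audio(probe_text, history_prompt=None, output_full=True)
    write_wav(f"candidate_{i}.wav", SAMPLE_RATE, audio_array)
    # Keep the .npz prompts whose audio renders the tag well and reuse them as history_prompt later
    save_as_prompt(f"candidate_{i}.npz", full_generation)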
You can detect changes in the speaker's voice using this for loop, then make adjustments to the model or audio settings. For now it just warns the user that there are voice variations; hopefully this is useful.
@7gxycn08 I'm implementing Bark with Vocos (see: https://github.com/gemelo-ai/vocos/blob/main/notebooks%2FBark%2BVocos.ipynb), which means that I'm using the semantic_to_audio_tokens function instead of generate_audio. Can you recommend a way to implement your coherency-validating pattern using the coarse_prompt or fine_prompt instead of audio_array?
Here's the code for that function.
from typing import Optional, Union, Dict

import numpy as np

from bark.generation import generate_coarse, generate_fine


def semantic_to_audio_tokens(
    semantic_tokens: np.ndarray,
    history_prompt: Optional[Union[Dict, str]] = None,
    temp: float = 0.7,
    silent: bool = False,
    output_full: bool = False,
):
    coarse_tokens = generate_coarse(
        semantic_tokens, history_prompt=history_prompt, temp=temp, silent=silent, use_kv_caching=True
    )
    fine_tokens = generate_fine(coarse_tokens, history_prompt=history_prompt, temp=0.5)

    if output_full:
        full_generation = {
            "semantic_prompt": semantic_tokens,
            "coarse_prompt": coarse_tokens,
            "fine_prompt": fine_tokens,
        }
        return full_generation
    return fine_tokens
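One possible direction, untested: a sketch of how the earlier "feed the full output back in" idea might translate to this path. text_to_semantic and save_as_prompt are real bark functions, but text_prompts_list, the file names, and the speaker are placeholders, and it assumes your bark version accepts an .npz path as history_prompt, as in the earlier snippet.

from bark import text_to_semantic, save_as_prompt

history = "en_speaker_1"
fine_chunks = []
for i, prompt in enumerate(text_prompts_list):
    semantic_tokens = text_to_semantic(prompt, history_prompt=history, temp=0.7)
    full_generation = semantic_to_audio_tokens(semantic_tokens, history_prompt=history,
                                               temp=0.7, output_full=True)
    fine_chunks.append(full_generation["fine_prompt"])
    # Save the full generation and use it as the next chunk's history prompt,
    # mirroring the generate_audio-based approach earlier in the thread
    save_as_prompt(f"chunk_{i}.npz", full_generation)
    history = f"chunk_{i}.npz"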