FR: Unlimited length audio+text conditioning with generate_with_chroma()

Open moiseshorta opened this issue 1 year ago • 8 comments

Hi,

I've got a script going which takes an input audio file, crops it into 30-second chunks, passes each one consecutively to the generate_with_chroma() function, and then concatenates the results.

Even though I have tried setting torch.manual_seed() and pytorch_seed.seed() to a fixed seed, the generation changes style completely every 30 seconds. Any hints on how to pass an audio input through the network and keep the result consistent throughout?

thanks.

moiseshorta avatar Jun 12 '23 01:06 moiseshorta
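
The chunking step described above can be sketched as follows. This is a minimal illustration, not the poster's actual script: `chunk_bounds` is a hypothetical helper, and the commented usage assumes a melody tensor of shape `[channels, samples]` and the `generate_with_chroma(descriptions, melody_wavs, melody_sample_rate)` call from audiocraft's MusicGen. Note that in a loop like this each call only sees its own chunk, so a fixed seed makes each call repeatable but does not by itself link one chunk to the next.

```python
from typing import List, Tuple


def chunk_bounds(num_samples: int, sample_rate: int,
                 chunk_seconds: float = 30.0) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices for consecutive fixed-length chunks.

    The last chunk is shorter when the audio length is not a multiple of
    the chunk length.
    """
    chunk = int(chunk_seconds * sample_rate)
    if chunk <= 0:
        raise ValueError("chunk_seconds * sample_rate must be positive")
    return [(start, min(start + chunk, num_samples))
            for start in range(0, num_samples, chunk)]


# Hypothetical usage (assumes `model`, `melody`, `sr`, and `description` exist):
# pieces = []
# for start, end in chunk_bounds(melody.shape[-1], sr):
#     pieces.append(model.generate_with_chroma(
#         [description], melody[None, :, start:end], sr))
# result = torch.cat(pieces, dim=-1)
```

For a 70-second clip at 32 kHz this yields two full 30-second chunks and one 10-second remainder.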

Someone was already working on this & developed a fork. Perhaps you'll get more info from this conversation or working with them?

https://github.com/facebookresearch/audiocraft/issues/36#issuecomment-1586236702

Duemellon avatar Jun 12 '23 02:06 Duemellon

Have you tested a plain text prompt to make sure the problem is the seed? It probably isn't working, but just in case.

I haven't tried audiocraft, but in Bark I had to set a laundry list of things to get a reproducible seed. (It also cuts speed by about 50%.)

https://github.com/JonathanFly/bark/blob/90c088852d6ef3d3ff802df7b04c8b75b7c7a680/bark_infinity/api.py#L279-L330

JonathanFly avatar Jun 12 '23 02:06 JonathanFly
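
A "laundry list" of seed settings usually looks something like the sketch below. This is a generic illustration and not the code from the linked Bark fork: `set_seed` is a hypothetical helper, and NumPy and PyTorch are seeded only if they are installed. Turning off cuDNN autotuning and requesting deterministic kernels can cost speed, which may account for a slowdown like the one mentioned above.

```python
import random


def set_seed(seed: int) -> int:
    """Seed every random number generator we can find and return the seed."""
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)  # requires 0 <= seed < 2**32
    except ImportError:
        pass

    try:
        import torch
    except ImportError:
        return seed

    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    # Ask cuDNN for deterministic kernels and disable autotuning (may be slower).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    return seed
```

Calling `set_seed(1234)` immediately before each generation call is the simplest way to check whether a difference between runs comes from the seed or from something else.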

Maybe this might help. It's not working completely because the audio vanishes over time, but it may be a starting point. I could not get a complete 2-minute audio clip; after 1 minute the output degrades.

```python
import torch


def generate_long_audio(model, initial_output, prompt_sample_rate=32000, total_length=120):
    """Generates a longer audio clip by repeatedly feeding the model with the last half of its output.

    Args:
        model: the model used to generate the audio.
        initial_output (torch.Tensor): the initial audio clip.
        prompt_sample_rate (int): the sample rate used by the model.
        total_length (int): the total length of the audio clip to generate, in seconds.

    Returns:
        torch.Tensor: the generated audio clip.
    """
    # Convert total length from seconds to samples
    total_length_samples = total_length * prompt_sample_rate
    half_length_samples = 15 * prompt_sample_rate  # 15 seconds in samples

    # Initialize the output with the initial audio clip
    output = initial_output

    while output.shape[-1] < total_length_samples:
        # Calculate the index at which to split the output
        split_index = max(output.shape[-1] - half_length_samples, 0)

        # Keep everything before the last 15 seconds; use the last 15 seconds as the prompt
        first_half = output[..., :split_index]
        second_half = output[..., split_index:]

        # Generate a continuation from the second half
        continuation = model.generate_continuation(second_half, prompt_sample_rate=prompt_sample_rate, progress=True)

        # Concatenate the first half of the output with the continuation to get a longer clip
        output = torch.cat([first_half, continuation], dim=-1)

        # Print the current length of the clip
        current_length_seconds = output.shape[-1] / prompt_sample_rate
        print(f'Current length of the clip: {current_length_seconds} seconds')

    # If the output is longer than the desired length, trim it
    if output.shape[-1] > total_length_samples:
        output = output[..., :total_length_samples]

    return output
```

bizrockman avatar Jun 12 '23 16:06 bizrockman

The vanishing audio after extension was experienced & addressed in the comment I tagged before. I invite you to check that out.

Duemellon avatar Jun 12 '23 16:06 Duemellon

As for fixing the seed and getting consistent generations, this works for me:
https://github.com/rsxdalv/tts-generation-webui/blob/main/src/musicgen/musicgen_tab.py#LL172C22-L172C39
https://github.com/rsxdalv/tts-generation-webui/blob/main/src/utils/set_seed.py#L8

In my quick testing I didn't see a performance difference, but that might be down to an old CUDA version and/or GPU.

rsxdalv avatar Jun 12 '23 22:06 rsxdalv

It seems that your generate_long_audio() is missing "descriptions" on the continuations, a.k.a. the text prompt.

rsxdalv avatar Jun 12 '23 22:06 rsxdalv
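
Passing the text prompt on every continuation might look like the sketch below. It assumes that `generate_continuation` accepts a `descriptions` list with one entry per batch item and returns the prompt together with the newly generated audio, as in audiocraft's MusicGen. `extend_with_text` and its `concat` parameter are hypothetical names: `concat` would normally be `torch.cat`, and it is passed in only to keep the sketch independent of any tensor library. Whether this also cures the fading output is untested here.

```python
def extend_with_text(model, audio, description, concat,
                     sample_rate=32000, total_seconds=120, overlap_seconds=15):
    """Extend `audio` to `total_seconds`, re-sending the text prompt each step.

    `audio` is expected to be shaped [batch, channels, samples].
    """
    total = total_seconds * sample_rate
    overlap = overlap_seconds * sample_rate

    while audio.shape[-1] < total:
        # Use the last `overlap_seconds` of audio as the continuation prompt.
        split = max(audio.shape[-1] - overlap, 0)
        head, tail = audio[..., :split], audio[..., split:]

        # One description per batch item, so the text conditioning is kept.
        descriptions = [description] * tail.shape[0]
        extended_tail = model.generate_continuation(
            tail,
            prompt_sample_rate=sample_rate,
            descriptions=descriptions,
            progress=True,
        )

        # Guard against an endless loop if the model adds no new samples.
        if extended_tail.shape[-1] <= tail.shape[-1]:
            raise RuntimeError("model returned no new audio")

        audio = concat([head, extended_tail], dim=-1)

    # Trim any overshoot past the requested length.
    return audio[..., :total]


# Hypothetical usage (assumes `model` and a 30-second `clip` tensor exist):
# long_clip = extend_with_text(model, clip, "lofi hip hop beat", torch.cat)
```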

Something changed with deterministic outputs: 0.0.1 worked, but 0.0.2a2 is only deterministic for the first 2-3 seconds. https://github.com/facebookresearch/audiocraft/issues/111

rsxdalv avatar Jun 18 '23 14:06 rsxdalv

Curious if anyone has gotten high-quality results in this vein. I tried for a few hours and found that I could fix the petering out or fix the disjointedness, but not both. Has anyone produced arbitrary-length clips of high quality that neither peter out nor lose track of the starting theme?

starktyping avatar Oct 05 '23 11:10 starktyping