
Super resolution example?

Open devilismyfriend opened this issue 2 years ago • 13 comments

Would love to see code to reproduce the paper's super resolution

devilismyfriend avatar Feb 04 '23 07:02 devilismyfriend

Sure. We will open-source that part; it is also on the TODO list.

haoheliu avatar Feb 04 '23 19:02 haoheliu

Could you possibly just send the Audio Super Resolution model you used so that we don't have to download the dataset and train ourselves?

galfaroth avatar Feb 21 '23 13:02 galfaroth

@galfaroth Super-resolution and inpainting will be available this Friday. Thanks for your patience.

haoheliu avatar Feb 21 '23 22:02 haoheliu

Awesome! Excited to test it out

devilismyfriend avatar Feb 21 '23 23:02 devilismyfriend

> @galfaroth Super-resolution and inpainting will be available this Friday. Thanks for your patience.

No way!

galfaroth avatar Feb 21 '23 23:02 galfaroth

Hi all, the code related to super-resolution and inpainting is available here: https://github.com/haoheliu/AudioLDM/blob/main/audioldm/pipeline.py#L223

It has not been integrated into the command-line interface yet because I haven't come up with an elegant and simple design for it, and I'm trying to avoid making this tool exceedingly heavy. Also, super-resolution and inpainting may not be of that broad an interest, from my perspective (correct me if I'm wrong). So I'll temporarily leave them in this Python function form. You can still play with the function, though; I've already tested it and it all works fine.

haoheliu avatar Feb 25 '23 01:02 haoheliu

Hey, I tried using the new method:

def upsample(original_filepath,text, duration, guidance_scale, random_seed, n_candidates, steps):
  waveform = super_resolution_and_inpainting(audioldm,text,original_filepath,
                                  seed=random_seed,ddim_steps=steps,
                                  duration=duration, batchsize=1,
                                  guidance_scale=guidance_scale,
                                  n_candidate_gen_per_text=int(n_candidates),
                                  time_mask_ratio_start_and_end=(1.0, 1.0), # no inpainting,
                                  freq_mask_ratio_start_and_end=(0.75, 1.0), # regenerate the higher 75% to 100% mel bins
                                  )
  if(len(waveform) == 1):
    waveform = waveform[0]
  return waveform

but then I get:

[<ipython-input-11-eac161f8fca7>](https://localhost:8080/#) in upsample(original_filepath, text, duration, guidance_scale, random_seed, n_candidates, steps)
      8 
      9 def upsample(original_filepath,text, duration, guidance_scale, random_seed, n_candidates, steps):
---> 10   waveform = super_resolution_and_inpainting(audioldm,text,original_filepath,
     11                                   seed=random_seed,ddim_steps=steps,
     12                                   duration=duration, batchsize=1,

[/content/AudioLDM/audioldm/pipeline.py](https://localhost:8080/#) in super_resolution_and_inpainting(latent_diffusion, text, original_audio_file_path, seed, ddim_steps, duration, batchsize, guidance_scale, n_candidate_gen_per_text, time_mask_ratio_start_and_end, freq_mask_ratio_start_and_end, config)
    258     )
    259 
--> 260     batch = make_batch_for_text_to_audio(text, fbank=mel[None,...], batchsize=batchsize)
    261 
    262     # latent_diffusion.latent_t_size = duration_to_latent_t_size(duration)

[/content/AudioLDM/audioldm/pipeline.py](https://localhost:8080/#) in make_batch_for_text_to_audio(text, waveform, fbank, batchsize)
     26     else:
     27         fbank = torch.FloatTensor(fbank)
---> 28         fbank = fbank.expand(batchsize, 1024, 64)
     29         assert fbank.size(0) == batchsize
     30 

RuntimeError: The expanded size of the tensor (1024) must match the existing size (512) at non-singleton dimension 1. Target sizes: [1, 1024, 64]. Tensor sizes: [1, 512, 64]

I know the base SR is 16000; where do I specify the target SR? Can it upscale to 96000, for example?
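For context, the shape mismatch in the traceback ([1, 512, 64] vs. the expected [1, 1024, 64]) suggests the requested duration did not match the input file's length: with duration=10 the pipeline allocates 1024 mel frames, while a roughly 5-second input yields only 512. A pre-check along these lines can catch the mismatch before calling the pipeline (the helper names below are hypothetical, not part of the AudioLDM API; the 102.4 frames-per-second figure is inferred from the 1024-frame target for a 10-second clip):

```python
import wave

# Inferred from the traceback: a 10 s request maps to 1024 mel frames,
# i.e. ~102.4 frames per second after padding.
FRAMES_PER_SECOND = 102.4

def expected_mel_frames(duration_sec: float) -> int:
    """Mel frames the pipeline allocates for a requested duration."""
    return int(duration_sec * FRAMES_PER_SECOND)

def wav_duration_seconds(path: str) -> float:
    """Length of a WAV file in seconds (stdlib only)."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()

# A 10 s request expects 1024 frames; a ~5 s input yields only 512,
# which reproduces the size mismatch in the error above.
```

Passing a `duration` that matches the input file (rounded to a multiple of 2.5 seconds, per the script's assertion below) should avoid the error.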

galfaroth avatar Feb 25 '23 12:02 galfaroth

@galfaroth Super-resolution here means upsampling audio with a sampling rate below 16 kHz to 16 kHz. Producing a higher sampling rate would be a separate research problem.

haoheliu avatar Feb 25 '23 12:02 haoheliu

> @galfaroth The super-resolution means upsample a sampling rate (<16 kHz) to 16 kHz. A higher sampling rate will be another research.

Apart from the upsampling resolution, why do I get the error? Can you post an example of how to do the upsampling with this method?

galfaroth avatar Feb 25 '23 22:02 galfaroth

You can use the following script (sr_inpainting.py) @galfaroth

#!/usr/bin/python3
import os
from audioldm import build_model, save_wave, get_time, super_resolution_and_inpainting
import argparse

CACHE_DIR = os.getenv(
    "AUDIOLDM_CACHE_DIR",
    os.path.join(os.path.expanduser("~"), ".cache/audioldm"))

parser = argparse.ArgumentParser()

parser.add_argument(
    "-t",
    "--text",
    type=str,
    required=False,
    default="",
    help="Text prompt to the model for audio generation",
)

parser.add_argument(
    "-f",
    "--file_path",
    type=str,
    required=False,
    default=None,
    help="(--mode transfer): original audio file for style transfer; or (--mode generation): the guidance audio file for generating similar audio",
)

parser.add_argument(
    "--transfer_strength",
    type=float,
    required=False,
    default=0.5,
    help="A value between 0 and 1. 0 means original audio without transfer, 1 means completely transfer to the audio indicated by text",
)

parser.add_argument(
    "-s",
    "--save_path",
    type=str,
    required=False,
    help="The path to save model output",
    default="./output",
)

parser.add_argument(
    "-ckpt",
    "--ckpt_path",
    type=str,
    required=False,
    help="The path to the pretrained .ckpt model",
    default=os.path.join(
                CACHE_DIR,
                "audioldm-s-full.ckpt",
            ),
)

parser.add_argument(
    "-b",
    "--batchsize",
    type=int,
    required=False,
    default=1,
    help="Generate how many samples at the same time",
)

parser.add_argument(
    "--ddim_steps",
    type=int,
    required=False,
    default=200,
    help="The sampling step for DDIM",
)

parser.add_argument(
    "-gs",
    "--guidance_scale",
    type=float,
    required=False,
    default=2.5,
    help="Guidance scale (large => better quality and relevance to text; small => better diversity)",
)

parser.add_argument(
    "-dur",
    "--duration",
    type=float,
    required=False,
    default=10.0,
    help="The duration of the samples",
)

parser.add_argument(
    "-n",
    "--n_candidate_gen_per_text",
    type=int,
    required=False,
    default=3,
    help="Automatic quality control. This number controls the number of candidates (e.g., generate three audios and choose the best to show you). A larger value usually leads to better quality with heavier computation",
)

parser.add_argument(
    "--seed",
    type=int,
    required=False,
    default=42,
    help="Changing this value (any integer) will lead to a different generation result.",
)

args = parser.parse_args()
assert args.duration % 2.5 == 0, "Duration must be a multiple of 2.5"

mode = "super_resolution_and_inpainting"
        
save_path = os.path.join(args.save_path, mode)

if(args.file_path is not None):
    save_path = os.path.join(save_path, os.path.splitext(os.path.basename(args.file_path))[0])

text = args.text
random_seed = args.seed
duration = args.duration
guidance_scale = args.guidance_scale
n_candidate_gen_per_text = args.n_candidate_gen_per_text

os.makedirs(save_path, exist_ok=True)
audioldm = build_model(ckpt_path=args.ckpt_path)

waveform = super_resolution_and_inpainting(
    audioldm,
    text,
    args.file_path,
    random_seed,
    duration=duration,
    guidance_scale=guidance_scale,
    ddim_steps=args.ddim_steps,
    n_candidate_gen_per_text=n_candidate_gen_per_text,
    batchsize=args.batchsize,
    time_mask_ratio_start_and_end=(0.10, 0.15), # regenerate the 10% to 15% of the time steps in the spectrogram
    # time_mask_ratio_start_and_end=(1.0, 1.0), # no inpainting
    # freq_mask_ratio_start_and_end=(0.75, 1.0), # regenerate the higher 75% to 100% mel bins
    freq_mask_ratio_start_and_end=(1.0, 1.0), # no super-resolution
)


save_wave(waveform, save_path, name="%s_%s" % (get_time(), text))

In the command line, run this script with:

python3 sr_inpainting.py -f trumpet.wav

The script will then inpaint the audio between the 10% and 15% time steps.
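To test super-resolution instead of inpainting, the two mask arguments in the script above can be swapped, mirroring its commented-out lines. The dicts below are just a sketch of the two settings, not a verified recipe:

```python
# Mask settings for the two modes of sr_inpainting.py above
# (tuples taken from the script's active and commented-out lines).
inpainting_kwargs = dict(
    time_mask_ratio_start_and_end=(0.10, 0.15),  # regenerate 10% to 15% of time steps
    freq_mask_ratio_start_and_end=(1.0, 1.0),    # no super-resolution
)
super_resolution_kwargs = dict(
    time_mask_ratio_start_and_end=(1.0, 1.0),    # no inpainting
    freq_mask_ratio_start_and_end=(0.75, 1.0),   # regenerate the upper mel bins
)
# e.g. super_resolution_and_inpainting(audioldm, text, args.file_path,
#                                      random_seed, **super_resolution_kwargs, ...)
```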

haoheliu avatar Feb 26 '23 00:02 haoheliu

Hey! Thanks for the reply! What if I wanted to test the super resolution? Can you provide an example too? And possibly sample in and out example.

galfaroth avatar Feb 26 '23 09:02 galfaroth

omg it's happening

bitnom avatar Feb 28 '23 00:02 bitnom

Hi @galfaroth, just modify the freq_mask_ratio_start_and_end parameter in @haoheliu's sample code. You can spend a little time understanding this repo; it's a good investment.
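As a rough illustration of what that parameter does (assuming the ratios index linearly into the 64 mel bins seen in the traceback shapes above; the helper below is hypothetical, not part of the repo):

```python
N_MEL_BINS = 64  # mel bin count seen in the [_, 1024, 64] shapes above

def regenerated_bin_range(ratio_start: float, ratio_end: float,
                          n_bins: int = N_MEL_BINS):
    """Mel bins the model re-synthesises for a given freq mask ratio pair."""
    return int(n_bins * ratio_start), int(n_bins * ratio_end)

# (0.75, 1.0) -> bins 48..64: the top quarter is regenerated (super-resolution)
# (1.0, 1.0)  -> bins 64..64: an empty range, i.e. super-resolution disabled
```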

Hikari-Tsai avatar Apr 17 '23 07:04 Hikari-Tsai