Is it using fast-wavenet-decoding?

Open veqtor opened this issue 4 years ago • 11 comments

I'm trying to figure out whether fast WaveNet decoding (https://github.com/tomlepaine/fast-wavenet) is used when decoding. If not, it's quite simple to implement and would yield a big performance boost. Of course, an IAF decoder would be way faster, but...
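
For reference, the core trick in fast-wavenet is to cache each dilated layer's past activations in a queue, so generating one sample costs one step per layer instead of re-running the whole receptive field. A toy sketch of the idea (a made-up miniature model, not Jukebox's actual decoder):

import collections
import numpy as np

# Toy cached dilated layer: it keeps a queue of its last `dilation` activations,
# so each new sample needs one matmul per layer instead of recomputing the
# full receptive field.
class CachedDilatedLayer:
    def __init__(self, dilation, dim):
        self.queue = collections.deque(np.zeros((dilation, dim)), maxlen=dilation)
        self.w_past = np.random.randn(dim, dim) * 0.01  # toy weights
        self.w_now = np.random.randn(dim, dim) * 0.01

    def step(self, x):
        past = self.queue[0]  # activation from `dilation` steps ago
        self.queue.append(x)  # push the current activation, drop the oldest
        return np.tanh(past @ self.w_past + x @ self.w_now)

dim = 16
layers = [CachedDilatedLayer(d, dim) for d in (1, 2, 4, 8)]
x = np.zeros(dim)
for _ in range(100):  # 100 samples at O(n_layers) work each
    h = x
    for layer in layers:
        h = layer.step(h)
    x = h  # toy autoregressive feedback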

veqtor avatar May 16 '20 09:05 veqtor

As far as I understand, it uses a WaveNet-like network to condition the transformers in the upsamplers, and the transformer is the bottleneck here.

The decoding from tokens to raw audio is performed by the VQ-VAE, which is faster than real time by itself.
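
For a rough sense of that, here's a timing sketch (assumptions: `vqvae` is already loaded as in the Jukebox sample notebook, and `encode`/`decode` have the signatures used in the code later in this thread):

import time
import torch as t

sr = 44100
x = t.zeros(1, sr * 10, 1).cuda()        # 10 s of (silent) audio, shape [N, T, 1]
zs = vqvae.encode(x, start_level=2)      # level-2 tokens

start = time.time()
x_hat = vqvae.decode(zs, start_level=2)  # tokens -> raw audio
t.cuda.synchronize()                     # wait for the GPU before stopping the clock
print('decoded 10 s of audio in %.2f s' % (time.time() - start))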

I tried to decode the level-2 tokens into mel spectrograms and then feed them into the WaveGlow vocoder, but the result is no better than the VQ-VAE's.

I guess the information is just not there, and you need these gigantic upsamplers to make up the details.

gnhdnb avatar May 16 '20 11:05 gnhdnb

Okay, but then at least IAF with knowledge distillation, as discussed in the "Future Work" section of the paper, should be a viable avenue for accelerating upsampling.

veqtor avatar May 17 '20 09:05 veqtor

I've looked a bit further at the code and discovered that the upsamplers are also transformers, so no, it's not the VQ-VAE that does the upsampling. What would be interesting is to train a WaveNet as the bottom-level upsampler; for smaller datasets it could be faster and yield results just as good.

veqtor avatar May 20 '20 05:05 veqtor

Well, I wrote the same thing :)

As far as I understand, it uses a WaveNet-like network to condition the transformers in the upsamplers, and the transformer is the bottleneck here.

The VQ-VAE is used to decode tokens into audio; in fact, there are three different VQ-VAE decoders in there, one for each level. That's why you get three different audio files for each sample after an inference run.
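
Roughly like this (a sketch, again assuming `vqvae` is loaded as in the notebook; `x` is an audio tensor of shape [N, T, 1]):

zs = vqvae.encode(x)  # one token tensor per level (0, 1, 2)
for level in range(3):  # the released models have three levels
    x_level = vqvae.decode(zs[level:], start_level=level)  # that level's decoder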

The goal of fast upsampling is not only to do it faster than the Jukebox upsamplers, but also to do it better than the level-2 VQ-VAE decoder.

gnhdnb avatar May 20 '20 21:05 gnhdnb

If you want to give it a try, I have a 20-hour dataset of 10-second .wav clips paired with level-2 embeddings, taken from various hip hop tracks.

gnhdnb avatar May 20 '20 21:05 gnhdnb

How would you share it? Is it a few gigs, or more? How long did it take to generate, and what hardware (personal/cloud) did you use? I want to have a crack at it. How many hip hop tracks did you use?

johndpope avatar May 21 '20 10:05 johndpope

I used about 200 tracks for the dataset; it took about 20 minutes to process on an RTX 2070.

I think it would be better to share the code instead of the data, so you can try it out on your own music:

https://github.com/gnhdnb/jukebox-fast-upsampling/blob/master/upsampler-dataset-prep.ipynb

This notebook splits your music into 10-second .wav chunks paired with .emb.npy embeddings of shape [64, length of the chunk in level-2 tokens].

No augmentation is performed
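
If it helps, here's a minimal sketch of reading the pairs back (the directory name is just whatever you pointed the notebook at; the shape follows the description above):

import os
import numpy as np
import soundfile as sf

data_dir = 'sound'  # wherever the notebook wrote the chunks
pairs = []
for name in sorted(os.listdir(data_dir)):
    if name.endswith('.emb.npy'):
        emb = np.load(os.path.join(data_dir, name))  # shape [64, n_level2_tokens]
        wav, sr = sf.read(os.path.join(data_dir, name.replace('.emb.npy', '.wav')))
        pairs.append((wav, emb))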

Note that I decode the tokens with the bottleneck decoder prior to saving; you can remove that step if raw tokens are more suitable for your model architecture.
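
Concretely, the two options look like this (same calls as in the code further down this thread):

zs = vqvae.encode(x, start_level=2)  # discrete token indices
emb = vqvae.bottleneck.decode(zs, start_level=2, end_level=None)  # list of continuous embeddings; emb[0] is [1, 64, n_tokens]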

gnhdnb avatar May 23 '20 11:05 gnhdnb

Thanks. What card are you using? I guess it's the ~3 hrs -> 20 seconds of audio as per the README? But that spits out a lot of files; how many? I'm trying to gauge whether I need to load up a server + GPU from vast.ai or go out and buy a better graphics card for $2,000. I see people are having problems with 8 GB of VRAM.

johndpope avatar May 23 '20 14:05 johndpope

Thanks, upsampler-dataset-prep.ipynb seems helpful. I'm trying similar optimizations here.

matiaszabal avatar May 26 '20 22:05 matiaszabal

@gnhdnb Hey, thanks so much for sharing your code! It was really helpful for extracting embeddings from the original model they shared. I'd like to know, though, how I might extract embeddings less than one second long.

When I reduce the chunk length in seconds to less than 1 second (e.g. 0.3), I get this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-86-305f48801b37> in <module>()
     21 
     22 filenames_and_embeddings = {}
---> 23 for i in range(chunk_length, len(y), chunk_length):
     24     chunk = y[i - chunk_length:i]
     25     audio_filename = f'%06d.wav' % counter

TypeError: 'float' object cannot be interpreted as an integer

Here is the whole code:

counter = 0
import librosa as l
import os
import soundfile as sf
import numpy as np
from shutil import rmtree
rmtree('sound', ignore_errors=True)  # don't fail if the folder doesn't exist yet
output_path = 'sound'
import torch as t  # vqvae is assumed to be loaded already, as in the Jukebox notebook
if not os.path.exists(output_path):
  os.mkdir(output_path)
chunk_length_in_seconds = .3
sr = 44100
chunk_length = 128 * (sr * chunk_length_in_seconds // 128) # chunk size is rounded down to a multiple of 128

fullPath = '/content/drive/MyDrive/test.mp3'

y, _ = l.load(fullPath, sr = sr)

y = l.util.normalize(y)

filenames_and_embeddings = {}
for i in range(chunk_length, len(y), chunk_length):
    chunk = y[i - chunk_length:i]
    audio_filename = f'%06d.wav' % counter
    sf.write(os.path.join(output_path, audio_filename), chunk, sr, 'PCM_24')

    # encode the chunk to level-2 tokens, then decode them to continuous embeddings
    x = t.tensor(chunk).unsqueeze(0).unsqueeze(2).cuda()
    zs = vqvae.encode(x, start_level=2)
    emb = vqvae.bottleneck.decode(zs, start_level=2, end_level=None)
    filenames_and_embeddings.update({audio_filename: emb})

    # np.save appends .npy, so this writes %06d.emb.npy
    np.save(os.path.join(output_path, f'%06d.emb' % counter), emb[0].squeeze(0).cpu().detach().numpy())

    counter += 1

Do you have any idea how I might make this work for chunks shorter than a second? I was thinking of reducing the 128 to half of that or something.

leonhuene avatar Nov 05 '21 15:11 leonhuene

Never mind! Sorry, I figured it out. All I had to do was convert chunk_length to int. I had assumed it couldn't be safely converted because it would be something like 123.12, but it turns out every chunk_length I try under a second is a whole number like 123.0 (which makes sense, since the floor division already rounds it down to a multiple of 128; it just stays a float).

So I just did:

chunk_length = int(chunk_length)
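
An equivalent fix is to keep the arithmetic integral from the start, so no cast is needed later (just a sketch of the same rounding):

chunk_length = 128 * int(sr * chunk_length_in_seconds / 128)  # whole-number multiple of 128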

leonhuene avatar Nov 05 '21 20:11 leonhuene