IMS-Toucan

Relationship between hop length and upsample rates

Open roedoejet opened this issue 2 years ago • 2 comments

me again 😄

A question about how the hop length is set: your 16kHz spectrogram has a hop length of 256 and a window of 1024 (4 * hop_length).

Since you are tripling the sampling rate, why is the hop length only 1.5 times as large (384), with a window of 1536? I would have expected the hop length to stay proportional to the sampling rate. Indeed, you have a comment in HiFiGANDataset.py that states "hop length of spec loss must be same as the product of the upscale factors", but your upscale rates (8, 6, 4, 4) have a product of 768 (3 * 256). So I would have expected the hop_length to be set to 768 and the window to 3072. Am I misunderstanding something here?

P.S sorry if I'm being pedantic 😅

roedoejet avatar Sep 23 '22 18:09 roedoejet
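For anyone following along, the arithmetic in the question can be checked with a few lines of Python. This is only a sketch of the numbers quoted above, not code from the repository: the function names are made up for illustration, and the 48kHz target rate is assumed from "tripling" 16kHz.

```python
from math import prod

# Values quoted in the issue; the 48 kHz target is an assumption (3 * 16 kHz).
UPSAMPLE_RATES = (8, 6, 4, 4)
INPUT_SR, INPUT_HOP = 16_000, 256
OUTPUT_SR = 48_000


def expected_hop_length(upsample_rates):
    """Hypothetical helper: samples the generator emits per input frame."""
    return prod(upsample_rates)


def frame_shift_ms(hop_length, sample_rate):
    """Hypothetical helper: time covered by one hop, in milliseconds."""
    return 1000 * hop_length / sample_rate


hop = expected_hop_length(UPSAMPLE_RATES)
window = 4 * hop  # same window = 4 * hop_length convention as the 16 kHz setup

print("product of upsample rates:", hop)
print("window at 4 * hop:", window)
print("frame shift, 16 kHz / hop 256:", frame_shift_ms(INPUT_HOP, INPUT_SR), "ms")
print("frame shift, 48 kHz / hop 768:", frame_shift_ms(hop, OUTPUT_SR), "ms")
print("frame shift, 48 kHz / hop 384:", frame_shift_ms(384, OUTPUT_SR), "ms")
```

With a hop of 768 the frame shift at 48kHz is the same 16 ms as the 16kHz input, while a hop of 384 halves it to 8 ms.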

Good point, I'm not sure why I did it that way and I even contradict myself with that comment, so it might be a bug 🤔

The resulting spectrogram will be slightly more small-band than it should be, which isn't a big problem. The impact on the actual performance is probably marginal, because the Feature Matching Loss of the Discriminators massively outweighs the Spectrogram Distance loss. The distance loss is mostly there for a little bit of warmup before the discriminator losses come in. But still, no reason not to just make it work as originally intended. I'll try it out in my current training run and if it doesn't cause weird size mismatches, I'll change it in all branches. Thanks for pointing that out, I would never have looked at that piece of code again :D

Flux9665 avatar Sep 23 '22 19:09 Flux9665
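On the size question, a rough way to see what changes is to count spectrogram frames for the same stretch of audio under each hop length. The sketch below assumes a centered STFT, where the frame count is 1 + num_samples // hop_length (the convention used by torch.stft with center=True). Whether the repository's loss uses that convention is an assumption, and the helper name is made up.

```python
def num_frames(num_samples, hop_length):
    """Hypothetical helper: frame count of a centered STFT."""
    return 1 + num_samples // hop_length


# One second of audio at each sampling rate mentioned in the thread.
one_second_16k = 16_000
one_second_48k = 48_000

frames_input = num_frames(one_second_16k, 256)    # 16 kHz input spectrogram
frames_hop_384 = num_frames(one_second_48k, 384)  # hop length before the change
frames_hop_768 = num_frames(one_second_48k, 768)  # product of the upscale rates

print("16 kHz, hop 256:", frames_input, "frames")
print("48 kHz, hop 384:", frames_hop_384, "frames")
print("48 kHz, hop 768:", frames_hop_768, "frames")
```

Under these assumptions, a hop of 768 gives the 48kHz spectrogram the same number of frames as the 16kHz input, while a hop of 384 gives twice as many. Since the distance loss compares two spectrograms computed with the same settings, either value stays internally consistent.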

cool, yea, probably not a big problem at all, just thought I'd bring it up. Thanks!

roedoejet avatar Sep 26 '22 21:09 roedoejet

I changed the hop length, so it's the way it's supposed to be. In practice I didn't notice a difference though.

Also, the new Avocodo checkpoint is finally out! I'm not sure the improvement is really noticeable, though. I'll train for a bit longer and update the model on the release page at some point, but I don't think the difference is very big.

Flux9665 avatar Oct 25 '22 16:10 Flux9665

great - thanks for following up! Yea, it seems like the biggest improvement was with unseen speakers. In any case, this is great thanks!

roedoejet avatar Oct 25 '22 16:10 roedoejet