IMS-Toucan
Relationship between hop length and upsample rates
me again 😄
Regarding the hop length: your 16kHz spectrogram has a hop length of 256 and a window of 1024 (4 * hop_length).
Since you are tripling the sampling rate, why is the hop length set to only 1.5 times that (384), with a window of 1536? I would have expected the hop length to stay proportional to the sampling rate. Indeed, you have a comment in HiFiGANDataset.py that states "hop length of spec loss must be same as the product of the upscale factors", but your upscale rates (8, 6, 4, 4) have a product of 768 (3 * 256). So I would have expected the hop_length to be set to 768 and the window to 3072. Am I misunderstanding something here?
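For reference, the arithmetic behind the question can be checked with a small sketch. This is not code from the repository: the function name `expected_hop_length` is made up for illustration, the 48 kHz target rate is inferred from "tripling" 16 kHz, and the window is assumed to follow the same 4 * hop_length convention mentioned above.

```python
from math import prod


def expected_hop_length(upsample_rates):
    """Waveform samples produced per spectrogram frame by a generator
    that upsamples by each of the given factors in turn (hypothetical helper)."""
    return prod(upsample_rates)


# Values quoted in this thread.
upsample_rates = (8, 6, 4, 4)
base_hop_length = 256        # hop length of the 16 kHz spectrogram
configured_hop_length = 384  # hop length currently used for the spec loss

# Assumption: the target rate is 48 kHz, i.e. three times 16 kHz.
rate_ratio = 48_000 // 16_000

hop_length = expected_hop_length(upsample_rates)
window = 4 * hop_length  # assumed window = 4 * hop_length convention

print("product of upsample rates:", hop_length)
print("proportional hop length:", base_hop_length * rate_ratio)
print("window at 4 * hop_length:", window)
print("configured hop matches product:", configured_hop_length == hop_length)
```

Under these assumptions the product of the rates and the proportionally scaled hop length agree with each other, while the configured value of 384 matches neither.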
P.S. sorry if I'm being pedantic 😅
Good point, I'm not sure why I did it that way and I even contradict myself with that comment, so it might be a bug 🤔
The resulting spectrogram will be slightly more narrow-band than it should be, which isn't a big problem. The impact on the actual performance is probably marginal, because the Feature Matching Loss of the Discriminators massively outweighs the Spectrogram Distance loss. The distance loss is mostly there for a little bit of warmup before the discriminator losses come in. But still, there's no reason not to just make it work as originally intended. I'll try it out in my current training run, and if it doesn't cause weird size mismatches, I'll change it in all branches. Thanks for pointing that out, I would never have looked at that piece of code again :D
cool, yea, probably not a big problem at all, just thought I'd bring it up. Thanks!
I changed the hop length, so it's the way it's supposed to be. In practice I didn't notice a difference though.
Also, the new Avocodo checkpoint is finally out! I'm not sure, though, whether the improvement is really noticeable. I'll train for a bit longer and update the model on the release page at some point, but I don't think the difference is very big.
great - thanks for following up! Yea, it seems like the biggest improvement was with unseen speakers. In any case, this is great thanks!