Parallel-Tacotron2

fix in implementation of S-DTW backward @taras-sereda

Open taras-sereda opened this issue 3 years ago • 1 comment

Hey, I've found that in your implementation of the S-DTW backward pass, the E matrices are not used; instead you use the G matrices, and their entries ignore the scaling factors a, b, c. What's the reason for this? My guess is that you do this to preserve and propagate gradients, because they otherwise vanish due to the small values of a, b, c. But I might be wrong, so I'd be glad to hear your motivation.
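For reference, this is what I mean by the exact backward: a minimal NumPy sketch of Algorithm 2 from the original Soft-DTW paper (Cuturi & Blondel), without the warp penalty or bandwidth of the Parallel Tacotron 2 variant. The names and shapes are mine, not the repo's; R is assumed to be the padded accumulated soft-min cost matrix from the forward pass.

    import numpy as np

    def soft_dtw_backward(D, R, gamma):
        # D: (m, n) pairwise distance matrix between the two sequences.
        # R: (m + 2, n + 2) padded accumulated soft-min cost matrix from the
        #    forward pass. Returns E[1:m+1, 1:n+1] = dL/dD.
        m, n = D.shape
        D_pad = np.zeros((m + 2, n + 2))
        D_pad[1:m + 1, 1:n + 1] = D
        E = np.zeros((m + 2, n + 2))
        E[m + 1, n + 1] = 1.0
        R = R.copy()
        R[:, n + 1] = -np.inf
        R[m + 1, :] = -np.inf
        R[m + 1, n + 1] = R[m, n]
        for j in range(n, 0, -1):
            for i in range(m, 0, -1):
                # a, b, c are the transition weights; they never exceed 1,
                # which is why the exact recursion can yield tiny gradients.
                a = np.exp((R[i + 1, j] - R[i, j] - D_pad[i + 1, j]) / gamma)
                b = np.exp((R[i, j + 1] - R[i, j] - D_pad[i, j + 1]) / gamma)
                c = np.exp((R[i + 1, j + 1] - R[i, j] - D_pad[i + 1, j + 1]) / gamma)
                E[i, j] = a * E[i + 1, j] + b * E[i, j + 1] + c * E[i + 1, j + 1]
        return E[1:m + 1, 1:n + 1]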

Playing with your code, I also found that gradients vanish, especially when bandwidth=None. I'm solving this by normalizing the distance matrix by n_mel_channel (see the sketch after the hparams below). With this normalization and the exact implementation of the S-DTW backward pass, I converge on overfit experiments quicker than with the non-exact computation. I'm using these S-DTW hparams:

gamma = 0.05
warp = 256
bandwidth = 50
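Concretely, the normalization I mean is just dividing the frame-wise distance matrix by the number of mel channels before running the soft-DTW recursions. A minimal sketch, assuming an L1 frame distance; frame_distance_matrix and the tensor shapes are mine, not the repo's:

    import torch

    def frame_distance_matrix(pred, target):
        # pred, target: (batch, time, n_mel_channels)
        # L1 distance between every predicted frame and every target frame,
        # normalized by the mel dimension so the accumulated soft-DTW costs
        # (and hence the a, b, c scaling factors) stay in a reasonable range.
        n_mel_channels = pred.size(-1)
        dist = torch.cdist(pred, target, p=1.0)  # (batch, time_pred, time_target)
        return dist / n_mel_channels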

here is a small test I'm using for checks:

        # this snippet lives inside a test-class method; self.numba_soft_dtw
        # is the soft-DTW loss under test
        import numpy as np
        import torch
        from torch.optim import Adam

        # path to a reference mel-spectrogram is elided here
        target_spectro = np.load('')
        target_spectro = torch.from_numpy(target_spectro)
        target_spectro = target_spectro.unsqueeze(0).cuda()  # add batch dim
        pred_spectro = torch.randn_like(target_spectro, requires_grad=True)

        optimizer = Adam([pred_spectro])

        # model fits in ~3k iterations
        n_iter = 4_000
        for i in range(n_iter):

            loss = self.numba_soft_dtw(pred_spectro, target_spectro)
            # normalize by the time dimension
            loss = loss / pred_spectro.size(1)
            loss.backward()

            if i % 1_000 == 0:
                print(f'iter: {i}, loss: {loss.item():.6f}')
                print(f'd_loss_pred {pred_spectro.grad.mean()}')

            optimizer.step()
            optimizer.zero_grad()

Curious to hear how your training is going! Best, Taras

taras-sereda · Nov 18 '21 15:11

Hi @taras-sereda, thank you very much for your effort! What you describe seems worth considering, and I'm training the model with your update, but unfortunately it shows no sign of convergence so far (after about 9 hours). The reason for G comes from deriving the backward pass following the original Soft-DTW paper (see Algorithm 2), applied to the version introduced in the Parallel Tacotron 2 paper (see Section 4.2). G is already expected to make use of the computed E, in which the coefficients a, b, c are involved. But it was a while ago, so let me double-check.

keonlee9420 · Nov 19 '21 01:11