Parallel-Tacotron2
fix in implementation of S-DTW backward @taras-sereda
Hey, I've found that in your implementation of the S-DTW backward pass, the E matrices are not used; instead you are using the G matrices, and their entries ignore the scaling factors a, b, c.
What's the reason for this?
My guess is that you are doing this in order to preserve and propagate gradients, because they vanish due to the small values of a, b, c. But I might be wrong, so I'd be glad to hear your motivation for doing this.
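For reference, this is roughly the exact backward recursion I mean, from Algorithm 2 of the original Soft-DTW paper (Cuturi & Blondel, 2017). It's only a minimal NumPy sketch, with D the pairwise distance matrix, R the padded forward accumulated-cost matrix, and gamma the softmin temperature; the names are mine, not the ones used in this repo:

```python
import numpy as np

def soft_dtw_backward(D, R, gamma):
    """Exact E recursion (Algorithm 2, Cuturi & Blondel 2017).
    D: (n, m) pairwise distances, R: (n + 2, m + 2) padded forward costs."""
    n, m = D.shape
    D_pad = np.zeros((n + 2, m + 2))
    D_pad[1:n + 1, 1:m + 1] = D
    E = np.zeros((n + 2, m + 2))
    E[n + 1, m + 1] = 1.0
    R = R.copy()
    R[:, m + 1] = -np.inf
    R[n + 1, :] = -np.inf
    R[n + 1, m + 1] = R[n, m]
    for j in range(m, 0, -1):
        for i in range(n, 0, -1):
            # a, b, c are the scaling factors mentioned above; they can get
            # very small, which is where the vanishing gradients come from
            a = np.exp((R[i + 1, j] - R[i, j] - D_pad[i + 1, j]) / gamma)
            b = np.exp((R[i, j + 1] - R[i, j] - D_pad[i, j + 1]) / gamma)
            c = np.exp((R[i + 1, j + 1] - R[i, j] - D_pad[i + 1, j + 1]) / gamma)
            E[i, j] = a * E[i + 1, j] + b * E[i, j + 1] + c * E[i + 1, j + 1]
    return E[1:n + 1, 1:m + 1]  # gradient of the soft-DTW loss w.r.t. D
```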
Playing with your code, I also found that gradients vanish, especially when bandwidth=None.
So I'm solving this problem by normalizing the distance matrix by n_mel_channel. With this normalization and the exact implementation of the S-DTW backward pass, I'm able to converge on overfit experiments more quickly than with the non-exact computation of the S-DTW backward.
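If it helps, here is a minimal sketch of what I mean by normalizing the distance matrix; the function name and the exact place where the division happens are just my illustration, assuming a squared-Euclidean distance between mel frames:

```python
import torch

def pairwise_distance(pred, target, n_mel_channels):
    """pred, target: (B, T, n_mel_channels) mel-spectrograms.
    Returns a (B, T_pred, T_target) distance matrix scaled down by the
    number of mel channels so that soft-DTW gradients do not vanish."""
    D = torch.cdist(pred, target, p=2) ** 2  # squared Euclidean distances
    return D / n_mel_channels
```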
I'm using these S-DTW hparams:
gamma = 0.05
warp = 256
bandwidth = 50
Here is a small test I'm using for checks:

```python
import numpy as np
import torch
from torch.optim import Adam

# target mel-spectrogram; path elided
target_spectro = np.load('')
target_spectro = torch.from_numpy(target_spectro)
target_spectro = target_spectro.unsqueeze(0).cuda()

# start from random noise and fit it to the target with the soft-DTW loss
pred_spectro = torch.randn_like(target_spectro, requires_grad=True)
optimizer = Adam([pred_spectro])

# model fits in ~3k iterations
n_iter = 4_000
for i in range(n_iter):
    # self.numba_soft_dtw is the soft-DTW loss used elsewhere in my code
    loss = self.numba_soft_dtw(pred_spectro, target_spectro)
    loss = loss / pred_spectro.size(1)
    loss.backward()
    if i % 1_000 == 0:
        print(f'iter: {i}, loss: {loss.item():.6f}')
        print(f'd_loss_pred {pred_spectro.grad.mean()}')
    optimizer.step()
    optimizer.zero_grad()
```
Curious to hear how your training is going! Best, Taras
Hi @taras-sereda, thank you very much for your effort! I think what you claimed seems worth considering, and I'm training the model with your update, but unfortunately it shows no sign of convergence so far (it has been training for about 9 hours).
So the reason for using G comes from the derivation of the backward pass following the original Soft-DTW paper (please refer to Algorithm 2), applied to the version introduced in the Parallel Tacotron 2 paper (please refer to Section 4.2). G is already expected to make use of the calculated E, where each coefficient a, b, c is involved. But it was a while ago, so let me double-check on it.