Li-Wei Chen
Hi, sorry for the late reply. It's been years since I visited this codebase. If you are referring to Eq. 4 in the paper, I think it's at https://github.com/b04901014/ISGAN/blob/master/src/model.py#L176...
Some guidelines for debugging:
- Identify whether the problem comes from the proposed algorithms (TAPT, PTAPT) or from the original wav2vec 2.0 fine-tuning. Does V-FT yield similar results? Do the...
Also, just some experience from working with the MELD audio:
- The audio needs to be normalized in terms of mean and variance across the utterance. Otherwise the loss may...
You can observe from the training loss that it is not decreasing for V-FT, so the training is not even happening. Something like:
```
wav = (wav - wav.mean()) / (wav.std() + 1e-8)  # per-utterance mean/variance normalization
```
You may add it in the `__getitem__` of the downstream dataloader. But if you run TAPT/PTAPT, you'll also have to add it in the pretrain dataloader. Or you can simply...
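For reference, a minimal sketch of what that could look like in a PyTorch `Dataset`; the class and attribute names (`MELDDataset`, `self.paths`) are placeholders, not the actual code in this repo:
```
import torchaudio
from torch.utils.data import Dataset

class MELDDataset(Dataset):  # placeholder name, not the repo's actual class
    def __init__(self, paths, labels):
        self.paths = paths
        self.labels = labels

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        wav, sr = torchaudio.load(self.paths[idx])  # shape: (channels, samples)
        wav = wav.mean(dim=0)                       # downmix to mono
        # per-utterance mean/variance normalization
        wav = (wav - wav.mean()) / (wav.std() + 1e-8)
        return wav, self.labels[idx]
```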
No. That is another way of doing normalization, for spectral-based features. For raw audio, we can do it within each sample, where the statistics are computed and applied per sample....
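To make the distinction concrete, a rough sketch (the function names are mine, not from the repo): corpus-level statistics are precomputed once over the training set, while per-sample statistics come from the sample itself:
```
import torch

def normalize_per_sample(wav: torch.Tensor) -> torch.Tensor:
    # raw audio: statistics come from this utterance alone
    return (wav - wav.mean()) / (wav.std() + 1e-8)

def normalize_with_corpus_stats(feat: torch.Tensor,
                                mean: torch.Tensor,
                                std: torch.Tensor) -> torch.Tensor:
    # spectral features: mean/std precomputed once over the training set
    return (feat - mean) / (std + 1e-8)
```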
You can still observe from the log that your loss is not decreasing. The learning rate needs to be lower, such as `--lr 2e-5`. [qq.log](https://github.com/b04901014/FT-w2v2-ser/files/7682114/qq.log) Here is some log...
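Just to illustrate what that flag corresponds to, a minimal sketch assuming a HuggingFace wav2vec 2.0 backbone (the repo's actual optimizer setup may differ):
```
import torch
from transformers import Wav2Vec2Model  # assumption: HF transformers backbone

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
# a lower fine-tuning learning rate, equivalent to passing --lr 2e-5
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```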
Yeah, maybe it's just the learning rate that matters. Hyper-parameters should be tuned from dataset to dataset.
Hi, good question! We didn't focus much on this, but we can apply the exact same TSA algorithm to the speaker conversion model in NANSY. We can just view the...
Got it. In this case, I agree with you: it should be an issue with the speaker embedding extractor, and TSA should help. It should be straightforward to apply TSA to fine-tuning...