Li-Wei Chen
Hi, sorry for the late reply. It's been years since I visited this codebase. If you are referring to Eq. 4 in the paper, I think it's at https://github.com/b04901014/ISGAN/blob/master/src/model.py#L176...
Some guidelines for debugging:
- Identify whether the problem comes from the proposed algorithms (TAPT, PTAPT) or from the original wav2vec 2.0 fine-tuning. Does V-FT yield similar results? Do the...
Also, just some experience from working with the MELD audio:
- The audio needs to be normalized in terms of mean and variance across the utterance. Otherwise the loss may...
You can observe from the training loss that it is not decreasing for V-FT, so the training is not even happening. Something like:
```
wav = (wav - wav.mean()) / (wav.std() + 1e-8)  # per-utterance mean/variance normalization
```
You may add it in the `__getitem__` of the downstream dataloader. But if you run TAPT/PTAPT, you'll also have to add it in the pretrain dataloader. Or you can simply...
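For reference, a minimal sketch of what that could look like in a PyTorch `Dataset`; the class and attribute names (`MELDDataset`, `self.paths`) are placeholders, not the actual code in this repo:
```
import torchaudio
from torch.utils.data import Dataset

class MELDDataset(Dataset):  # placeholder name, not the repo's actual class
    def __init__(self, paths, labels):
        self.paths = paths
        self.labels = labels

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        wav, sr = torchaudio.load(self.paths[idx])  # shape: (channels, samples)
        wav = wav.mean(dim=0)                       # downmix to mono
        # per-utterance mean/variance normalization
        wav = (wav - wav.mean()) / (wav.std() + 1e-8)
        return wav, self.labels[idx]
```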
No. That is another way of doing normalization, for spectral-based features. For raw audio, we can do it within each sample, where the statistics are computed and applied per sample....
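To make the distinction concrete, a rough sketch (the function names are mine, not from the repo): corpus-level statistics are precomputed once over the training set, while per-sample statistics come from the sample itself:
```
import torch

def normalize_per_sample(wav: torch.Tensor) -> torch.Tensor:
    # raw audio: statistics come from this utterance alone
    return (wav - wav.mean()) / (wav.std() + 1e-8)

def normalize_with_corpus_stats(feat: torch.Tensor,
                                mean: torch.Tensor,
                                std: torch.Tensor) -> torch.Tensor:
    # spectral features: mean/std precomputed once over the training set
    return (feat - mean) / (std + 1e-8)
```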
You can still observe from the log that your loss is not decreasing. The learning rate needs to be lower, such as `--lr 2e-5`. [qq.log](https://github.com/b04901014/FT-w2v2-ser/files/7682114/qq.log) Here is some log...
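Just to illustrate what that flag corresponds to, a minimal sketch assuming a HuggingFace wav2vec 2.0 backbone (the repo's actual optimizer setup may differ):
```
import torch
from transformers import Wav2Vec2Model  # assumption: HF transformers backbone

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
# a lower fine-tuning learning rate, equivalent to passing --lr 2e-5
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```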
Yeah, maybe it's just the learning rate that matters. Hyper-parameters should be tuned from dataset to dataset.
Hi, good question! We didn't focus much on this, but we can apply the exact same TSA algorithm to the speaker conversion model in NANSY. We can just view the...
Got it. In this case, I agree with you: it should be an issue with the speaker embedding extractor, and TSA should help. It should be straightforward to apply TSA to fine-tuning...