Discussion: NNSVS vs. NEUTRINO
Samples: https://soundcloud.com/r9y9/sets/nnsvs-and-neutrino-comparison
While I was looking into the differences between the nnsvs and neutrino samples, I noticed that there is MUCH room for improvement in the acoustic model. I will put some analysis results here for the record.
Global variance
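For context, the standard quantitative proxy for over-smoothing is the global variance (GV) of each feature dimension over time. A minimal numpy sketch of the computation (the toy data and the crude moving-average "smoother" below are my own illustration, not actual nnsvs/neutrino features):

```python
import numpy as np

def global_variance(feats):
    """Per-dimension variance over time for one utterance.

    feats: (T, D) trajectory such as mgc or bap.
    Over-smoothed output shows lower GV than natural speech.
    """
    return np.var(feats, axis=0)

# Toy illustration: smoothing a trajectory reduces its GV.
rng = np.random.default_rng(0)
natural = rng.standard_normal((200, 60))
kernel = np.ones(9) / 9.0  # crude moving-average "over-smoother"
smoothed = np.apply_along_axis(
    lambda x: np.convolve(x, kernel, mode="same"), 0, natural)

gv_nat = global_variance(natural)   # around 1.0 in every dimension
gv_smo = global_variance(smoothed)  # much smaller in every dimension
```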

Spectrogram
Upper: nnsvs, lower: neutrino

Looks like neutrino puts emphasis on the frequency bands below 8000 Hz
Aperiodicity
Upper: nnsvs, lower: neutrino

It seems that neutrino performs phrase-level synthesis (separated by rests, I guess?). Aperiodicity components are filled with constant values during pauses.
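If we wanted to mimic that behaviour, phrase boundaries could be derived from long runs of rest frames. A rough sketch (the rest-run threshold is my own placeholder, not NEUTRINO's actual logic):

```python
import numpy as np

def split_into_phrases(is_rest, min_rest_frames=20):
    """Split an utterance into phrases separated by long rests.

    is_rest: (T,) boolean array, True for frames labeled rest/pau.
    Returns (start, end) frame ranges (end exclusive) for each
    non-rest phrase; rests shorter than min_rest_frames are kept
    inside the surrounding phrase.
    """
    T = len(is_rest)
    sep = np.zeros(T, dtype=bool)
    start = None
    for t in range(T + 1):
        resting = bool(is_rest[t]) if t < T else False
        if resting and start is None:
            start = t
        elif not resting and start is not None:
            if t - start >= min_rest_frames:
                sep[start:t] = True  # long rest: a true separator
            start = None
    phrases = []
    s = None
    for t in range(T + 1):
        inside = t < T and not sep[t]
        if inside and s is None:
            s = t
        elif not inside and s is not None:
            phrases.append((s, t))
            s = None
    return phrases

is_rest = np.zeros(100, dtype=bool)
is_rest[40:70] = True  # long rest: phrase boundary
is_rest[10:15] = True  # short rest: stays inside the phrase
phrases = split_into_phrases(is_rest)  # [(0, 40), (70, 100)]
```

Each (start, end) range could then be synthesized independently, which would also explain the constant-valued aperiodicity during pauses.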
F0

MGC

- mgc 0th: ours is shifted. This is not important because the signal gains differed at training time.
- mgc higher dims: ours are clearly smoothed. Temporal fluctuations are clearly visible for neutrino, but not for nnsvs.
BAP

- Same as mgc, ours are over-smoothed
So what can we do?
So far I am thinking of the following ideas:
- Try autoregressive models to alleviate over-smoothing issues for mgc/bap modeling #15
- Design a post-filter to alleviate the over-smoothing issues. I guess modulation spectrum based post-filter would work to some extent.
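For reference, the modulation spectrum used in that line of work is essentially the power spectrum of each feature dimension's temporal trajectory; in numpy it can be sketched as follows (toy data; the FFT length here is an arbitrary choice of mine):

```python
import numpy as np

def modulation_spectrum(feats, n_fft=256):
    """Log power spectrum along the time axis, per feature dimension.

    feats: (T, D) trajectory. Returns (n_fft // 2 + 1, D).
    Over-smoothed trajectories show reduced energy in the higher
    modulation-frequency bins.
    """
    spec = np.fft.rfft(feats, n=n_fft, axis=0)
    return np.log(np.abs(spec) ** 2 + 1e-10)

rng = np.random.default_rng(0)
natural = rng.standard_normal((256, 60))
kernel = np.ones(9) / 9.0  # stand-in for a model's over-smoothing
smoothed = np.apply_along_axis(
    lambda x: np.convolve(x, kernel, mode="same"), 0, natural)

ms_nat = modulation_spectrum(natural)
ms_smo = modulation_spectrum(smoothed)
# Smoothing attenuates the high modulation frequencies, which is
# exactly the pattern visible in the mgc/bap plots above.
```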
I wonder why the higher mgc dimensions and bap generated by the GAN-based model are over-smoothed. Is there any possibility that MLPG contributes to this over-smoothing?
I think the phrase-level synthesis of neutrino may be intended to avoid running out of GPU memory.
I suspect MLPG causes over-smoothing. I tried disabling MLPG, but it actually caused quality degradation; in particular, the generated F0 became too flat. Maybe it would be worth trying to disable MLPG for the spectral features (mgc and bap) while keeping it enabled for F0. Also, note that the GAN-based model is still at an experimental stage. I am still struggling to make it work well.
Yes, phrase-level synthesis could be useful to avoid GPU out-of-memory errors when using NSF. It would also be useful if we use a modulation spectrum based post-filter (search for "segment-level post-filter" in https://ahcweb01.naist.jp/papers/journal/2016/201604_TASLP_Takamichi_1/201604_TASLP_Takamichi_1.paper.pdf)
Thank you for your rapid response. I'm sorry, but I misunderstood that the acoustic model of NNSVS was GAN-based because of the graph legends for MGC and BAP (I re-checked the descriptions of the samples on SoundCloud).
And thank you for the information about modulation spectrum based post-filter. I'll read the paper.
Sorry that's my bad. I didn't include any detailed information in the description. Some notes:
- baseline: a baseline ResSkipF0FFConvLSTM model
- gan: my attempt to integrate a GAN for training the ResSkipF0FFConvLSTM model (not very good at the moment)
- neutrino: neutrino
For spectrogram/aperiodicity/F0, I used the baseline model. For mgc/bap, I used both the baseline and gan for comparison.
Good news: I've done an initial cut of the MS post-filter, and here is a spectrogram example:
From top to bottom: gan, gan with MS post-filter, neutrino

Findings so far:
- I got very similar patterns to neutrino with the MS-based post-filter. It's likely that neutrino also uses a similar (or the same) post-filtering technique.
- Over-smoothing can be alleviated by the MS-based post-filter.
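The core of an MS-based post-filter can be sketched in a few lines: interpolate the generated trajectory's log modulation spectrum toward natural statistics while keeping the phase. This is my toy rendition of the general idea, not the exact filter used here or in the Takamichi et al. paper:

```python
import numpy as np

def ms_postfilter(gen, natural_log_ms, k=0.85):
    """Toy modulation-spectrum post-filter (per feature dimension).

    gen: (T, D) generated trajectory.
    natural_log_ms: (T // 2 + 1, D) target log modulation spectrum,
    e.g. averaged natural statistics. k interpolates between no
    change (0) and full matching (1); phase is preserved.
    """
    T = gen.shape[0]
    spec = np.fft.rfft(gen, axis=0)
    log_ms = np.log(np.abs(spec) ** 2 + 1e-10)
    new_log_ms = (1 - k) * log_ms + k * natural_log_ms
    gain = np.exp(0.5 * (new_log_ms - log_ms))
    return np.fft.irfft(spec * gain, n=T, axis=0)

# Toy check: restore fluctuation that a moving average removed.
rng = np.random.default_rng(0)
natural = rng.standard_normal((256, 20))
kernel = np.ones(9) / 9.0
smoothed = np.apply_along_axis(
    lambda x: np.convolve(x, kernel, mode="same"), 0, natural)
nat_log_ms = np.log(np.abs(np.fft.rfft(natural, axis=0)) ** 2 + 1e-10)
restored = ms_postfilter(smoothed, nat_log_ms, k=1.0)
```

With k=1.0 the output's power spectrum matches the target exactly, so the variance lost to smoothing is recovered; in practice k would be tuned to avoid artifacts.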
An illustration for 50-dim mgc with and without post-filter:


Top: NNSVS (w/ GAN-based post-filter) Bottom: Neutrino
My bad; the previous spectrogram visualization was wrong. I was assuming that neutrino uses the same mgc as ours, but it turned out they use a slightly different approach. Specifically,
- Neutrino: `pyworld.code_spectral_envelope` (or a C++ version of its implementation) to convert the spectral envelope to mgc
- nnsvs: `pysptk.sp2mc` to convert the spectral envelope to mgc
I suppose there's no big difference, but we may want to try the same approach as Neutrino to see if it actually makes a difference.
https://github.com/nnsvs/nnsvs/issues/1#issuecomment-1332554913
I'll report a more detailed comparison by Jan 2023. I'll have a long vacation for a while.