Montreal-Forced-Aligner
No option "mfa align -n" for aligning without speaker-adaptation
This might be a feature request, but because it was a feature in MFA 1.0 that no longer seems to be in MFA 2.0, perhaps it is a "bug".
"mfa_align" in MFA V1.0 included an option, "-n, --no_speaker_adaptation", to align clips without speaker-adaptation. This option seems to be missing in MFA V2.0 "mfa align". Using the flag "--no_speaker_adaptation" has no effect on the alignment; MFA2.0 does speaker-adapted training regardless. Perhaps there's a setting in the config.yaml file to skip speaker-adaptation, but if so I'm not aware of it. So this is a request to implement "no speaker adaptation" in V2.0 "mfa align", if it doesn't exist yet.
Ah yes, it changed to the more general configuration flag uses_speaker_adaptation, so you would use it via mfa align ... --uses_speaker_adaptation False and that will disable it. I'll try to make that a bit more clear, since it's included in the training configuration docs, but this is the one feature flag that align uses.
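As an alternative to the command-line flag, the same setting should also work from a YAML file passed via --config_path (assuming the 2.x --config_path option; the key name matches the flag). A minimal sketch:

```yaml
# Hypothetical align_config.yaml, passed as:
#   mfa align corpus_dir dict.txt model.zip output_dir --config_path align_config.yaml
# Disables the second-pass fMLLR speaker adaptation during alignment.
uses_speaker_adaptation: false
```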
To piggyback off of this bug, I tried running --uses_speaker_adaptation False, but it seemed to still do speaker adaptation.
~ % mfa train --clean AUDIO_DIRECTORY DICTIONARY.xt MODEL.zip OUTPUT_DIRECTORY --beam=100 --uses_speaker_adaptation False
INFO - Setting up corpus information...
INFO - Loading corpus from source files...
100%|█████████████████████████████████████| 4380/4380 [00:01<00:00, 2978.42it/s]
INFO - Found 3 speakers across 4380 files, average number of utterances per speaker: 1460.0
I think part of the error is that the audio directory is organized into 3 subfolders (train, dev, test), and MFA is interpreting those by default as 3 speakers. The --uses_speaker_adaptation False flag isn't overriding that.
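Since MFA treats each immediate subdirectory of the corpus root as one speaker, one workaround is to reorganize the corpus by speaker rather than by split. A sketch (all paths and speaker names here are hypothetical):

```shell
# Flatten train/dev/test splits into one directory per actual speaker,
# so MFA's speaker count matches reality instead of the split names.
mkdir -p corpus/speaker1 corpus/speaker2 corpus/speaker3

# Each speaker directory holds that speaker's audio + transcript pairs.
touch corpus/speaker1/utt001.wav corpus/speaker1/utt001.lab
touch corpus/speaker2/utt002.wav corpus/speaker2/utt002.lab
touch corpus/speaker3/utt003.wav corpus/speaker3/utt003.lab

ls corpus   # three speaker directories, one per actual speaker
```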
FYI - I tried running mfa align with --uses_speaker_adaptation false after reinstalling with the latest bug fixes, but it still appeared to ignore it. Digging deeper, it looks like the latest version updates (2.2.16) might not be on conda yet (the last update was July 3)?
Ah, thanks for the heads up, there was a build pipeline error with 2.2.16, but that should be resolved and 2.2.17 is now up with its changes (actual code in 2.2.17 is the same as 2.2.16)
Thank you SO much for being all over these bugs - so much appreciated!
Hmmm, just reinstalled (running version 3.0.0a3 with the --clean flag), and even with --uses_speaker_adaptation false it still seems to be ignored.
Command:
mfa align --clean input_dir english_us_arpa english_us_arpa output_dir --uses_speaker_adaptation false
Relevant output:
DEBUG Parsed corpus directory with 3 jobs in 0.1782660000000007 seconds
INFO Found 10 speakers across 10 files, average number of utterances per
speaker: 17.2
DEBUG Acoustic model meta information:
DEBUG architecture: gmm-hmm
features:
allow_downsample: true
allow_upsample: true
delta_pitch: 0.005
feature_type: mfcc
frame_length: 25
frame_shift: 10
high_frequency: 7800
low_frequency: 20
max_f0: 500
min_f0: 50
penalty_factor: 0.1
sample_frequency: 16000
snip_edges: true
use_energy: false
use_pitch: false
uses_cmvn: true
uses_deltas: false
uses_speaker_adaptation: true
uses_splices: true
uses_voiced: false
I also tried adding uses_speaker_adaptation: false to the global_config.yaml in the MFA folder, but the console output was the same.
In this case, my input TextGrids have uniquely named tiers corresponding to speaker IDs. When I run it on input where the tiers are all named the same (e.g., utterances), uses_speaker_adaptation is still ignored, BUT it only identifies 2 speakers, not the actual 10 (similar to the output I was seeing when trying to use --speaker_characters before (#669)). Am I using it incorrectly, or do you think it's still not recognizing it in mfa align?
It should still be respected. The model debug information there still shows uses_speaker_adaptation: true because that's how the model was trained, but alignment won't do the fMLLR speaker-adaptation calculation and second-pass alignment. If it's still doing those, let me know, but with it set to false there should be just one pass of alignment. That said, speaker information is still used in CMVN.
In terms of the tiers, the default behavior in MFA for TextGrids is to treat each tier as representing a speaker: https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/corpus_structure.html#textgrid-format. So a common "utterances" tier will result in fewer speakers. Speakers are still used even without the speaker-adaptation pass, because they're typically the basis for splitting datasets across jobs (as well as for calculating CMVN).
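To illustrate the tier-per-speaker convention, a minimal long-form TextGrid with two speaker tiers might look like the following (tier names and texts here are hypothetical; MFA would count this file as two speakers):

```
File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 3
tiers? <exists>
size = 2
item []:
    item [1]:
        class = "IntervalTier"
        name = "speaker_01"
        xmin = 0
        xmax = 3
        intervals: size = 1
        intervals [1]:
            xmin = 0
            xmax = 1.5
            text = "hello world"
    item [2]:
        class = "IntervalTier"
        name = "speaker_02"
        xmin = 0
        xmax = 3
        intervals: size = 1
        intervals [1]:
            xmin = 1.5
            xmax = 3
            text = "hi there"
```

Naming both tiers "utterances" instead would collapse them into a single speaker from MFA's point of view.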