Montreal-Forced-Aligner

No option "mfa align -n" for aligning without speaker-adaptation

Open ChristopherLandreth opened this issue 1 year ago • 1 comments

This might be a feature request, but because it was a feature in MFA 1.0 that no longer seems to be in MFA 2.0, perhaps it is a "bug".

"mfa_align" in MFA V1.0 included an option, "-n, --no_speaker_adaptation", to align clips without speaker-adaptation. This option seems to be missing in MFA V2.0 "mfa align". Using the flag "--no_speaker_adaptation" has no effect on the alignment; MFA2.0 does speaker-adapted training regardless. Perhaps there's a setting in the config.yaml file to skip speaker-adaptation, but if so I'm not aware of it. So this is a request to implement "no speaker adaptation" in V2.0 "mfa align", if it doesn't exist yet.

ChristopherLandreth avatar Jul 06 '22 19:07 ChristopherLandreth

Ah yes, it changed to the more general configuration flag uses_speaker_adaptation, so you would use it via mfa align ... --uses_speaker_adaptation False, and that will disable it. I'll try to make that a bit clearer, since it's only included in the training configuration docs, but this is the one feature flag that align uses.
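
With placeholder paths, something along these lines should work (the config-file route is an alternative, assuming --config_path is available in your version):

```
# Disable the fMLLR speaker-adaptation pass directly on the command line
# (all paths are placeholders):
mfa align corpus_directory dictionary_path acoustic_model_path output_directory \
    --uses_speaker_adaptation False

# Alternatively, keep the setting in a YAML config file and point align at it
# (assuming your version accepts --config_path):
cat > align_config.yaml <<'EOF'
uses_speaker_adaptation: false
EOF
mfa align corpus_directory dictionary_path acoustic_model_path output_directory \
    --config_path align_config.yaml
```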

mmcauliffe avatar Aug 07 '22 22:08 mmcauliffe

To piggyback off of this bug, I tried running --uses_speaker_adaptation False but it seemed to still do speaker adaptation.

~ % mfa train --clean AUDIO_DIRECTORY DICTIONARY.xt MODEL.zip OUTPUT_DIRECTORY --beam=100 --uses_speaker_adaptation False

INFO - Setting up corpus information...
INFO - Loading corpus from source files...
100%|█████████████████████████████████████| 4380/4380 [00:01<00:00, 2978.42it/s]
INFO - Found 3 speakers across 4380 files, average number of utterances per speaker: 1460.0

I think part of the problem is that the audio directory is organized into 3 subfolders (train, dev, test), and MFA is interpreting those by default as 3 speakers. The --uses_speaker_adaptation False flag isn't overriding that.
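
To illustrate (directory names here are made up), the split-based layout versus a per-speaker layout would look roughly like this:

```
# With a train/dev/test split, MFA's default speaker detection treats each
# top-level subfolder as one speaker:
AUDIO_DIRECTORY/
├── train/   <- inferred as speaker "train"
├── dev/     <- inferred as speaker "dev"
└── test/    <- inferred as speaker "test"

# A layout with one subfolder per actual speaker (speaker IDs invented here)
# would give MFA the real speaker count instead:
AUDIO_DIRECTORY/
├── speaker01/
├── speaker02/
└── speaker03/
```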

jhdeov avatar Nov 08 '22 03:11 jhdeov

FYI - I tried running mfa align with --uses_speaker_adaptation false after reinstalling with the latest bug fixes, but it still appeared to ignore it. Digging deeper, looks like the latest version updates (2.2.16) might not be on conda yet (last update was July 3)?

thealk avatar Aug 19 '23 11:08 thealk

Ah, thanks for the heads up. There was a build pipeline error with 2.2.16, but that should be resolved, and 2.2.17 is now up with its changes (the actual code in 2.2.17 is the same as in 2.2.16).

mmcauliffe avatar Aug 20 '23 21:08 mmcauliffe

Thank you SO much for being all over these bugs - so much appreciated!

Hmmm, I just reinstalled (running version 3.0.0a3 with the --clean flag), and even with --uses_speaker_adaptation false it still seems to be ignored.

Command:

mfa align --clean input_dir english_us_arpa english_us_arpa output_dir --uses_speaker_adaptation false

Relevant output:

DEBUG    Parsed corpus directory with 3 jobs in 0.1782660000000007 seconds
INFO     Found 10 speakers across 10 files, average number of utterances per speaker: 17.2
DEBUG    Acoustic model meta information:
DEBUG    architecture: gmm-hmm
         features:
           allow_downsample: true
           allow_upsample: true
           delta_pitch: 0.005
           feature_type: mfcc
           frame_length: 25
           frame_shift: 10
           high_frequency: 7800
           low_frequency: 20
           max_f0: 500
           min_f0: 50
           penalty_factor: 0.1
           sample_frequency: 16000
           snip_edges: true
           use_energy: false
           use_pitch: false
           uses_cmvn: true
           uses_deltas: false
           uses_speaker_adaptation: true
           uses_splices: true
           uses_voiced: false

I also tried adding uses_speaker_adaptation: false to the global_config.yaml in the MFA folder, but the output in the console was the same. In this case, my input TextGrids have uniquely named tiers corresponding to speaker IDs. When I run it on input where the tiers are all named the same (e.g., utterances), uses_speaker_adaptation is still ignored, BUT it only identifies 2 speakers, not the actual 10 (similar to the output I was seeing when trying to use --speaker_characters before, #669).

Am I using it incorrectly or do you think it's still not recognizing it in mfa align?

thealk avatar Aug 28 '23 18:08 thealk

It should still be respected. The model debug information there still shows it because that's how the model was trained, but it won't do the fMLLR speaker adaptation calculation and second-pass alignment. If it's still doing those, then let me know, but there should just be one pass of alignment with it set to false. That said, there is still speaker information used in CMVN.

In terms of the tiers, the default behavior in MFA for TextGrids is to treat each tier as representing a speaker (https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/corpus_structure.html#textgrid-format), so having a common "utterances" tier will result in fewer speakers. Speakers are still used even without the speaker adaptation pass, because they're typically the basis for splitting datasets across jobs (as well as the basis for calculating CMVN).
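
If it helps, one way to sanity-check how speakers are being grouped before committing to a full alignment run is the corpus validator (paths mirror your align command above; the exact arguments may differ a bit by version):

```
# Validate the corpus; the summary reports how many speakers MFA inferred
# from the folder structure / TextGrid tiers.
mfa validate input_dir english_us_arpa english_us_arpa
```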

mmcauliffe avatar Aug 28 '23 18:08 mmcauliffe