Why is wav2vec2-base-960h trained without using an attention mask?
I have seen the code of Wav2Vec2FeatureExtractor in transformers, and it says that the model wav2vec2-base-960h was trained without using an attention mask.
I wonder why and how the model was trained without an attention mask to mask out the padded positions.
Doesn't this introduce errors when the padded positions are included in the computation?
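For reference, this is roughly how I checked it (a minimal sketch, assuming the facebook/wav2vec2-base-960h checkpoint on the Hugging Face Hub is the one in question):

```python
from transformers import Wav2Vec2FeatureExtractor

# Load the feature extractor shipped with the checkpoint
# (assuming facebook/wav2vec2-base-960h is the checkpoint meant here)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(
    "facebook/wav2vec2-base-960h"
)

# For this checkpoint the extractor reports that it does not return an
# attention mask by default, which is what prompted my question
print(feature_extractor.return_attention_mask)
```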