YWMditto
I have been looking at the code of Wav2Vec2FeatureExtractor in transformers, and it says the model `wav2vec2-base-960h` was trained without using an attention mask. I wonder why, and how the model was trained...
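For context, here is a minimal sketch of what I am observing, assuming the public `transformers` API; the checkpoint name is the one from the docs and the waveforms are dummy data I made up:

```python
# Minimal sketch (my own example, not from the transformers source): the
# feature extractor for wav2vec2-base-960h is configured not to return an
# attention mask, so padded positions are simply zeros in input_values.
import numpy as np
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(
    "facebook/wav2vec2-base-960h"
)
print(feature_extractor.return_attention_mask)  # False for this checkpoint

# Two dummy 16 kHz waveforms of different lengths, padded into one batch.
speech = [
    np.random.randn(16000).astype(np.float32),
    np.random.randn(8000).astype(np.float32),
]
batch = feature_extractor(
    speech, sampling_rate=16000, padding=True, return_tensors="np"
)
print(batch.keys())  # only "input_values"; no "attention_mask" is produced
```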
For example, truncating the loss difference at 0 does not seem to be implemented here:
https://github.com/princeton-nlp/LLM-Shearing/blob/1386c8f69cfb3bf64896959cf3754d2bf87659c7/llmshearing/callbacks/dynamic_loading_callback.py#L34
Also, what is the purpose of this line?
https://github.com/princeton-nlp/LLM-Shearing/blob/1386c8f69cfb3bf64896959cf3754d2bf87659c7/llmshearing/callbacks/dynamic_loading_callback.py#L41
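To make the first question concrete, this is a small sketch of the clamping step I would have expected, assuming per-domain current and target losses; every name here (`update_domain_weights`, `current_loss`, `target_loss`, `prev_weights`, `c`) is mine for illustration and not an identifier from the repo:

```python
# Hypothetical sketch of truncating the per-domain loss difference at 0 before
# updating the data-loading proportions; not the repository's implementation.
import numpy as np

def update_domain_weights(prev_weights, current_loss, target_loss, c=1.0):
    # Only domains whose loss is still above the target get upweighted,
    # so negative differences are clamped to 0 here.
    diff = np.maximum(np.asarray(current_loss) - np.asarray(target_loss), 0.0)
    new_weights = np.asarray(prev_weights) * np.exp(c * diff)
    return new_weights / new_weights.sum()  # renormalize to a distribution

weights = update_domain_weights(
    prev_weights=[0.25, 0.25, 0.25, 0.25],
    current_loss=[2.1, 1.8, 2.4, 1.5],
    target_loss=[2.0, 1.9, 2.2, 1.6],
)
print(weights)
```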