WhisperPositionalEmbedding
Hi there, I'm trying to fine-tune the Whisper model, but there is a problem: the decoder positional embedding (size [448, 768] in the small model's case) cannot handle sequences longer than 448 positions (its first dimension). I have two questions. Q1) When I use a wav file that is 10 seconds or longer, this problem stops training. Is it related to the file size?
The problematic code is below:
# embed positions
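# training stops at the next line when input_ids is longer than max_target_positions (448),
# because self.embed_positions only holds 448 learned position vectors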
positions = self.embed_positions(input_ids, past_key_values_length=past_key_values_length)
hidden_states = inputs_embeds + positions
The stopped line is transformers/models/whisper/modeling_whisper.py:872. If I change max_target_positions, then I end up with a randomly initialized embedding layer instead of Whisper's existing embedding layer. Q2) Is there any solution?
Hey @Macsim2! The best thing to do here is to filter out any transcriptions longer than Whisper's max length, see https://discuss.huggingface.co/t/open-to-the-community-whisper-fine-tuning-event/26681/21?u=sanchit-gandhi
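For reference, a minimal sketch of such a filter, assuming your data is a 🤗 Datasets object whose examples already have a "labels" column of tokenized transcriptions (the column name, variable names, and the hard-coded length are assumptions; in practice read the limit from model.config.max_target_positions):

# Whisper's decoder has only 448 learned positional embeddings
# (model.config.max_target_positions), so any example whose tokenized
# transcription is longer than that will break training.
max_label_length = 448  # assumed value; read from model.config.max_target_positions in practice

def is_label_in_length_range(labels):
    # keep the example only if its label sequence fits in the decoder
    return len(labels) < max_label_length

# assumes `vectorized_dataset` is a datasets.Dataset with a "labels" column
vectorized_dataset = vectorized_dataset.filter(
    is_label_in_length_range, input_columns=["labels"]
)

Running this before training drops the over-length examples, so the embed_positions lookup above never sees an index beyond 448.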