The performance of the new models is poor for specific languages
Thank you for creating e2v. How can I access the previous model that could only output a few labels instead of 9? I find the new checkpoint (the plus large) to be much worse than the old one, at least for Persian.
The model also hallucinates a lot on short inputs (1-2 seconds), even in English.
You can restrict the logits to specific emotions (such as 5 of them) by masking out the emotions you don't need; see the sketch below. You will get performance similar to the previous model.
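A minimal sketch of the masking idea, assuming a 9-class output. The label names and the `rec_result[0]["scores"]` / `rec_result[0]["labels"]` keys are assumptions about the typical emotion2vec+ output; check them against your own checkpoint before relying on the class order.

```python
import numpy as np

# Assumed 9-class label set of emotion2vec+ (the order in your checkpoint may
# differ; check rec_result[0]["labels"] returned by model.generate to confirm).
ALL_LABELS = ["angry", "disgusted", "fearful", "happy", "neutral",
              "other", "sad", "surprised", "unknown"]

# Subset you actually want to predict, e.g. 5 emotions.
KEEP = {"angry", "happy", "neutral", "sad", "surprised"}

def mask_scores(scores, labels=ALL_LABELS, keep=KEEP):
    """Zero out the unwanted classes and renormalize, so the argmax is taken
    only over the emotions you care about."""
    scores = np.asarray(scores, dtype=np.float64)
    mask = np.array([lab in keep for lab in labels])
    masked = np.where(mask, scores, 0.0)
    masked /= masked.sum()          # renormalize over the kept classes
    return labels[int(np.argmax(masked))], masked

# Example with a dummy score vector; replace it with rec_result[0]["scores"].
pred, probs = mask_scores(np.random.dirichlet(np.ones(len(ALL_LABELS))))
print(pred, probs.round(3))
```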
If I use the feature vectors ('feats') generated by the AutoModel library's model.generate function on audio files as input to train a new model for speech emotion recognition, is this process equivalent to fine-tuning, or to training a downstream model for speech emotion recognition? Are these features equivalent to embeddings or to raw audio features?
I did not quite get your point. We provide emotion2vec for extracting features and emotion2vec+ for classification, and both types of model provide embeddings for further exploration in your own tasks.
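To make the 'feats'-as-features route concrete, here is a minimal sketch. The checkpoint name `iic/emotion2vec_base`, the `granularity`/`extract_embedding` arguments, and the file paths and labels are assumptions or placeholders; verify the exact `generate` call against the FunASR documentation for your checkpoint. Because the emotion2vec weights stay frozen and only a separate classifier is trained on the extracted embeddings, this is downstream training on embeddings rather than fine-tuning (fine-tuning would update the emotion2vec weights themselves).

```python
from funasr import AutoModel  # pip install funasr
import numpy as np
from sklearn.linear_model import LogisticRegression

# Feature-extraction checkpoint (assumed name; any emotion2vec variant that
# returns a 'feats' embedding should work the same way).
extractor = AutoModel(model="iic/emotion2vec_base")

def embed(wav_path):
    # extract_embedding=True asks FunASR to include the utterance-level
    # embedding under the 'feats' key (assumed argument/behavior).
    res = extractor.generate(wav_path, granularity="utterance",
                             extract_embedding=True)
    return np.asarray(res[0]["feats"])

# Placeholder dataset: replace with your own audio paths and emotion labels.
wav_files = ["data/clip_001.wav", "data/clip_002.wav"]
labels = ["happy", "sad"]

X = np.stack([embed(p) for p in wav_files])   # (n_samples, embedding_dim)
y = np.array(labels)

# Downstream classifier trained on the frozen embeddings -- not fine-tuning.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:1]))
```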