
ONNX?

altunenes opened this issue 1 year ago • 9 comments

I've been working with the emotion2vec model and trying to convert it to ONNX format for deployment purposes. The current implementation is great for PyTorch users, but having ONNX support would enable broader deployment options.

I tried converting the model using torch.onnx.export with various approaches:

- Direct conversion of the AutoModel
- Creating a wrapper around the model components
- Implementing custom forward passes

Main challenges encountered:

- Dimension mismatches in the conv1d layers
- Issues with the masking mechanism
- Difficulties preserving the complete model architecture
- Problems with tensor handling between components

Could you please provide guidance on the correct architecture for ONNX conversion, including an example of the proper tensor dimensionality through the model? I have converted torchvision models to ONNX before, but audio models seem a bit more complicated to me :/
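
For reference, here is a simplified version of my wrapper attempt. The class name and the way the backbone module is obtained are placeholders, not the repo's actual API:

import torch

class Emotion2VecWrapper(torch.nn.Module):
    """Placeholder wrapper exposing a plain tensor-in/tensor-out forward for export."""
    def __init__(self, backbone: torch.nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # Assumption: the finetuned model accepts a raw (batch, samples) waveform
        return self.backbone(waveform)

# backbone = ...  # the underlying PyTorch module pulled out of AutoModel
# wrapped = Emotion2VecWrapper(backbone).eval()
# dummy = torch.zeros(1, 16000)  # 1 second of 16 kHz audio
# torch.onnx.export(
#     wrapped, (dummy,), "emotion2vec.onnx",
#     input_names=["waveform"], output_names=["scores"],
#     dynamic_axes={"waveform": {0: "batch", 1: "samples"}},
#     opset_version=17,
# )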

Thank you very much for your work; it works really nicely!

also see: https://github.com/modelscope/FunASR/issues/1690

altunenes avatar Nov 14 '24 16:11 altunenes

We did not provide an ONNX model. Contributions are welcome :)

ddlBoJack avatar Nov 18 '24 10:11 ddlBoJack

We did not provide an ONNX model. Contributions are welcome :)

I'm currently working to understand the model's inputs and outputs. Could you provide detailed information to help others add ONNX support? Specifically, I need the exact input and output details. Thanks.

Update: this is the example I was able to run with this repo

'''
Using the emotion representation model
rec_result only contains {'feats'}
	granularity="utterance": {'feats': [*768]}
	granularity="frame": {feats: [T*768]}
 
python main.py
'''

from funasr import AutoModel
import json
from collections import OrderedDict

# Load the finetuned emotion recognition model
model = AutoModel(model="iic/emotion2vec_base_finetuned")
mapper = ["angry", "disgusted", "fearful", "happy", "neutral", "other", "sad", "surprised", "unknown"]
wav_file = "audio.wav"
rec_result = model.generate(wav_file, granularity="utterance")
scores = rec_result[0]['scores']

# Prepare the result mapping with emotions and their probabilities
result = {emotion: float(prob) for emotion, prob in zip(mapper, scores)}
# Sort the result in descending order of probability
sorted_result = OrderedDict(sorted(result.items(), key=lambda item: item[1], reverse=True))
print(json.dumps(sorted_result, indent=4))
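
The docstring also mentions frame-level features ([T*768]); presumably the same call with granularity="frame" returns them. A sketch (extract_embedding is taken from the FunASR README and may differ by version):

# Frame-level features: per the docstring, 'feats' becomes a (T, 768) array,
# one 768-dim embedding per frame
frame_result = model.generate(wav_file, granularity="frame", extract_embedding=True)
print(frame_result[0]['feats'].shape)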

I didn't find any working example in the repo and had to play with it. I guess we need a better understanding of how funasr/models/emotion2vec/model.py works. Basically, we need to understand:

  1. the expected length and format of the audio segment (wav)
  2. which features we need to extract from it
  3. how to pass those features to the model
  4. how to map the output back to labels and probabilities

thewh1teagle avatar Dec 05 '24 02:12 thewh1teagle

I second this - it would be great to understand the details required to create an ONNX model. Much appreciated, @ddlBoJack, if you can help us out!

oddpxl avatar Dec 08 '24 17:12 oddpxl

We did not provide an ONNX model. Contributions are welcome :)

I'm currently working to understand the model's inputs and outputs. Could you provide detailed information to help others add ONNX support? Specifically, I need the exact input and output details. Thanks.

Thank you for contributing an ONNX model of emotion2vec.

  1. There is no limit on the length of the audio, because the prediction is the output of the pooling layer.
  2. The audio is 16 kHz, single-channel wav.
  3. I'm not quite sure what you mean by "features". For the finetuned model, the raw wav can be used for the forward pass, without the need to extract mel or fbank features.
  4. You can refer to the FunASR implementation for the label mapping.
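
To make point 2 concrete, a minimal input check, using soundfile here just as an example loader:

import soundfile as sf

# The model expects a raw 16 kHz, single-channel waveform; no mel/fbank extraction
audio, sr = sf.read("audio.wav", dtype="float32")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix to mono
assert sr == 16000, "resample to 16 kHz first (e.g. with librosa or torchaudio)"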

ddlBoJack avatar Dec 09 '24 10:12 ddlBoJack

  1. There is no limit on the length of the audio, because the prediction is the output of the pooling layer.
  2. The audio is 16 kHz, single-channel wav.
  3. I'm not quite sure what you mean by "features". For the finetuned model, the raw wav can be used for the forward pass, without the need to extract mel or fbank features.
  4. You can refer to the FunASR implementation for the label mapping.

Cool, I didn't know that the input can be a 16 kHz wav directly. Which finetuned model should I use?

I tried to convert the .pt file to ONNX, but it is missing some metadata. I guess I need the PyTorch class that represents the model; where can I find it? As for the output, what are its dimensions, so that I can convert it back to labels? It would be awesome if you could provide as much info as you can regarding the input (wav -> model) and the output (some matrix -> labels), assuming I know nothing. Thanks!
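
For reference, here is how I would map the output back to labels once the dimensions are clear, assuming a length-9 score vector ordered like the mapper list in my example above (the softmax is a guess, in case the graph outputs raw logits):

import numpy as np

LABELS = ["angry", "disgusted", "fearful", "happy", "neutral",
          "other", "sad", "surprised", "unknown"]

def scores_to_labels(scores: np.ndarray) -> dict:
    # Numerically stable softmax, in case the output is raw logits
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    order = probs.argsort()[::-1]  # descending probability
    return {LABELS[i]: float(probs[i]) for i in order}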

thewh1teagle avatar Dec 10 '24 19:12 thewh1teagle

We did not provide an ONNX model. Contributions are welcome :)

Could you try the work I did in https://github.com/modelscope/FunASR/pull/2359?

takipipo avatar Jan 10 '25 08:01 takipipo

Could you try the work I did in modelscope/FunASR#2359?

I see that you couldn't include LayerNorm. Did you resolve that?

thewh1teagle avatar Jan 10 '25 14:01 thewh1teagle

Could you try the work I did in modelscope/FunASR#2359?

I see that you couldn't include LayerNorm. Did you resolve that?

I've included LayerNorm; can you link to the issue?

takipipo avatar Jan 13 '25 05:01 takipipo

I created thewh1teagle/emotion2onnx. I'm still not sure how to get the inputs/outputs exactly right.

It's really easy to set up:

git clone https://github.com/thewh1teagle/emotion2onnx
cd emotion2onnx
wget https://github.com/thewh1teagle/emotion2onnx/releases/download/model-files/emotion2vec.onnx
# Install uv from https://docs.astral.sh/uv/getting-started/installation
uv run examples/usage.py
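
To poke at the exported graph directly with onnxruntime (the input name and dummy shape below are guesses; read the real ones from the session):

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("emotion2vec.onnx", providers=["CPUExecutionProvider"])
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)  # check the actual I/O names and shapes

# Dummy 1-second, 16 kHz mono waveform; adapt to the printed input shape
audio = np.zeros((1, 16000), dtype=np.float32)
outputs = session.run(None, {session.get_inputs()[0].name: audio})
print([o.shape for o in outputs])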

thewh1teagle avatar Jan 17 '25 05:01 thewh1teagle