                        Add support to export facebook encodec models to ONNX
Feature request
When I try to use optimum-cli to export the facebook/encodec_32khz model I get this error:
%  optimum-cli export onnx --model facebook/encodec_32khz encodec.onnx
Framework not specified. Using pt to export to ONNX.
/Users/micchig/micromamba/envs/music-representation/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Traceback (most recent call last):
  File "/Users/micchig/micromamba/envs/music-representation/bin/optimum-cli", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/micchig/micromamba/envs/music-representation/lib/python3.11/site-packages/optimum/commands/optimum_cli.py", line 163, in main
    service.run()
  File "/Users/micchig/micromamba/envs/music-representation/lib/python3.11/site-packages/optimum/commands/export/onnx.py", line 246, in run
    main_export(
  File "/Users/micchig/micromamba/envs/music-representation/lib/python3.11/site-packages/optimum/exporters/onnx/__main__.py", line 408, in main_export
    raise ValueError(
ValueError: Trying to export a encodec model, that is a custom or unsupported architecture for the task feature-extraction, but no custom onnx configuration was passed as `custom_onnx_configs`. Please refer to https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#custom-export-of-transformers-models for an example on how to export custom models. Please open an issue at https://github.com/huggingface/optimum/issues if you would like the model type encodec to be supported natively in the ONNX export.
I am following the advice in the message and opening an issue here. :)
Motivation
I want to use the encodec model for inference, and I'd much rather use ONNX than import the pretrained model from transformers and run it in PyTorch every time, since ONNX is much faster.
Your contribution
I'm afraid I can't contribute to this personally.
Thank you @giamic, adding it to the todo list :)
Hi @giamic, this one is highly non-trivial. I'm working on it this week.
@xenova @giamic I am planning to export a model whose I/O is the same as https://github.com/huggingface/transformers/blob/f01e1609bf4dba146d1347c1368c8c49df8636f6/src/transformers/models/encodec/modeling_encodec.py#L575 and https://github.com/huggingface/transformers/blob/f01e1609bf4dba146d1347c1368c8c49df8636f6/src/transformers/models/encodec/modeling_encodec.py#L703. Does that sound fine to you for your use cases? Subparts (quantizer, etc.) would not be exported independently.
Thank you @fxmarty ! If I understand correctly, there would be two separate models: EncodecEncoder and EncodecDecoder. The Encoder would take an audio file and output its quantised representation, where every element of the output array would be a codebook index; and the decoder part would take the quantised representation and output an audio file.
I think this is generally good. What I haven't understood is whether we would get access to the codebooks to map the quantised representation back into the non-quantised latent space. (You said that the quantiser would not be exported independently, but maybe it's possible to just write the codebooks to file, so that we could at least do the decoding part of the quantiser ourselves.)
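For illustration, decoding quantised codes from exported codebooks would amount to a residual-vector-quantisation lookup: each quantizer's code indexes that quantizer's embedding table, and the looked-up embeddings are summed to recover the latent vector. A minimal sketch in plain Python, with made-up toy codebooks (in EnCodec the real tables live in the quantizer's weights):

```python
# Sketch: recovering a latent vector from RVQ codes using exported codebooks.
# The codebook values below are toy data, not EnCodec's actual weights.

def rvq_decode(codes, codebooks):
    """codes: one code index per quantizer.
    codebooks: one embedding table per quantizer (list of vectors).
    Returns the summed embeddings, i.e. the de-quantised latent vector."""
    dim = len(codebooks[0][0])
    latent = [0.0] * dim
    for q, code in enumerate(codes):
        vec = codebooks[q][code]
        latent = [a + b for a, b in zip(latent, vec)]
    return latent

# Two quantizers, each with a 4-entry codebook of 3-dim vectors (toy data).
codebooks = [
    [[0.1, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.3], [0.1, 0.1, 0.1]],
    [[0.05, 0.0, 0.0], [0.0, 0.05, 0.0], [0.0, 0.0, 0.05], [0.02, 0.02, 0.02]],
]
latent = rvq_decode([1, 2], codebooks)  # [0.0, 0.2, 0.05]
```

This is just the residual-sum step; the exported decoder would then map such latent vectors back to audio.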
@giamic Exactly, specifically, I was thinking there would be (following the above encode & decode functions):
- encodec_encode.onnx that takes input_values (audio), returns encoded_frames of shape (nb_frames, batch_size, num_quantizers, chunk_length)
- encodec_decode.onnx that takes audio_codes inputs ((1, batch_size, num_quantizers, chunk_length)), returns audio_values.
I think what you call "codebooks" is audio_codes? So that would be fine.
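Given those shapes, driving the two exported files would presumably mean looping over the leading frame axis of encoded_frames and feeding one (1, batch_size, num_quantizers, chunk_length) slice at a time to the decoder. A rough sketch of that bookkeeping; decode_frame here is a hypothetical stand-in (a real pipeline would run encodec_decode.onnx, e.g. via an onnxruntime session) that only returns placeholder samples so the shape handling is visible:

```python
# Sketch of the frame loop implied by the proposed I/O. `decode_frame` is a
# hypothetical stand-in for running encodec_decode.onnx on a single frame.

def decode_frame(audio_codes):
    # audio_codes: nested list of shape (1, batch_size, num_quantizers, chunk_length)
    assert len(audio_codes) == 1
    chunk_length = len(audio_codes[0][0][0])
    # A real decoder would return waveform samples; we return placeholders.
    return [0.0] * chunk_length

def decode_all(encoded_frames):
    # encoded_frames: shape (nb_frames, batch_size, num_quantizers, chunk_length)
    audio = []
    for frame in encoded_frames:
        # Re-wrap each frame into the (1, batch, quantizers, chunk) layout the
        # decoder expects, and concatenate the resulting audio along time.
        audio.extend(decode_frame([frame]))
    return audio

# Toy input: 2 frames, batch of 1, 2 quantizers, chunk length 3.
encoded_frames = [
    [[[0, 1, 2], [3, 0, 1]]],
    [[[2, 2, 0], [1, 1, 3]]],
]
audio_values = decode_all(encoded_frames)  # 2 frames * 3 samples = 6 samples
```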