CTranslate2

Support speech-to-text Transformer model

Open gfiameni opened this issue 2 years ago • 4 comments

Hi! I would like to use ct2-fairseq-converter to convert an existing fairseq-trained model. However, since the converter supports only a limited set of architectures, I am wondering what the best approach to adding a new model would be.

Thanks!

gfiameni avatar Oct 12 '21 09:10 gfiameni

Hi,

What custom model architecture are you referring to? Can you post the Fairseq options that are used?

If the model architecture is close to a standard Transformer, we may be able to support it and treat this issue as a feature request. Otherwise, you would need to implement the architecture in the code, which may be non-trivial as it requires experience with C++.

guillaumekln avatar Oct 12 '21 09:10 guillaumekln

Hi @guillaumekln, many thanks for your prompt feedback. I am working with the architecture described here

gfiameni avatar Oct 12 '21 10:10 gfiameni

There are two difficulties in fully integrating this model:

  • it uses convolutional layers but our library currently does not have such primitives (they should be implemented)
  • the library API is currently designed for text-to-text tasks so new entrypoints should be added to support speech-to-text

Another approach could be to add a lower-level API to use the underlying Transformer directly, so that you could run the convolutional subsampler with another tool and then call the CTranslate2-accelerated Transformer.
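As a rough illustration of that split, here is a minimal pure-Python sketch of the first stage only: a strided 1D convolution that shortens the sequence of audio frames before it is handed to a Transformer encoder. The function name `conv1d_subsample`, the kernel size, the stride, and the toy weights are all assumptions made for this example; they are not taken from fairseq or CTranslate2, and any nonlinearity between layers is omitted.

```python
from typing import List

Frames = List[List[float]]  # indexed as [time][channel]


def conv1d_subsample(frames: Frames,
                     weight: List[List[List[float]]],
                     bias: List[float],
                     stride: int = 2) -> Frames:
    """Hypothetical strided 1D convolution over the time axis.

    weight is indexed as [out_channel][in_channel][kernel_position].
    The input is zero-padded by kernel // 2 frames on each side.
    """
    kernel = len(weight[0][0])
    pad = kernel // 2
    in_channels = len(frames[0])
    zero = [0.0] * in_channels
    padded = [zero] * pad + frames + [zero] * pad
    out_len = (len(padded) - kernel) // stride + 1

    out: Frames = []
    for t in range(out_len):
        start = t * stride
        row = []
        for o in range(len(weight)):
            acc = bias[o]
            for c in range(in_channels):
                for k in range(kernel):
                    acc += weight[o][c][k] * padded[start + k][c]
            row.append(acc)
        out.append(row)
    return out


# Toy input: 8 frames with a single feature channel.
frames = [[float(i)] for i in range(8)]
# Toy weights: one output channel, one input channel, kernel size 3.
weight = [[[1.0, 1.0, 1.0]]]
bias = [0.0]

once = conv1d_subsample(frames, weight, bias)   # 8 frames -> 4 frames
twice = conv1d_subsample(once, weight, bias)    # 4 frames -> 2 frames
```

The shortened sequence (`twice` here) is what would be passed to the accelerated Transformer through the lower-level API discussed above, if such an entrypoint existed.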

guillaumekln avatar Oct 13 '21 09:10 guillaumekln

Implementing wav2vec 2.0 would be very useful.

I think CTranslate2 is the best Transformer inference accelerator on CPU, and on GPU too for low batch sizes, while being comparable for higher ones. If it manages to get the same great performance improvement for wav2vec 2.0, it will be insane. After the addition of some Transformer models, it has already become the best.

NeonBohdan avatar Jul 21 '22 23:07 NeonBohdan

We just released version 3.0, which integrates the Whisper speech-to-text model published by OpenAI! See a usage example here.

guillaumekln avatar Nov 07 '22 17:11 guillaumekln

I'm closing this issue, which I turned into a generic issue for speech-to-text. Feel free to open a new issue if you want to see support for a model architecture other than Whisper.

guillaumekln avatar Dec 14 '22 11:12 guillaumekln