transformers
Add Microsoft CLAP model
What does this PR do?
This PR aims to add the Microsoft CLAP (MSClap) model to Transformers. The architecture can be decomposed into two parts:
The first part contains:
- A GPT-2 Text Encoder, based on the Transformers GPT-2 model.
- An Audio Encoder, based on the HTS-AT architecture.
It can be used mostly for zero-shot audio classification or audio retrieval.
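As a rough illustration of how zero-shot audio classification works in a CLAP-style dual-encoder model: both encoders project into a shared embedding space, and the label whose text embedding is most similar to the audio embedding wins. The vectors and `logit_scale` below are toy placeholders, not real model outputs:

```python
# Sketch of the zero-shot classification logic behind CLAP-style models.
# The embeddings here are made-up toy vectors, not actual encoder outputs.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(audio_embedding, text_embeddings, labels, logit_scale=100.0):
    # Scaled similarities play the role of logits, as in CLIP/CLAP.
    logits = [logit_scale * cosine_similarity(audio_embedding, t) for t in text_embeddings]
    # Softmax (with max-subtraction for stability) turns logits into probabilities.
    shifted = [l - max(logits) for l in logits]
    exp = [math.exp(s) for s in shifted]
    total = sum(exp)
    probs = [e / total for e in exp]
    return dict(zip(labels, probs))

audio = [0.9, 0.1, 0.2]                     # toy audio embedding
texts = [[0.8, 0.2, 0.1], [0.1, 0.9, 0.3]]  # toy text embeddings for two labels
print(zero_shot_classify(audio, texts, ["dog bark", "car horn"]))
```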
The second part adds:
- A mapper model, which maps the audio embeddings to a GPT-2 input sequence.
- A GPT-2 text decoder (also based on the Transformers model).

This second part can be used to perform audio captioning.
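The mapper idea can be sketched as follows: a learned projection turns a single audio embedding into a short sequence of pseudo-token embeddings that GPT-2 can consume as a prefix. All names, dimensions, and the random stand-in weight matrix are illustrative placeholders, not the actual MSClap code:

```python
# Hypothetical sketch of an audio-to-prefix mapper for the captioning part.
# Dimensions are tiny and the weights random, purely for illustration.
import random

AUDIO_DIM, GPT2_DIM, PREFIX_LEN = 4, 6, 3
random.seed(0)
# Stand-in for a learned weight matrix of shape (AUDIO_DIM, PREFIX_LEN * GPT2_DIM).
W = [[random.uniform(-0.1, 0.1) for _ in range(PREFIX_LEN * GPT2_DIM)]
     for _ in range(AUDIO_DIM)]

def map_audio_to_prefix(audio_embedding):
    # Linear projection of the audio embedding into a flat vector.
    flat = [sum(audio_embedding[i] * W[i][j] for i in range(AUDIO_DIM))
            for j in range(PREFIX_LEN * GPT2_DIM)]
    # Reshape into PREFIX_LEN pseudo-token embeddings of GPT-2 hidden size.
    return [flat[k * GPT2_DIM:(k + 1) * GPT2_DIM] for k in range(PREFIX_LEN)]

prefix = map_audio_to_prefix([0.5, -0.2, 0.1, 0.8])
print(len(prefix), len(prefix[0]))  # 3 6
```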
For now, this PR only adds the first part of the architecture (text encoder + audio encoder). I will add the second part in a follow-up PR.
What has been done so far
- [x] Adapted the audio model using the current CLAP architecture in Transformers (LAION CLAP).
- [x] Integrated the text encoder model.
- [x] Successfully converted checkpoints and pushed them to the Hub.
- [x] Added Config files, Feature Extractor and Preprocessor.
- [x] Made sure the FeatureExtractor / Processor and model outputs match those of the original MSClap implementation.
- [ ] To do: add tests.
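The output-matching check from the list above can be sketched like this; `original_forward` and `ported_forward` are hypothetical stand-ins for the original MSClap model and the ported Transformers one, and the tolerance is an assumed value:

```python
# Sketch of the numerical-equivalence check used when porting a model:
# run the same input through the original and the ported implementation
# and require the outputs to match within a small tolerance.
def allclose(xs, ys, atol=1e-5):
    return len(xs) == len(ys) and all(abs(x - y) <= atol for x, y in zip(xs, ys))

def original_forward(inputs):
    # Stand-in for the original MSClap forward pass.
    return [x * 2.0 for x in inputs]

def ported_forward(inputs):
    # Stand-in for the ported forward pass: same math, different code path.
    return [x + x for x in inputs]

sample = [0.1, 0.2, 0.3]
assert allclose(original_forward(sample), ported_forward(sample))
print("outputs match")
```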