
Add X-Codec model

Open Manalelaidouni opened this issue 5 months ago • 3 comments

What does this PR do?

This PR integrates the X-Codec model into transformers.

The X-Codec model is a neural audio codec that integrates semantic information from self-supervised models (e.g., HuBERT) alongside traditional acoustic information. This enables:

  • Music continuation: Better modeling of musical semantics yields more coherent continuations.
  • Text-to-sound synthesis: X-Codec captures semantic alignment between text prompts and generated audio.
  • Semantic-aware audio tokenization: X-Codec is used as an audio tokenizer in the YuE lyrics-to-song generation model.

X-Codec first encodes the audio using an acoustic model (the DAC model). It then extracts semantic information using a pretrained HuBERT model and refines that information with a semantic encoder. The combined acoustic and semantic features are fed into a Residual Vector Quantizer (RVQ), which converts them into discrete codes.
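To illustrate the last step of the paragraph above, here is a minimal sketch of residual vector quantization on a toy feature vector. It is not the X-Codec implementation: the codebooks, the feature values, and the helper names `nearest` and `rvq_encode` are made up for illustration, and combining the acoustic and semantic features by concatenation is an assumption.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared L2)."""
    def dist(entry: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(entry, x))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(
    x: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Quantize x stage by stage; each stage encodes the previous residual."""
    residual = list(x)
    quantized = [0.0] * len(x)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Accumulate the chosen entry and subtract it from the residual.
        quantized = [q + c for q, c in zip(quantized, codebook[idx])]
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, quantized


# Toy stand-ins for one frame of acoustic and semantic features (hypothetical).
acoustic = [1.0, 0.0]
semantic = [0.5, 2.0]
combined = acoustic + semantic  # assumption: features are concatenated

# Two tiny hand-written codebooks (hypothetical values).
codebooks = [
    [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 2.0]],
    [[0.0, 0.0, 0.5, 0.0], [0.0, 0.0, -0.5, 0.0]],
]

codes, quantized = rvq_encode(combined, codebooks)
print(codes)      # one discrete code per quantizer stage
print(quantized)  # sum of the selected codebook entries
```

Each stage only has to encode what the previous stages missed, which is why a stack of small codebooks can approximate the feature vector more closely than a single codebook of the same size.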

Each individual component reproduces the original X-Codec outputs exactly. The one intentional difference is that I removed the extra padding at the end of the decoded audio_values so that the output length matches that of the input. With this change, the produced audio closely resembles the original audio within a 1e-5 tolerance; otherwise, the forward pass is identical to the output of the original model.
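The padding removal and tolerance check described above could look roughly like the sketch below. It uses plain Python lists and made-up sample values; `trim_to_input_length` and `allclose` are hypothetical helpers, not functions from this PR or from transformers, and `reference` stands in for whichever signal the comparison is made against.

```python
from typing import List, Sequence


def trim_to_input_length(decoded: Sequence[float], input_length: int) -> List[float]:
    """Drop trailing samples so the decoded audio has the input's length."""
    return list(decoded[:input_length])


def allclose(a: Sequence[float], b: Sequence[float], atol: float = 1e-5) -> bool:
    """True if both sequences have equal length and match within atol."""
    return len(a) == len(b) and all(abs(x - y) <= atol for x, y in zip(a, b))


# Hypothetical values: three reference samples, decoder emits two extra padded samples.
reference = [0.10, -0.20, 0.30]
decoded = [0.100001, -0.200002, 0.299999, 0.0, 0.0]

trimmed = trim_to_input_length(decoded, len(reference))
print(len(trimmed) == len(reference))
print(allclose(trimmed, reference, atol=1e-5))
```

Comparing the untrimmed output directly would fail on length alone, so the trim has to happen before the tolerance check.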

X-Codec can now be used as a drop-in audio tokenizer in YuE, as suggested in the design discussion in https://github.com/huggingface/transformers/issues/36784.

Who can review?

@eustlb @zucchini-nlp

Manalelaidouni, May 21 '25 04:05