DIFFA
[AAAI 2026] DIFFA: Large Language Diffusion Models Can Listen and Understand
DIFFA is the first diffusion-based large audio-language model (LALM) for spoken language understanding.
It pairs a frozen diffusion LLM with dual adapters (semantic + acoustic) that enhance audio perception and reasoning.
As the first exploration of diffusion-based large language models (dLLMs) for speech and audio understanding, DIFFA opens new directions for non-autoregressive multimodal learning.
This repository provides the training data, checkpoints, inference scripts, and reproducible training pipelines to facilitate further research on diffusion LLMs in the audio domain.
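For intuition, here is a minimal PyTorch sketch of the dual-adapter idea; the module names, dimensions, and summation-based fusion are illustrative assumptions, not DIFFA's actual implementation.

import torch
import torch.nn as nn

class DualAdapter(nn.Module):
    # Maps frozen Whisper features into the frozen diffusion LLM's embedding space.
    def __init__(self, audio_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        # Semantic adapter: aligns spoken content with text semantics.
        self.semantic = nn.Sequential(
            nn.Linear(audio_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        # Acoustic adapter: carries paralinguistic cues (speaker, accent, emotion).
        self.acoustic = nn.Sequential(
            nn.Linear(audio_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, whisper_feats: torch.Tensor) -> torch.Tensor:
        # whisper_feats: (batch, frames, audio_dim) from the speech encoder.
        # Only the adapters receive gradients; the encoder and the dLLM stay frozen.
        return self.semantic(whisper_feats) + self.acoustic(whisper_feats)

In this sketch the fused audio embeddings would be fed to the frozen LLaDA backbone alongside text embeddings; see the released code for the actual fusion and adapter design.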
🔥 News
- 2025.11.11: DIFFA is accepted to AAAI 2026!
- 2025.08.25: Released the DIFFA checkpoint and code!
- 2025.07.25: Our paper is released on arXiv. 🎉
🚀 Overview
Despite training on only 960 hours of ASR data and 127 hours of synthetic instruction data, DIFFA achieves results competitive with models trained on hundreds of thousands of hours of audio.
Figure: Radar chart comparing DIFFA and Qwen2-Audio-Instruct across multiple audio-language benchmarks.
⚙️ Setup
Python Environment
git clone https://github.com/NKU-HLT/DIFFA.git
cd DIFFA
conda create -n diffa python=3.10
conda activate diffa
pip install -r requirements.txt
Checkpoints
Please download and set up the following models: the LLaDA backbone LLM, the Whisper speech encoder, and the released DIFFA checkpoint.
Update llm_path, whisper_path, and model_path in the inference scripts before running.
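For example, the relevant variables might look like this (paths are placeholders, the model versions shown are assumptions, and the exact variable syntax depends on the script):

llm_path = "/path/to/LLaDA-8B-Instruct"      # frozen diffusion LLM backbone
whisper_path = "/path/to/whisper-large-v3"   # speech encoder
model_path = "/path/to/DIFFA-checkpoint"     # released DIFFA weights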
🔍 Inference
We provide inference code for several audio-language benchmarks.
Example (MMSU):
bash run_mmsu_inference.sh
After inference, run evaluate.py for each benchmark to compute final metrics.
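For instance, a full MMSU pass might look like this (the evaluate.py flags are hypothetical; check the script for its actual arguments):

bash run_mmsu_inference.sh
python evaluate.py --benchmark mmsu   # hypothetical flags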
⚠️ Note on Inference Speed
DIFFA's inference is currently slower than that of autoregressive audio-language models, mainly because its backbone, LLaDA, has not yet been optimized for efficiency. In particular, diffusion-based LLMs lack KV-cache support and efficient parallel decoding, so each denoising step runs a full forward pass over the entire sequence. Since this work is the first exploration of diffusion LLMs in the audio domain, our focus is on evaluating capability rather than optimizing speed. If you are interested in acceleration, recent training-free methods such as Fast-dLLM, which reports up to 27.6× faster inference, represent a promising direction for future integration.
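To make the cost concrete, below is a minimal Python sketch of LLaDA-style low-confidence remasking decoding, assuming an HF-style model interface; MASK_ID, the schedule, and the commit rule are illustrative assumptions, not DIFFA's actual code. An autoregressive decoder with a KV-cache runs one incremental forward pass per new token, whereas every diffusion step here re-processes the full sequence.

import torch

MASK_ID = 0  # placeholder mask-token id; LLaDA defines its own special token

def diffusion_decode(model, prompt_ids, answer_len=128, steps=64):
    # Schematic low-confidence remasking decoder (illustrative only).
    ids = torch.tensor(prompt_ids + [MASK_ID] * answer_len)
    n_prompt = len(prompt_ids)
    for step in range(steps):
        # Every step re-runs the model over the WHOLE sequence; nothing can be
        # cached, because any masked position may change on the next iteration.
        logits = model(ids.unsqueeze(0)).logits[0]   # (seq_len, vocab)
        probs, preds = logits.softmax(-1).max(-1)    # per-position confidence
        # Commit a growing fraction of the most confident answer tokens and
        # keep the rest masked for further refinement.
        k = answer_len * (step + 1) // steps
        keep = probs[n_prompt:].topk(k).indices + n_prompt
        ids[n_prompt:] = MASK_ID
        ids[keep] = preds[keep]
    return ids.tolist()

With steps full-sequence passes instead of one cached pass per new token, the gap to autoregressive decoding follows directly; Fast-dLLM-style approximate caching and parallel commits target exactly this loop.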
📖 Training
We provide training scripts for reproducing DIFFA.
Data Preparation
- Stage 1: LibriSpeech
- Stage 2: VoxCeleb1, AccentDB, IEMOCAP, DailyTalk, VCTK-Corpus
Data format and indices are available on Hugging Face.
Training Script
# Stage 1
bash train_stage1.sh
# Stage 2
bash train_stage2.sh
🙏 Acknowledgements
We sincerely thank the open-source projects and authors whose work inspired and facilitated DIFFA, including LLaDA, Whisper, and Fast-dLLM.
📖 Citation
If you find DIFFA useful, please cite:
@article{zhou2025diffa,
  title={DIFFA: Large Language Diffusion Models Can Listen and Understand},
  author={Zhou, Jiaming and Chen, Hongjie and Zhao, Shiwan and Kang, Jian and Li, Jie and Wang, Enzhi and Guo, Yujie and Sun, Haoqin and Wang, Hui and Kong, Aobo and others},
  journal={arXiv preprint arXiv:2507.18452},
  year={2025}
}