FlexSED: Towards Open-Vocabulary Sound Event Detection

arXiv Hugging Face Models Hugging Face Space

FlexSED is an easy-to-use, open-vocabulary sound event detection (SED) system. It can be used for data annotation, labeling, and developing evaluation metrics for audio generation.

News

  • Oct 2025: 📦 Released code and pretrained checkpoint
  • Sep 2025: 🎉 FlexSED Spotlighted at WASPAA 2025

Installation

Clone the repository:

git clone https://github.com/JHU-LCAP/FlexSED.git 

Install the dependencies:

cd FlexSED
pip install -r requirements.txt

Usage

from api import FlexSED
import torch
import soundfile as sf

# load model
flexsed = FlexSED(device='cuda')

# run inference
events = ["Door", "Male Speech", "Laughter", "Dog"]
preds = flexsed.run_inference("example.wav", events)

# visualize prediction
flexsed.to_multi_plot(preds, events, fname="example")

# (Optional) visualize prediction as a video
# flexsed.to_multi_video(preds, events, audio_path="example.wav", fname="example")
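
If you need onset/offset times rather than a plot, frame-level scores can be thresholded into segments. The return format of run_inference is not documented in this README, so the sketch below works on a plain list of per-frame probabilities for a single event. The helper name frames_to_segments, the 0.5 threshold, and the 0.02 s frame hop are assumptions for illustration, not part of the FlexSED API.

```python
def frames_to_segments(probs, threshold=0.5, frame_hop_s=0.02):
    """Convert per-frame probabilities into (onset_s, offset_s) segments.

    probs: sequence of floats in [0, 1], one per frame (assumed layout).
    threshold: frames with probability >= threshold count as active.
    frame_hop_s: assumed time step between frames, in seconds.
    """
    segments = []
    start = None  # index of the first frame of the current active run
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start * frame_hop_s, i * frame_hop_s))
            start = None
    # Close a run that is still active at the end of the clip.
    if start is not None:
        segments.append((start * frame_hop_s, len(probs) * frame_hop_s))
    return segments
```

Check the actual shape and frame rate of preds before applying this to the model output.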

Training

  1. Download the AudioSet-Strong subset. The dataset is available from both WavCaps and HF-AS-Strong. Thanks to the contributors for providing these resources.

  2. Prepare metadata following the preprocessing steps. The processed metadata is also available for reference.

    (If you wish to create a validation split, remove a subset of samples from the training metadata and format them the same as the test metadata. Recommended: ~2000 samples across ~50 sound classes.)

  3. Update file paths for both metadata and audio in src/configs.

  4. Extract CLAP embeddings:

    python src/prepare_clap.py
    
  5. Run training:

    python src/train.py
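
The optional validation split from step 2 could be carved out roughly as follows. This is a minimal sketch, assuming the training metadata has been loaded as a list of dicts with one entry per labeled event. The keys "clip_id" and "label" and the function name split_validation are hypothetical; the real metadata columns may differ, and the held-out entries still need to be reformatted to match the test metadata.

```python
import random


def split_validation(records, num_classes=50, num_clips=2000, seed=0):
    """Hold out whole clips that contain a random subset of classes.

    records: list of dicts with hypothetical keys "clip_id" and "label".
    Returns (train_records, val_records).
    """
    rng = random.Random(seed)

    # Pick the classes the validation split should cover.
    labels = sorted({r["label"] for r in records})
    val_labels = set(rng.sample(labels, min(num_classes, len(labels))))

    # Candidate clips are those with at least one event from those classes.
    candidates = sorted({r["clip_id"] for r in records if r["label"] in val_labels})
    val_clips = set(rng.sample(candidates, min(num_clips, len(candidates))))

    # Move entire clips so no clip appears in both splits.
    train = [r for r in records if r["clip_id"] not in val_clips]
    val = [r for r in records if r["clip_id"] in val_clips]
    return train, val
```

Splitting by clip rather than by event avoids leaving part of a held-out recording in the training set.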
    

Reference

If you find the code useful for your research, please consider citing:

@article{hai2025flexsed,
  title={FlexSED: Towards Open-Vocabulary Sound Event Detection},
  author={Hai, Jiarui and Wang, Helin and Guo, Weizhe and Elhilali, Mounya},
  journal={arXiv preprint arXiv:2509.18606},
  year={2025}
}