FlexSED: Towards Open-Vocabulary Sound Event Detection
FlexSED is an easy-to-use, open-vocabulary sound event detection (SED) system. It can be used for data annotation and labeling, as well as for developing evaluation metrics for audio generation.
News
- Oct 2025: 📦 Released code and pretrained checkpoint
- Sep 2025: 🎉 FlexSED Spotlighted at WASPAA 2025
Installation
Clone the repository:
git clone https://github.com/JHU-LCAP/FlexSED.git
Install the dependencies:
cd FlexSED
pip install -r requirements.txt
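Before running the usage example below (which loads the model on a GPU), it can help to confirm that PyTorch actually sees a CUDA device. This quick check is not part of FlexSED itself, just a generic sanity test:
import torch

# Pick the device the model will run on; fall back to CPU if no GPU is visible.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"PyTorch {torch.__version__}, using device: {device}")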
Usage
from api import FlexSED
import torch
import soundfile as sf
# load model
flexsed = FlexSED(device='cuda')
# run inference
events = ["Door", "Male Speech", "Laughter", "Dog"]
preds = flexsed.run_inference("example.wav", events)
# visualize prediction
flexsed.to_multi_plot(preds, events, fname="example")
# (Optional) visualize prediction as a video
# flexsed.to_multi_video(preds, events, audio_path="example.wav", fname="example")
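The exact structure returned by run_inference depends on the model, so inspect it before post-processing. As a rough, hypothetical sketch: if each queried event comes back as a 1-D array of per-frame probabilities, you could turn it into timestamped segments with a simple threshold. The frame_hop_s and threshold values below are illustrative assumptions, not FlexSED settings:
import numpy as np

def probs_to_segments(probs, frame_hop_s=0.02, threshold=0.5):
    # Turn one per-frame probability track into (onset_s, offset_s) segments.
    # frame_hop_s and threshold are illustrative defaults, not FlexSED settings.
    active = np.asarray(probs).reshape(-1) > threshold
    segments, start = [], None
    for i, on in enumerate(active):
        if on and start is None:
            start = i
        elif not on and start is not None:
            segments.append((start * frame_hop_s, i * frame_hop_s))
            start = None
    if start is not None:
        segments.append((start * frame_hop_s, len(active) * frame_hop_s))
    return segments

# Example usage, assuming preds can be iterated per queried event
# (check the actual return type of run_inference first):
# for event, track in zip(events, preds):
#     print(event, probs_to_segments(track))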
Training
- Download the AudioSet-Strong subset. The dataset is available from both WavCaps and HF-AS-Strong. Thanks to the contributors for providing these resources.
- Prepare metadata following the preprocessing steps. Feel free to check the processed metadata.
  (If you wish to create a validation split, remove a subset of samples from the training metadata and format them the same way as the test metadata. Recommended: ~2000 samples across ~50 sound classes; see the sketch after this list.)
- Update the file paths for both metadata and audio in src/configs.
- Extract CLAP embeddings:
  python src/prepare_clap.py
- Run training:
  python src/train.py
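If you do create a validation split, one simple approach is to randomly hold out a few samples from each of ~50 classes. The sketch below assumes the training metadata is a JSON list of records with a single "label" field; the file names and keys are placeholders, so adapt them to the repo's actual metadata schema:
import json
import random
from collections import defaultdict

# Placeholder paths and key names: adapt them to the repo's actual metadata format.
with open("train_metadata.json") as f:
    records = json.load(f)

random.seed(0)
by_class = defaultdict(list)
for rec in records:
    by_class[rec["label"]].append(rec)  # "label" is an assumed field name

# Hold out roughly 2000 samples across ~50 classes, as recommended above.
val = []
held_out = set()
for cls in random.sample(sorted(by_class), k=min(50, len(by_class))):
    for rec in random.sample(by_class[cls], k=min(40, len(by_class[cls]))):
        val.append(rec)
        held_out.add(id(rec))

train = [rec for rec in records if id(rec) not in held_out]
with open("train_metadata_reduced.json", "w") as f:
    json.dump(train, f)
with open("val_metadata.json", "w") as f:
    json.dump(val, f)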
Reference
If you find the code useful for your research, please consider citing:
@article{hai2025flexsed,
title={FlexSED: Towards Open-Vocabulary Sound Event Detection},
author={Hai, Jiarui and Wang, Helin and Guo, Weizhe and Elhilali, Mounya},
journal={arXiv preprint arXiv:2509.18606},
year={2025}
}