

# Sound AI progress

Tracking the state of the art and recent results (a bibliography) on sound AI topics and audio tasks. Feel free to open PRs with new results!

Inspired by wer_are_we and are_we_there_yet.

## Sound AI or Audio Analytics

Sound AI or Audio Analytics focuses on analyzing and understanding audio signals captured by digital devices, with numerous applications in health & wellbeing, environmental sensing, urban living, and the creative sector.

## Table of Contents

- Sound Event Classification
- Acoustic Scene Classification
- Audio Captioning
- Text to Audio Retrieval
- Audio to Text Retrieval
- Music Classification

## Sound Event Classification

### AudioSet

| Title | Notes | mAP | Paper | Code |
| --- | --- | --- | --- | --- |
| BEATs: Audio Pre-Training with Acoustic Tokenizers | iterative audio pre-training framework to learn bidirectional encoder representation from audio transformers [ensemble] | 0.506 | chen22 | :scroll: |
| PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST [ensemble] | 0.496 | koutini22 | :scroll: |
| HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules [ensemble] | 0.487 | chen2022 | :scroll: |
| BEATs: Audio Pre-Training with Acoustic Tokenizers | iterative audio pre-training framework to learn bidirectional encoder representation from audio transformers | 0.486 | chen22 | :scroll: |
| AST: Audio Spectrogram Transformer | pure attention model pretrained on AudioSet [ensemble] | 0.485 | gong2021 | :scroll: |
| Masked Autoencoders that Listen | extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms | 0.473 | huang2022 | :scroll: |
| PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST [non-ensemble] | 0.471 | koutini22 | :scroll: |
| HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules [non-ensemble] | 0.471 | chen2022 | :scroll: |
| AST: Audio Spectrogram Transformer | pure attention model pretrained on AudioSet [non-ensemble] | 0.459 | gong2021 | :scroll: |
| PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | CNN models trained on AudioSet | 0.439 | kong2019 | :scroll: |
| Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks | Conformer-based self-supervised learning | 0.415 | srivastava2022 | |
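The mAP column is the macro-averaged mean average precision over AudioSet's tag classes. As a rough illustration of how such a score is computed (a minimal sketch, not any paper's evaluation code; the toy scores below are hypothetical):

```python
def average_precision(scores, labels):
    """AP for one class: mean of precision@k taken at each positive's rank."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, ap_sum = 0, 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            ap_sum += hits / rank
    return ap_sum / hits if hits else 0.0

def mean_average_precision(per_class):
    """Macro-average AP over classes; per_class holds (scores, labels) pairs."""
    return sum(average_precision(s, l) for s, l in per_class) / len(per_class)
```

For example, with scores `[0.9, 0.8, 0.1]` and labels `[1, 0, 1]` the positives sit at ranks 1 and 3, so AP = (1/1 + 2/3) / 2 ≈ 0.833.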

### FSD50K

| Title | Notes | mAP | Paper | Code |
| --- | --- | --- | --- | --- |
| PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST | 0.653 | koutini22 | :scroll: |
| Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | 0.649 | wu2022 | :scroll: |
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 0.5859 | elizalde2022 | :scroll: |
| Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 0.4308 | wu2021 | :scroll: |

### ESC-50

| Title | Notes | Accuracy | Paper | Code |
| --- | --- | --- | --- | --- |
| BEATs: Audio Pre-Training with Acoustic Tokenizers | iterative audio pre-training framework to learn bidirectional encoder representation from audio transformers | 98.1% | chen22 | :scroll: |
| Masked Autoencoders that Listen | image-based MAE for audio spectrograms | 97.4% | huang2022 | :scroll: |
| HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules | 97.00% | chen2022 | :scroll: |
| PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST | 96.8% | koutini22 | :scroll: |
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 96.70% | elizalde2022 | :scroll: |
| AST: Audio Spectrogram Transformer | pure attention model pretrained on AudioSet | 95.70% | gong2021 | :scroll: |
| Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer | Transformer model pretrained with visual image supervision | 95.70% | zhao2022 | :scroll: |
| A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition | multi-stage sequential learning with knowledge transfer from AudioSet | 94.10% | kumar2020 | |
| Efficient End-to-End Audio Embeddings Generation for Audio Classification on Target Applications | CNN model pretrained on AudioSet | 92.32% | lopez-meyer2021 | |
| Urban Sound Tagging using Multi-Channel Audio Feature with Convolutional Neural Networks | pretrained model with multi-channel features | 89.50% | kim2020 | :scroll: |
| An Ensemble of Convolutional Neural Networks for Audio Classification | CNN ensemble with data augmentation | 88.65% | nanni2020 | :scroll: |
| Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices | CNN model (ACDNet) with potential compression | 87.1% | mohaimenuzzaman2021 | :scroll: |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC and mel energies | 86.50% | sailor2017 | |
| Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 85.95% | wu2021 | :scroll: |
| AclNet: efficient end-to-end audio classification CNN | CNN with mixup and data augmentation | 85.65% | huang2018 | |
| On Open-Set Classification with L3-Net Embeddings for Machine Listening Applications | x-vector network with OpenL3 embeddings | 85.00% | wilkinghoff2020 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation + Between-Class learning | 84.90% | tokozume2017b | |
| Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs), fusion with mel energies | 84.15% | tak2017 | |
| Knowledge Transfer from Weakly Labeled Audio using Convolutional Neural Network for Sound Events and Scenes | CNN pretrained on AudioSet | 83.50% | kumar2017 | :scroll: |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC | 83.00% | sailor2017 | |
| Deep Multimodal Clustering for Unsupervised Audiovisual Learning | CNN + unsupervised audio-visual learning | 82.60% | hu2019 | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | fusion of GTSC & TEO-GTSC with CNN | 81.95% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + Between-Class learning | 81.80% | tokozume2017b | |
| :headphones: Human accuracy | crowdsourcing experiment in classifying ESC-50 by human listeners | 81.30% | piczak2015a | :scroll: |
| Objects that Sound | Look, Listen and Learn (L3) network (arandjelovic2017a) with stride 2, larger batches and learning rate schedule | 79.80% | arandjelovic2017b | |
| Look, Listen and Learn | 8-layer convolutional subnetwork pretrained on an audio-visual correspondence task | 79.30% | arandjelovic2017a | |
| Learning Environmental Sounds with Multi-scale Convolutional Neural Network | multi-scale convolutions with feature fusion (waveform + spectrogram) | 79.10% | zhu2018 | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | GTSC with CNN | 79.10% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation | 78.80% | tokozume2017b | |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM | 78.45% | sailor2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | baseline CNN (piczak2015b) + Batch Normalization + Between-Class learning | 76.90% | tokozume2017b | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTSC with CNN | 74.85% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) | 74.40% | tokozume2017b | |
| SoundNet: Learning sound representations from unlabeled video | 8-layer CNN (raw audio) with transfer learning from unlabeled videos | 74.20% | aytar2016 | :scroll: |
| Learning from Between-class Examples for Deep Sound Recognition | 18-layer CNN on raw waveforms (dai2016) + Between-Class learning | 73.30% | tokozume2017b | |
| Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs) | 73.25% | tak2017 | |
| Classifying environmental sounds using image recognition networks | 16 kHz sampling rate, GoogLeNet on spectrograms (40 ms frame length) | 73.20% | boddapati2017 | :scroll: |
| Learning from Between-class Examples for Deep Sound Recognition | baseline CNN (piczak2015b) + Batch Normalization | 72.40% | tokozume2017b | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | fusion of MFCC & TEO-GTCC with GMM | 72.25% | agrawal2017 | |
| Learning environmental sounds with end-to-end convolutional neural network (EnvNet) | combination of spectrogram and raw waveform CNN | 71.00% | tokozume2017a | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTCC with GMM | 68.85% | agrawal2017 | |
| Classifying environmental sounds using image recognition networks | 16 kHz sampling rate, AlexNet on spectrograms (30 ms frame length) | 68.70% | boddapati2017 | :scroll: |
| Very Deep Convolutional Neural Networks for Raw Waveforms | 18-layer CNN on raw waveforms | 68.50% | dai2016, tokozume2017b | :scroll: |
| Classifying environmental sounds using image recognition networks | 32 kHz sampling rate, GoogLeNet on spectrograms (30 ms frame length) | 67.80% | boddapati2017 | :scroll: |
| WSNet: Learning Compact and Efficient Networks with Weight Sampling | SoundNet 8-layer CNN architecture with 100x model compression | 66.25% | jin2017 | |
| SoundNet: Learning sound representations from unlabeled video | 5-layer CNN (raw audio) with transfer learning from unlabeled videos | 66.10% | aytar2016 | :scroll: |
| WSNet: Learning Compact and Efficient Networks with Weight Sampling | SoundNet 8-layer CNN architecture with 180x model compression | 65.80% | jin2017 | |
| SoundNet: Learning sound representations from unlabeled video | 5-layer CNN trained on raw audio of ESC-50 only | 65.00% | aytar2016 | :scroll: |
| :bar_chart: Environmental Sound Classification with Convolutional Neural Networks - CNN baseline | CNN with 2 convolutional and 2 fully-connected layers, mel-spectrograms as input, vertical filters in the first layer | 64.50% | piczak2015b | :scroll: |
| auDeep: Unsupervised Learning of Representations from Audio with Deep Recurrent Neural Networks | MLP classifier on features extracted with an RNN autoencoder | 64.30% | freitag2017 | :scroll: |
| Classifying environmental sounds using image recognition networks | 32 kHz sampling rate, AlexNet on spectrograms (30 ms frame length) | 63.20% | boddapati2017 | :scroll: |
| Classifying environmental sounds using image recognition networks | CRNN | 60.30% | boddapati2017 | :scroll: |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 3-layer CNN with vertical filters on wideband mel-STFT (median accuracy) | 56.37% | huzaifah2017 | |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 3-layer CNN with square filters on wideband mel-STFT (median accuracy) | 54.00% | huzaifah2017 | |
| SoundNet: Learning sound representations from unlabeled video | 8-layer CNN trained on raw audio of ESC-50 only | 51.10% | aytar2016 | :scroll: |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 5-layer CNN with square filters on wideband mel-STFT (median accuracy) | 50.87% | huzaifah2017 | |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 5-layer CNN with vertical filters on wideband mel-STFT (median accuracy) | 46.25% | huzaifah2017 | |
| :bar_chart: Baseline - random forest | baseline ML approach (MFCC & ZCR + random forest) | 44.30% | piczak2015a | :scroll: |
| SoundNet: Learning sound representations from unlabeled video | convolutional autoencoder trained on unlabeled videos | 39.90% | aytar2016 | :scroll: |
| :bar_chart: Baseline - SVM | baseline ML approach (MFCC & ZCR + SVM) | 39.60% | piczak2015a | :scroll: |
| :bar_chart: Baseline - k-NN | baseline ML approach (MFCC & ZCR + k-NN) | 32.20% | piczak2015a | :scroll: |
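ESC-50 accuracies are conventionally reported as classification accuracy averaged over the dataset's five predefined cross-validation folds. A minimal sketch of that bookkeeping (the fold predictions and labels below are hypothetical):

```python
def accuracy(preds, truths):
    """Fraction of correctly classified clips in one fold."""
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)

def cross_fold_accuracy(folds):
    """Mean accuracy over (predictions, ground_truth) pairs, one per fold."""
    return sum(accuracy(p, t) for p, t in folds) / len(folds)
```

For example, a fold scoring 1.0 and a fold scoring 0.0 average to a reported accuracy of 0.5.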

### US8K

| Title | Notes | Accuracy | Paper | Code |
| --- | --- | --- | --- | --- |
| AudioCLIP: Extending CLIP to Image, Text and Audio | incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset | 90.07% | guzhov2021 | :scroll: |
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 87.96% | elizalde2022 | :scroll: |
| Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 81.01% | wu2021 | :scroll: |

### VocalSound

| Title | Notes | Accuracy | Paper | Code |
| --- | --- | --- | --- | --- |
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 97.95% | elizalde2022 | :scroll: |
| Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition | EfficientNet-B0 | 90.5% | gong2022 | :scroll: |

### VGGSound

| Title | Notes | Accuracy | Paper | Code |
| --- | --- | --- | --- | --- |
| Slow-Fast Auditory Streams For Audio Recognition | two-stream convolutional network for audio recognition | 54.4% | kazakos2022 | :scroll: |
| Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 46.63% | wu2021 | :scroll: |

## Acoustic Scene Classification

## Audio Captioning

### AudioCaps

| Title | Notes | SPIDEr | Paper | Code |
| --- | --- | --- | --- | --- |
| Audio Captioning Transformer | Transformer network based on an encoder-decoder architecture | 0.426 | mei2021 | :scroll: |

### Clotho

| Title | Notes | SPIDEr | Paper | Code |
| --- | --- | --- | --- | --- |
| WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information | two-branch audio encoder for learning temporal and local time-frequency information | 0.182 | tran2020 | :scroll: |

## Text to Audio Retrieval

### AudioCaps

| Title | Notes | mAP@10 | R@1 | Paper | Code |
| --- | --- | --- | --- | --- | --- |
| Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | | 36.7 | wu2022 | :scroll: |
| Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 36.1 | koepke2022 | :scroll: |
| Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 49.45 | 34.69 | deshmukh2022 | :scroll: |
| On metric learning for audio-text cross-modal retrieval | metric learning objectives for audio retrieval | | 33.9 | mei2022 | :scroll: |

### Clotho

| Title | Notes | mAP@10 | R@1 | Paper | Code |
| --- | --- | --- | --- | --- | --- |
| Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | | 18.2 | wu2022 | :scroll: |
| Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 27.12 | 16.75 | deshmukh2022 | :scroll: |
| On metric learning for audio-text cross-modal retrieval | metric learning objectives for audio retrieval | | 14.4 | mei2022 | :scroll: |
| Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 6.7 | koepke2022 | :scroll: |
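R@1 is the percentage of text queries whose matching audio clip ranks first; mAP@10 additionally rewards correct items appearing anywhere in the top 10. A minimal Recall@k sketch over a toy similarity matrix (the scores are hypothetical, with each query's ground-truth item on the diagonal):

```python
def recall_at_k(similarity, k):
    """similarity[i][j]: score of candidate j for query i; item i matches query i.
    Returns the fraction of queries whose match appears in the top-k ranking."""
    hits = 0
    for i, row in enumerate(similarity):
        top_k = sorted(range(len(row)), key=lambda j: -row[j])[:k]
        hits += i in top_k
    return hits / len(similarity)
```

For `[[0.1, 0.9], [0.2, 0.8]]` only query 1 ranks its match first, so R@1 = 0.5, i.e. 50 in the tables' percentage convention.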

## Audio to Text Retrieval

### AudioCaps

| Title | Notes | mAP@10 | R@1 | Paper | Code |
| --- | --- | --- | --- | --- | --- |
| Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | | 46.8 | wu2022 | :scroll: |
| Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 30.81 | 41.91 | deshmukh2022 | :scroll: |
| On metric learning for audio-text cross-modal retrieval | metric learning objectives for audio retrieval | | 39.6 | mei2022 | :scroll: |
| Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 39.6 | koepke2022 | :scroll: |

### Clotho

| Title | Notes | mAP@10 | R@1 | Paper | Code |
| --- | --- | --- | --- | --- | --- |
| Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | | 25.7 | wu2022 | :scroll: |
| Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 13.65 | 20.00 | deshmukh2022 | :scroll: |
| On metric learning for audio-text cross-modal retrieval | metric learning objectives for audio retrieval | | 16.9 | mei2022 | :scroll: |
| Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 7.2 | koepke2022 | :scroll: |

## Music Classification

### GTZAN Genres

| Title | Notes | Accuracy | Paper | Code |
| --- | --- | --- | --- | --- |
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 91.3% | elizalde2022 | :scroll: |
| PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST [HEAR Challenge] | 88.3% | koutini22 | :scroll: |
| PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | CNN models trained on AudioSet [HEAR Challenge] | 86.0% | kong2019 | :scroll: |
| Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP [HEAR Challenge] | 74.8% | wu2021 | :scroll: |

### GTZAN Music Speech

| Title | Notes | Accuracy | Paper | Code |
| --- | --- | --- | --- | --- |
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 100% | elizalde2022 | :scroll: |
| PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | CNN models trained on AudioSet [HEAR Challenge] | 99.23% | kong2019 | :scroll: |
| PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST [HEAR Challenge] | 97.69% | koutini22 | :scroll: |
| Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP [HEAR Challenge] | 94.55% | wu2021 | :scroll: |

## Glossary

- SED: Sound Event Detection
- ASC: Acoustic Scene Classification