# Sound AI progress

Tracking the state of the art and recent results (a bibliography) on sound AI topics and audio tasks. Feel free to create PRs for new results!
Inspired by wer_are_we and are_we_there_yet

## Sound AI or Audio Analytics

Sound AI or Audio Analytics focuses on analyzing and understanding audio signals captured by digital devices, with numerous applications in health & wellbeing, environmental sensing, urban living, and the creative sector.

## Table of Contents

- Sound Event Classification
- Acoustic Scene Classification
- Audio Captioning
- Text to Audio Retrieval
- Audio to Text Retrieval
- Music Classification

## Sound Event Classification

### AudioSet

Title | Notes | mAP | Paper | Code |
---|---|---|---|---|
BEATs: Audio Pre-Training with Acoustic Tokenizers | iterative audio pre-training framework to learn bidirectional encoder representation from audio transformers [ensemble] | 0.506 | chen22 | :scroll: |
PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST [ensemble] | 0.496 | koutini22 | :scroll: |
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules [ensemble] | 0.487 | chen2022 | :scroll: |
BEATs: Audio Pre-Training with Acoustic Tokenizers | iterative audio pre-training framework to learn bidirectional encoder representation from audio transformers | 0.486 | chen22 | :scroll: |
AST: Audio Spectrogram Transformer | Pure Attention Model Pretrained on AudioSet [ensemble] | 0.485 | gong2021 | :scroll: |
Masked Autoencoders that Listen | extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms | 0.473 | huang2022 | :scroll: |
PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST [non-ensemble] | 0.471 | koutini22 | :scroll: |
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules [non-ensemble] | 0.471 | chen2022 | :scroll: |
AST: Audio Spectrogram Transformer | Pure Attention Model Pretrained on AudioSet [non-ensemble] | 0.459 | gong2021 | :scroll: |
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | CNN models trained on AudioSet | 0.439 | kong2019 | :scroll: |
Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks | Conformer-based self-supervised learning | 0.415 | srivastava2022 | |
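
The mAP column above is class-wise mean average precision, the standard metric for multi-label tagging on AudioSet (and on FSD50K below): average precision is computed per class and then averaged, unweighted, over classes. A minimal sketch of that computation, assuming scikit-learn is available; the array shapes and names are illustrative, not taken from any of the papers:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """y_true, y_score: (n_clips, n_classes) binary labels and model scores."""
    ap_per_class = [
        average_precision_score(y_true[:, c], y_score[:, c])
        for c in range(y_true.shape[1])
        if y_true[:, c].any()  # skip classes with no positives in the eval set
    ]
    return float(np.mean(ap_per_class))

# Toy example: 4 clips, 3 classes; every class is ranked perfectly, so mAP = 1.0.
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.1], [0.3, 0.8, 0.2], [0.7, 0.6, 0.1], [0.1, 0.2, 0.9]])
print(mean_average_precision(y_true, y_score))
```
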

### FSD50K

Title | Notes | mAP | Paper | Code |
---|---|---|---|---|
PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST | 0.653 | koutini22 | :scroll: |
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | 0.649 | wu2022 | :scroll: |
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 0.5859 | elizalde2022 | :scroll: |
Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 0.4308 | wu2021 | :scroll: |

### ESC50

Title | Notes | Accuracy | Paper | Code |
---|---|---|---|---|
BEATs: Audio Pre-Training with Acoustic Tokenizers | iterative audio pre-training framework to learn bidirectional encoder representation from audio transformers | 98.1% | chen22 | :scroll: |
Masked Autoencoders that Listen | Image-based MAE for audio spectrograms | 97.4% | huang2022 | :scroll: |
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules | 97.00% | chen2022 | :scroll: |
PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST | 96.8% | koutini22 | :scroll: |
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 96.70% | elizalde2022 | :scroll: |
AST: Audio Spectrogram Transformer | Pure Attention Model Pretrained on AudioSet | 95.70% | gong2021 | :scroll: |
Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer | Transformer model pretrained with visual image supervision | 95.70% | zhao2022 | :scroll: |
A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition | Multi-stage sequential learning with knowledge transfer from Audioset | 94.10% | kumar2020 | |
Efficient End-to-End Audio Embeddings Generation for Audio Classification on Target Applications | CNN model pretrained on AudioSet | 92.32% | lopez-meyer2021 | |
Urban Sound Tagging using Multi-Channel Audio Feature with Convolutional Neural Networks | Pretrained model with multi-channel features | 89.50% | kim2020 | :scroll: |
An Ensemble of Convolutional Neural Networks for Audio Classification | CNN ensemble with data augmentation | 88.65% | nanni2020 | :scroll: |
Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices | CNN model (ACDNet) with potential compression | 87.1% | mohaimenuzzaman2021 | :scroll: |
Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC and mel energies | 86.50% | sailor2017 | |
Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 85.95% | wu2021 | :scroll: |
AclNet: efficient end-to-end audio classification CNN | CNN with mixup and data augmentation | 85.65% | huang2018 | |
On Open-Set Classification with L3-Net Embeddings for Machine Listening Applications | x-vector network with openll3 embeddings | 85.00% | wilkinghoff2020 | |
Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation + Between-Class learning | 84.90% | tokozume2017b | |
Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs), fusion with Mel energies | 84.15% | tak2017 | |
Knowledge Transfer from Weakly Labeled Audio using Convolutional Neural Network for Sound Events and Scenes | CNN pretrained on AudioSet | 83.50% | kumar2017 | :scroll: |
Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC | 83.00% | sailor2017 | |
Deep Multimodal Clustering for Unsupervised Audiovisual Learning | CNN + unsupervised audio-visual learning | 82.60% | hu2019 | |
Novel TEO-based Gammatone Features for Environmental Sound Classification | Fusion of GTSC & TEO-GTSC with CNN | 81.95% | agrawal2017 | |
Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + Between-Class learning | 81.80% | tokozume2017b | |
:headphones: Human accuracy | Crowdsourcing experiment in classifying ESC-50 by human listeners | 81.30% | piczak2015a | :scroll: |
Objects that Sound | Look, Listen and Learn (L3) network (arandjelovic2017a) with stride 2, larger batches and learning rate schedule | 79.80% | arandjelovic2017b | |
Look, Listen and Learn | 8-layer convolutional subnetwork pretrained on an audio-visual correspondence task | 79.30% | arandjelovic2017a | |
Learning Environmental Sounds with Multi-scale Convolutional Neural Network | Multi-scale convolutions with feature fusion (waveform + spectrogram) | 79.10% | zhu2018 | |
Novel TEO-based Gammatone Features for Environmental Sound Classification | GTSC with CNN | 79.10% | agrawal2017 | |
Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation | 78.80% | tokozume2017b | |
Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM | 78.45% | sailor2017 | |
Learning from Between-class Examples for Deep Sound Recognition | Baseline CNN (piczak2015b) + Batch Normalization + Between-Class learning | 76.90% | tokozume2017b | |
Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTSC with CNN | 74.85% | agrawal2017 | |
Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) | 74.40% | tokozume2017b | |
Soundnet: Learning sound representations from unlabeled video | 8-layer CNN (raw audio) with transfer learning from unlabeled videos | 74.20% | aytar2016 | :scroll: |
Learning from Between-class Examples for Deep Sound Recognition | 18-layer CNN on raw waveforms (dai2016) + Between-Class learning | 73.30% | tokozume2017b | |
Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs) | 73.25% | tak2017 | |
Classifying environmental sounds using image recognition networks | 16 kHz sampling rate, GoogLeNet on spectrograms (40 ms frame length) | 73.20% | boddapati2017 | :scroll: |
Learning from Between-class Examples for Deep Sound Recognition | Baseline CNN (piczak2015b) + Batch Normalization | 72.40% | tokozume2017b | |
Novel TEO-based Gammatone Features for Environmental Sound Classification | Fusion of MFCC & TEO-GTCC with GMM | 72.25% | agrawal2017 | |
Learning environmental sounds with end-to-end convolutional neural network (EnvNet) | Combination of spectrogram and raw waveform CNN | 71.00% | tokozume2017a | |
Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTCC with GMM | 68.85% | agrawal2017 | |
Classifying environmental sounds using image recognition networks | 16 kHz sampling rate, AlexNet on spectrograms (30 ms frame length) | 68.70% | boddapati2017 | :scroll: |
Very Deep Convolutional Neural Networks for Raw Waveforms | 18-layer CNN on raw waveforms | 68.50% | dai2016, tokozume2017b | :scroll: |
Classifying environmental sounds using image recognition networks | 32 kHz sampling rate, GoogLeNet on spectrograms (30 ms frame length) | 67.80% | boddapati2017 | :scroll: |
WSNet: Learning Compact and Efficient Networks with Weight Sampling | SoundNet 8-layer CNN architecture with 100x model compression | 66.25% | jin2017 | |
Soundnet: Learning sound representations from unlabeled video | 5-layer CNN (raw audio) with transfer learning from unlabeled videos | 66.10% | aytar2016 | :scroll: |
WSNet: Learning Compact and Efficient Networks with Weight Sampling | SoundNet 8-layer CNN architecture with 180x model compression | 65.80% | jin2017 | |
Soundnet: Learning sound representations from unlabeled video | 5-layer CNN trained on raw audio of ESC-50 only | 65.00% | aytar2016 | :scroll: |
:bar_chart: Environmental Sound Classification with Convolutional Neural Networks - CNN baseline | CNN with 2 convolutional and 2 fully-connected layers, mel-spectrograms as input, vertical filters in the first layer | 64.50% | piczak2015b | :scroll: |
auDeep: Unsupervised Learning of Representations from Audio with Deep Recurrent Neural Networks | MLP classifier on features extracted with an RNN autoencoder | 64.30% | freitag2017 | :scroll: |
Classifying environmental sounds using image recognition networks | 32 kHz sampling rate, AlexNet on spectrograms (30 ms frame length) | 63.20% | boddapati2017 | :scroll: |
Classifying environmental sounds using image recognition networks | CRNN | 60.30% | boddapati2017 | :scroll: |
Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 3-layer CNN with vertical filters on wideband mel-STFT (median accuracy) | 56.37% | huzaifah2017 | |
Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 3-layer CNN with square filters on wideband mel-STFT (median accuracy) | 54.00% | huzaifah2017 | |
Soundnet: Learning sound representations from unlabeled video | 8-layer CNN trained on raw audio of ESC-50 only | 51.10% | aytar2016 | :scroll: |
Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 5-layer CNN with square filters on wideband mel-STFT (median accuracy) | 50.87% | huzaifah2017 | |
Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 5-layer CNN with vertical filters on wideband mel-STFT (median accuracy) | 46.25% | huzaifah2017 | |
:bar_chart: Baseline - random forest | Baseline ML approach (MFCC & ZCR + random forest) | 44.30% | piczak2015a | :scroll: |
Soundnet: Learning sound representations from unlabeled video | Convolutional autoencoder trained on unlabeled videos | 39.90% | aytar2016 | :scroll: |
:bar_chart: Baseline - SVM | Baseline ML approach (MFCC & ZCR + SVM) | 39.60% | piczak2015a | :scroll: |
:bar_chart: Baseline - k-NN | Baseline ML approach (MFCC & ZCR + k-NN) | 32.20% | piczak2015a | :scroll: |
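
ESC-50 accuracies like those above are conventionally averaged over the dataset's five predefined folds, training on four folds and testing on the held-out one. A minimal sketch of that protocol; `train_and_predict` is a hypothetical stand-in for whatever model and training loop is being evaluated:

```python
import numpy as np

def cross_validated_accuracy(features, labels, folds, train_and_predict):
    """folds: per-clip fold ids (1..5), as listed in ESC-50's meta/esc50.csv."""
    accuracies = []
    for held_out in np.unique(folds):
        train, test = folds != held_out, folds == held_out
        # train_and_predict is hypothetical: fit on the train split and
        # return predicted class ids for the test split.
        predictions = train_and_predict(features[train], labels[train], features[test])
        accuracies.append(np.mean(predictions == labels[test]))
    return float(np.mean(accuracies))
```
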

### US8K

Title | Notes | Accuracy | Paper | Code |
---|---|---|---|---|
AudioCLIP: Extending CLIP to Image, Text and Audio | incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset | 90.07% | guzhov2021 | :scroll: |
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 87.96% | elizalde2022 | :scroll: |
Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 81.01% | wu2021 | :scroll: |

### VocalSound

Title | Notes | Accuracy | Paper | Code |
---|---|---|---|---|
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 97.95% | elizalde2022 | :scroll: |
Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition | EfficientNetB0 | 90.5% | gong2022 | :scroll: |

### VGGSound

Title | Notes | Accuracy | Paper | Code |
---|---|---|---|---|
Slow-Fast Auditory Streams For Audio Recognition | two-stream convolutional network for audio recognition | 54.4% | kazakos2022 | :scroll: |
Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 46.63% | wu2021 | :scroll: |

## Acoustic Scene Classification

## Audio Captioning

### AudioCaps

Title | Notes | SPIDEr | Paper | Code |
---|---|---|---|---|
Audio Captioning Transformer | Transformer network based on an encoder-decoder architecture | 0.426 | mei2021 | :scroll: |

### Clotho

Title | Notes | SPIDEr | Paper | Code |
---|---|---|---|---|
WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information | two-branch audio encoder for learning temporal and local time-frequency information | 0.182 | tran2020 | :scroll: |
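
SPIDEr, the metric in both captioning tables, is the arithmetic mean of SPICE and CIDEr, so a strong score requires captions that match the references both semantically and in n-gram consensus. A minimal sketch, assuming per-caption SPICE and CIDEr scores are already available (e.g. from a COCO-caption-style evaluation toolkit):

```python
# SPIDEr is defined as the average of SPICE and CIDEr; the example
# values below are made up for illustration.
def spider(spice_score: float, cider_score: float) -> float:
    return 0.5 * (spice_score + cider_score)

print(spider(0.17, 0.68))  # -> 0.425
```
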

## Text to Audio Retrieval

### AudioCaps

Title | Notes | mAP@10 | R@1 | Paper | Code |
---|---|---|---|---|---|
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | | 36.7 | wu2022 | :scroll: |
Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 36.1 | koepke2022 | :scroll: |
Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 49.45 | 34.69 | deshmukh2022 | :scroll: |
On metric learning for audio-text cross-modal retrieval | Metric learning objectives for audio retrieval | | 33.9 | mei2022 | :scroll: |
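
R@1 is the fraction of queries whose top-ranked item is relevant; mAP@10 averages precision over the relevant items found in the top 10 ranks. A minimal sketch of both from a query-by-item similarity matrix, simplified to one relevant item per query (AudioCaps and Clotho actually pair each clip with several captions, which the published numbers account for):

```python
import numpy as np

def recall_at_1(similarity: np.ndarray, relevant: np.ndarray) -> float:
    """relevant[q] is the index of the ground-truth item for query q."""
    return float(np.mean(similarity.argmax(axis=1) == relevant))

def map_at_10(similarity: np.ndarray, relevant: np.ndarray) -> float:
    # With a single relevant item, average precision at 10 reduces to
    # 1/rank if the item appears in the top 10, else 0.
    top10 = np.argsort(-similarity, axis=1)[:, :10]
    scores = []
    for q in range(similarity.shape[0]):
        hits = np.where(top10[q] == relevant[q])[0]
        scores.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(scores))
```

For text-to-audio retrieval the queries are captions and the items are audio clips; the audio-to-text tables below swap the two roles.
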

### Clotho

Title | Notes | mAP@10 | R@1 | Paper | Code |
---|---|---|---|---|---|
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | | 18.2 | wu2022 | :scroll: |
Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 27.12 | 16.75 | deshmukh2022 | :scroll: |
On metric learning for audio-text cross-modal retrieval | Metric learning objectives for audio retrieval | | 14.4 | mei2022 | :scroll: |
Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 6.7 | koepke2022 | :scroll: |

## Audio to Text Retrieval

### AudioCaps

Title | Notes | mAP@10 | R@1 | Paper | Code |
---|---|---|---|---|---|
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | | 46.8 | wu2022 | :scroll: |
Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 30.81 | 41.91 | deshmukh2022 | :scroll: |
On metric learning for audio-text cross-modal retrieval | Metric learning objectives for audio retrieval | | 39.6 | mei2022 | :scroll: |
Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 39.6 | koepke2022 | :scroll: |

### Clotho

Title | Notes | mAP@10 | R@1 | Paper | Code |
---|---|---|---|---|---|
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | | 25.7 | wu2022 | :scroll: |
Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 13.65 | 20.00 | deshmukh2022 | :scroll: |
On metric learning for audio-text cross-modal retrieval | Metric learning objectives for audio retrieval | | 16.9 | mei2022 | :scroll: |
Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 7.2 | koepke2022 | :scroll: |

## Music Classification

### GTZAN Genres

Title | Notes | Accuracy | Paper | Code |
---|---|---|---|---|
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 91.3% | elizalde2022 | :scroll: |
PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST [HEAR Challenge] | 88.3% | koutini22 | :scroll: |
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | CNN models trained on AudioSet [HEAR Challenge] | 86.0% | kong2019 | :scroll: |
Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP [HEAR Challenge] | 74.8% | wu2021 | :scroll: |

### GTZAN Music Speech

Title | Notes | Accuracy | Paper | Code |
---|---|---|---|---|
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 100% | elizalde2022 | :scroll: |
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | CNN models trained on AudioSet [HEAR Challenge] | 99.23% | kong2019 | :scroll: |
PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST [HEAR Challenge] | 97.69% | koutini22 | :scroll: |
Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP [HEAR Challenge] | 94.55% | wu2021 | :scroll: |

## Glossary

- SED: Sound Event Detection
- ASC: Acoustic Scene Classification