# Sound AI progress

Tracking the state of the art and recent results (a bibliography) on sound AI topics and audio tasks. Feel free to create PRs for new results!
Inspired by wer_are_we and are_we_there_yet

## Sound AI or Audio Analytics

Sound AI or Audio Analytics focuses on analyzing and understanding audio signals captured by digital devices, with numerous applications in health & wellbeing, environmental sensing, urban living, and the creative sector.

## Table of Contents

- Sound Event Classification
- Acoustic Scene Classification
- Audio Captioning
- Text to Audio Retrieval
- Audio to Text Retrieval
- Music Classification

## Sound Event Classification

### AudioSet

Title | Notes | mAP | Paper | Code |
---|---|---|---|---|
BEATs: Audio Pre-Training with Acoustic Tokenizers | iterative audio pre-training framework to learn bidirectional encoder representation from audio transformers [ensemble] | 0.506 | chen22 | :scroll: |
PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST [ensemble] | 0.496 | koutini22 | :scroll: |
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules [ensemble] | 0.487 | chen2022 | :scroll: |
BEATs: Audio Pre-Training with Acoustic Tokenizers | iterative audio pre-training framework to learn bidirectional encoder representation from audio transformers | 0.486 | chen22 | :scroll: |
AST: Audio Spectrogram Transformer | Pure Attention Model Pretrained on AudioSet [ensemble] | 0.485 | gong2021 | :scroll: |
Masked Autoencoders that Listen | extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms | 0.473 | huang2022 | :scroll: |
PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST [non-ensemble] | 0.471 | koutini22 | :scroll: |
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules [non-ensemble] | 0.471 | chen2022 | :scroll: |
AST: Audio Spectrogram Transformer | Pure Attention Model Pretrained on AudioSet [non-ensemble] | 0.459 | gong2021 | :scroll: |
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | CNN models trained on AudioSet | 0.439 | kong2019 | :scroll: |
Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks | Conformer-based self-supervised learning | 0.415 | srivastava2022 | |
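
The mAP column above is class-wise mean average precision, the standard metric for multi-label tagging on AudioSet (and on FSD50K below): average precision is computed per class and then averaged, unweighted, over classes. A minimal sketch of that computation, assuming scikit-learn is available; the array shapes and names are illustrative, not taken from any of the papers:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """y_true, y_score: (n_clips, n_classes) binary labels and model scores."""
    ap_per_class = [
        average_precision_score(y_true[:, c], y_score[:, c])
        for c in range(y_true.shape[1])
        if y_true[:, c].any()  # skip classes with no positives in the eval set
    ]
    return float(np.mean(ap_per_class))

# Toy example: 4 clips, 3 classes; every class is ranked perfectly, so mAP = 1.0.
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.1], [0.3, 0.8, 0.2], [0.7, 0.6, 0.1], [0.1, 0.2, 0.9]])
print(mean_average_precision(y_true, y_score))
```
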

### FSD50K

Title | Notes | mAP | Paper | Code |
---|---|---|---|---|
PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST | 0.653 | koutini22 | :scroll: |
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | 0.649 | wu2022 | :scroll: |
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 0.5859 | elizalde2022 | :scroll: |
Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 0.4308 | wu2021 | :scroll: |

### ESC50

Title | Notes | Accuracy | Paper | Code |
---|---|---|---|---|
BEATs: Audio Pre-Training with Acoustic Tokenizers | iterative audio pre-training framework to learn bidirectional encoder representation from audio transformers | 98.1% | chen22 | :scroll: |
Masked Autoencoders that Listen | Image-based MAE for audio spectrograms | 97.4% | huang2022 | :scroll: |
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules | 97.00% | chen2022 | :scroll: |
PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST | 96.8% | koutini22 | :scroll: |
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 96.70% | elizalde2022 | :scroll: |
AST: Audio Spectrogram Transformer | Pure Attention Model Pretrained on AudioSet | 95.70% | gong2021 | :scroll: |
Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer | Transformer model pretrained with visual image supervision | 95.70% | zhao2022 | :scroll: |
A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition | Multi-stage sequential learning with knowledge transfer from Audioset | 94.10% | kumar2020 | |
Efficient End-to-End Audio Embeddings Generation for Audio Classification on Target Applications | CNN model pretrained on AudioSet | 92.32% | lopez-meyer2021 | |
Urban Sound Tagging using Multi-Channel Audio Feature with Convolutional Neural Networks | Pretrained model with multi-channel features | 89.50% | kim2020 | :scroll: |
An Ensemble of Convolutional Neural Networks for Audio Classification | CNN ensemble with data augmentation | 88.65% | nanni2020 | :scroll: |
Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices | CNN model (ACDNet) with potential compression | 87.1% | mohaimenuzzaman2021 | :scroll: |
Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC and mel energies | 86.50% | sailor2017 | |
Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 85.95% | wu2021 | :scroll: |
AclNet: efficient end-to-end audio classification CNN | CNN with mixup and data augmentation | 85.65% | huang2018 | |
On Open-Set Classification with L3-Net Embeddings for Machine Listening Applications | x-vector network with openll3 embeddings | 85.00% | wilkinghoff2020 | |
Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation + Between-Class learning | 84.90% | tokozume2017b | |
Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs), fusion with Mel energies | 84.15% | tak2017 | |
Knowledge Transfer from Weakly Labeled Audio using Convolutional Neural Network for Sound Events and Scenes | CNN pretrained on AudioSet | 83.50% | kumar2017 | :scroll: |
Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC | 83.00% | sailor2017 | |
Deep Multimodal Clustering for Unsupervised Audiovisual Learning | CNN + unsupervised audio-visual learning | 82.60% | hu2019 | |
Novel TEO-based Gammatone Features for Environmental Sound Classification | Fusion of GTSC & TEO-GTSC with CNN | 81.95% | agrawal2017 | |
Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + Between-Class learning | 81.80% | tokozume2017b | |
:headphones: Human accuracy | Crowdsourcing experiment in classifying ESC-50 by human listeners | 81.30% | piczak2015a | :scroll: |
Objects that Sound | Look, Listen and Learn (L3) network (arandjelovic2017a) with stride 2, larger batches and learning rate schedule | 79.80% | arandjelovic2017b | |
Look, Listen and Learn | 8-layer convolutional subnetwork pretrained on an audio-visual correspondence task | 79.30% | arandjelovic2017a | |
Learning Environmental Sounds with Multi-scale Convolutional Neural Network | Multi-scale convolutions with feature fusion (waveform + spectrogram) | 79.10% | zhu2018 | |
Novel TEO-based Gammatone Features for Environmental Sound Classification | GTSC with CNN | 79.10% | agrawal2017 | |
Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation | 78.80% | tokozume2017b | |
Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM | 78.45% | sailor2017 | |
Learning from Between-class Examples for Deep Sound Recognition | Baseline CNN (piczak2015b) + Batch Normalization + Between-Class learning | 76.90% | tokozume2017b | |
Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTSC with CNN | 74.85% | agrawal2017 | |
Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) | 74.40% | tokozume2017b | |
Soundnet: Learning sound representations from unlabeled video | 8-layer CNN (raw audio) with transfer learning from unlabeled videos | 74.20% | aytar2016 | :scroll: |
Learning from Between-class Examples for Deep Sound Recognition | 18-layer CNN on raw waveforms (dai2016) + Between-Class learning | 73.30% | tokozume2017b | |
Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs) | 73.25% | tak2017 | |
Classifying environmental sounds using image recognition networks | 16 kHz sampling rate, GoogLeNet on spectrograms (40 ms frame length) | 73.20% | boddapati2017 | :scroll: |
Learning from Between-class Examples for Deep Sound Recognition | Baseline CNN (piczak2015b) + Batch Normalization | 72.40% | tokozume2017b | |
Novel TEO-based Gammatone Features for Environmental Sound Classification | Fusion of MFCC & TEO-GTCC with GMM | 72.25% | agrawal2017 | |
Learning environmental sounds with end-to-end convolutional neural network (EnvNet) | Combination of spectrogram and raw waveform CNN | 71.00% | tokozume2017a | |
Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTCC with GMM | 68.85% | agrawal2017 | |
Classifying environmental sounds using image recognition networks | 16 kHz sampling rate, AlexNet on spectrograms (30 ms frame length) | 68.70% | boddapati2017 | :scroll: |
Very Deep Convolutional Neural Networks for Raw Waveforms | 18-layer CNN on raw waveforms | 68.50% | dai2016, tokozume2017b | :scroll: |
Classifying environmental sounds using image recognition networks | 32 kHz sampling rate, GoogLeNet on spectrograms (30 ms frame length) | 67.80% | boddapati2017 | :scroll: |
WSNet: Learning Compact and Efficient Networks with Weight Sampling | SoundNet 8-layer CNN architecture with 100x model compression | 66.25% | jin2017 | |
Soundnet: Learning sound representations from unlabeled video | 5-layer CNN (raw audio) with transfer learning from unlabeled videos | 66.10% | aytar2016 | :scroll: |
WSNet: Learning Compact and Efficient Networks with Weight Sampling | SoundNet 8-layer CNN architecture with 180x model compression | 65.80% | jin2017 | |
Soundnet: Learning sound representations from unlabeled video | 5-layer CNN trained on raw audio of ESC-50 only | 65.00% | aytar2016 | :scroll: |
:bar_chart: Environmental Sound Classification with Convolutional Neural Networks - CNN baseline | CNN with 2 convolutional and 2 fully-connected layers, mel-spectrograms as input, vertical filters in the first layer | 64.50% | piczak2015b | :scroll: |
auDeep: Unsupervised Learning of Representations from Audio with Deep Recurrent Neural Networks | MLP classifier on features extracted with an RNN autoencoder | 64.30% | freitag2017 | :scroll: |
Classifying environmental sounds using image recognition networks | 32 kHz sampling rate, AlexNet on spectrograms (30 ms frame length) | 63.20% | boddapati2017 | :scroll: |
Classifying environmental sounds using image recognition networks | CRNN | 60.30% | boddapati2017 | :scroll: |
Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 3-layer CNN with vertical filters on wideband mel-STFT (median accuracy) | 56.37% | huzaifah2017 | |
Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 3-layer CNN with square filters on wideband mel-STFT (median accuracy) | 54.00% | huzaifah2017 | |
Soundnet: Learning sound representations from unlabeled video | 8-layer CNN trained on raw audio of ESC-50 only | 51.10% | aytar2016 | :scroll: |
Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 5-layer CNN with square filters on wideband mel-STFT (median accuracy) | 50.87% | huzaifah2017 | |
Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 5-layer CNN with vertical filters on wideband mel-STFT (median accuracy) | 46.25% | huzaifah2017 | |
:bar_chart: Baseline - random forest | Baseline ML approach (MFCC & ZCR + random forest) | 44.30% | piczak2015a | :scroll: |
Soundnet: Learning sound representations from unlabeled video | Convolutional autoencoder trained on unlabeled videos | 39.90% | aytar2016 | :scroll: |
:bar_chart: Baseline - SVM | Baseline ML approach (MFCC & ZCR + SVM) | 39.60% | piczak2015a | :scroll: |
:bar_chart: Baseline - k-NN | Baseline ML approach (MFCC & ZCR + k-NN) | 32.20% | piczak2015a | :scroll: |
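
ESC-50 accuracies like those above are conventionally averaged over the dataset's five predefined folds, training on four folds and testing on the held-out one. A minimal sketch of that protocol; `train_and_predict` is a hypothetical stand-in for whatever model and training loop is being evaluated:

```python
import numpy as np

def cross_validated_accuracy(features, labels, folds, train_and_predict):
    """folds: per-clip fold ids (1..5), as listed in ESC-50's meta/esc50.csv."""
    accuracies = []
    for held_out in np.unique(folds):
        train, test = folds != held_out, folds == held_out
        # train_and_predict is hypothetical: fit on the train split and
        # return predicted class ids for the test split.
        predictions = train_and_predict(features[train], labels[train], features[test])
        accuracies.append(np.mean(predictions == labels[test]))
    return float(np.mean(accuracies))
```
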

### US8K

Title | Notes | Accuracy | Paper | Code |
---|---|---|---|---|
AudioCLIP: Extending CLIP to Image, Text and Audio | incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset | 90.07% | guzhov2021 | :scroll: |
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 87.96% | elizalde2022 | :scroll: |
Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 81.01% | wu2021 | :scroll: |

### VocalSound

Title | Notes | Accuracy | Paper | Code |
---|---|---|---|---|
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 97.95% | elizalde2022 | :scroll: |
Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition | EfficientNetB0 | 90.5% | gong2022 | :scroll: |

### VGGSound

Title | Notes | Accuracy | Paper | Code |
---|---|---|---|---|
Slow-Fast Auditory Streams For Audio Recognition | two-stream convolutional network for audio recognition | 54.4% | kazakos2022 | :scroll: |
Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 46.63% | wu2021 | :scroll: |

## Acoustic Scene Classification

## Audio Captioning

### AudioCaps

Title | Notes | SPIDEr | Paper | Code |
---|---|---|---|---|
Audio Captioning Transformer | Transformer network based on an encoder-decoder architecture | 0.426 | mei2021 | :scroll: |

### Clotho

Title | Notes | SPIDEr | Paper | Code |
---|---|---|---|---|
WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information | two-branch audio encoder for learning temporal and local time-frequency information | 0.182 | tran2020 | :scroll: |
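
SPIDEr, the metric in both captioning tables, is the arithmetic mean of SPICE and CIDEr, so a strong score requires captions that match the references both semantically and in n-gram consensus. A minimal sketch, assuming per-caption SPICE and CIDEr scores are already available (e.g. from a COCO-caption-style evaluation toolkit):

```python
# SPIDEr is defined as the average of SPICE and CIDEr; the example
# values below are made up for illustration.
def spider(spice_score: float, cider_score: float) -> float:
    return 0.5 * (spice_score + cider_score)

print(spider(0.17, 0.68))  # -> 0.425
```
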

## Text to Audio Retrieval

### AudioCaps

Title | Notes | mAP@10 | R@1 | Paper | Code |
---|---|---|---|---|---|
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | | 36.7 | wu2022 | :scroll: |
Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 36.1 | koepke2022 | :scroll: |
Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 49.45 | 34.69 | deshmukh2022 | :scroll: |
On metric learning for audio-text cross-modal retrieval | Metric learning objectives for audio retrieval | | 33.9 | mei2022 | :scroll: |
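
R@1 is the fraction of queries whose top-ranked item is relevant; mAP@10 averages precision over the relevant items found in the top 10 ranks. A minimal sketch of both from a query-by-item similarity matrix, simplified to one relevant item per query (AudioCaps and Clotho actually pair each clip with several captions, which the published numbers account for):

```python
import numpy as np

def recall_at_1(similarity: np.ndarray, relevant: np.ndarray) -> float:
    """relevant[q] is the index of the ground-truth item for query q."""
    return float(np.mean(similarity.argmax(axis=1) == relevant))

def map_at_10(similarity: np.ndarray, relevant: np.ndarray) -> float:
    # With a single relevant item, average precision at 10 reduces to
    # 1/rank if the item appears in the top 10, else 0.
    top10 = np.argsort(-similarity, axis=1)[:, :10]
    scores = []
    for q in range(similarity.shape[0]):
        hits = np.where(top10[q] == relevant[q])[0]
        scores.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(scores))
```

For text-to-audio retrieval the queries are captions and the items are audio clips; the audio-to-text tables below swap the two roles.
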

### Clotho

Title | Notes | mAP@10 | R@1 | Paper | Code |
---|---|---|---|---|---|
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | | 18.2 | wu2022 | :scroll: |
Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 27.12 | 16.75 | deshmukh2022 | :scroll: |
On metric learning for audio-text cross-modal retrieval | Metric learning objectives for audio retrieval | | 14.4 | mei2022 | :scroll: |
Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 6.7 | koepke2022 | :scroll: |

## Audio to Text Retrieval

### AudioCaps

Title | Notes | mAP@10 | R@1 | Paper | Code |
---|---|---|---|---|---|
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | | 46.8 | wu2022 | :scroll: |
Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 30.81 | 41.91 | deshmukh2022 | :scroll: |
On metric learning for audio-text cross-modal retrieval | Metric learning objectives for audio retrieval | | 39.6 | mei2022 | :scroll: |
Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 39.6 | koepke2022 | :scroll: |

### Clotho

Title | Notes | mAP@10 | R@1 | Paper | Code |
---|---|---|---|---|---|
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | CLAP trained on LAION 650k collection with feature fusion and caption augmentation | | 25.7 | wu2022 | :scroll: |
Audio Retrieval with WavText5K and CLAP Training | CLAP training with WavText5K added | 13.65 | 20.00 | deshmukh2022 | :scroll: |
On metric learning for audio-text cross-modal retrieval | Metric learning objectives for audio retrieval | | 16.9 | mei2022 | :scroll: |
Audio Retrieval with Natural Language Queries: A Benchmark Study | MoE, CE and MMT used | | 7.2 | koepke2022 | :scroll: |

## Music Classification

### GTZAN Genres

Title | Notes | Accuracy | Paper | Code |
---|---|---|---|---|
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 91.3% | elizalde2022 | :scroll: |
PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST [HEAR Challenge] | 88.3% | koutini22 | :scroll: |
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | CNN models trained on AudioSet [HEAR Challenge] | 86.0% | kong2019 | :scroll: |
Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP [HEAR Challenge] | 74.8% | wu2021 | :scroll: |

### GTZAN Music Speech

Title | Notes | Accuracy | Paper | Code |
---|---|---|---|---|
CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 100% | elizalde2022 | :scroll: |
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition | CNN models trained on AudioSet [HEAR Challenge] | 99.23% | kong2019 | :scroll: |
PaSST: Efficient Training of Audio Transformers with Patchout | drops out some of the input patches during training of AST [HEAR Challenge] | 97.69% | koutini22 | :scroll: |
Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP [HEAR Challenge] | 94.55% | wu2021 | :scroll: |

## Glossary

- SED: Sound Event Detection
- ASC: Acoustic Scene Classification