| BEATs: Audio Pre-Training with Acoustic Tokenizers | Iterative audio pre-training framework for learning bidirectional encoder representations from audio transformers | 98.1% | chen22 | :scroll: |
| Masked Autoencoders that Listen | Image-based MAE for audio spectrograms | 97.4% | huang2022 | :scroll: |
| HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules | 97.00% | chen2022 | :scroll: |
| PaSST: Efficient Training of Audio Transformers with Patchout | Drops out some of the input patches during AST training | 96.8% | koutini22 | :scroll: |
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 96.70% | elizalde2022 | :scroll: |
| AST: Audio Spectrogram Transformer | Pure attention model pretrained on AudioSet | 95.70% | gong2021 | :scroll: |
| Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer | Transformer model pretrained with visual image supervision | 95.70% | zhao2022 | :scroll: |
| A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition | Multi-stage sequential learning with knowledge transfer from AudioSet | 94.10% | kumar2020 | |
| Efficient End-to-End Audio Embeddings Generation for Audio Classification on Target Applications | CNN model pretrained on AudioSet | 92.32% | lopez-meyer2021 | |
| Urban Sound Tagging using Multi-Channel Audio Feature with Convolutional Neural Networks | Pretrained model with multi-channel features | 89.50% | kim2020 | :scroll: |
| An Ensemble of Convolutional Neural Networks for Audio Classification | CNN ensemble with data augmentation | 88.65% | nanni2020 | :scroll: |
| Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices | CNN model (ACDNet) with potential compression | 87.1% | mohaimenuzzaman2021 | :scroll: |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC and mel energies | 86.50% | sailor2017 | |
| Wav2CLIP: Learning Robust Audio Representations From CLIP | Distilling from CLIP | 85.95% | wu2021 | :scroll: |
| AclNet: efficient end-to-end audio classification CNN | CNN with mixup and data augmentation | 85.65% | huang2018 | |
| On Open-Set Classification with L3-Net Embeddings for Machine Listening Applications | x-vector network with OpenL3 embeddings | 85.00% | wilkinghoff2020 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation + Between-Class learning | 84.90% | tokozume2017b | |
| Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs), fusion with Mel energies | 84.15% | tak2017 | |
| Knowledge Transfer from Weakly Labeled Audio using Convolutional Neural Network for Sound Events and Scenes | CNN pretrained on AudioSet | 83.50% | kumar2017 | :scroll: |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM + fusion with GTSC | 83.00% | sailor2017 | |
| Deep Multimodal Clustering for Unsupervised Audiovisual Learning | CNN + unsupervised audio-visual learning | 82.60% | hu2019 | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | Fusion of GTSC & TEO-GTSC with CNN | 81.95% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + Between-Class learning | 81.80% | tokozume2017b | |
| :headphones: Human accuracy | Crowdsourcing experiment in classifying ESC-50 by human listeners | 81.30% | piczak2015a | :scroll: |
| Objects that Sound | Look, Listen and Learn (L3) network (arandjelovic2017a) with stride 2, larger batches and learning rate schedule | 79.80% | arandjelovic2017b | |
| Look, Listen and Learn | 8-layer convolutional subnetwork pretrained on an audio-visual correspondence task | 79.30% | arandjelovic2017a | |
| Learning Environmental Sounds with Multi-scale Convolutional Neural Network | Multi-scale convolutions with feature fusion (waveform + spectrogram) | 79.10% | zhu2018 | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | GTSC with CNN | 79.10% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) + data augmentation | 78.80% | tokozume2017b | |
| Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification | CNN with filterbanks learned using convolutional RBM | 78.45% | sailor2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | Baseline CNN (piczak2015b) + Batch Normalization + Between-Class learning | 76.90% | tokozume2017b | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTSC with CNN | 74.85% | agrawal2017 | |
| Learning from Between-class Examples for Deep Sound Recognition | EnvNet-v2 (tokozume2017a) | 74.40% | tokozume2017b | |
| SoundNet: Learning sound representations from unlabeled video | 8-layer CNN (raw audio) with transfer learning from unlabeled videos | 74.20% | aytar2016 | :scroll: |
| Learning from Between-class Examples for Deep Sound Recognition | 18-layer CNN on raw waveforms (dai2016) + Between-Class learning | 73.30% | tokozume2017b | |
| Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification | CNN working with phase encoded mel filterbank energies (PEFBEs) | 73.25% | tak2017 | |
| Classifying environmental sounds using image recognition networks | 16 kHz sampling rate, GoogLeNet on spectrograms (40 ms frame length) | 73.20% | boddapati2017 | :scroll: |
| Learning from Between-class Examples for Deep Sound Recognition | Baseline CNN (piczak2015b) + Batch Normalization | 72.40% | tokozume2017b | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | Fusion of MFCC & TEO-GTCC with GMM | 72.25% | agrawal2017 | |
| Learning environmental sounds with end-to-end convolutional neural network (EnvNet) | Combination of spectrogram and raw waveform CNN | 71.00% | tokozume2017a | |
| Novel TEO-based Gammatone Features for Environmental Sound Classification | TEO-GTCC with GMM | 68.85% | agrawal2017 | |
| Classifying environmental sounds using image recognition networks | 16 kHz sampling rate, AlexNet on spectrograms (30 ms frame length) | 68.70% | boddapati2017 | :scroll: |
| Very Deep Convolutional Neural Networks for Raw Waveforms | 18-layer CNN on raw waveforms | 68.50% | dai2016, tokozume2017b | :scroll: |
| Classifying environmental sounds using image recognition networks | 32 kHz sampling rate, GoogLeNet on spectrograms (30 ms frame length) | 67.80% | boddapati2017 | :scroll: |
| WSNet: Learning Compact and Efficient Networks with Weight Sampling | SoundNet 8-layer CNN architecture with 100x model compression | 66.25% | jin2017 | |
| SoundNet: Learning sound representations from unlabeled video | 5-layer CNN (raw audio) with transfer learning from unlabeled videos | 66.10% | aytar2016 | :scroll: |
| WSNet: Learning Compact and Efficient Networks with Weight Sampling | SoundNet 8-layer CNN architecture with 180x model compression | 65.80% | jin2017 | |
| SoundNet: Learning sound representations from unlabeled video | 5-layer CNN trained on raw audio of ESC-50 only | 65.00% | aytar2016 | :scroll: |
| :bar_chart: Environmental Sound Classification with Convolutional Neural Networks - CNN baseline | CNN with 2 convolutional and 2 fully-connected layers, mel-spectrograms as input, vertical filters in the first layer | 64.50% | piczak2015b | :scroll: |
| auDeep: Unsupervised Learning of Representations from Audio with Deep Recurrent Neural Networks | MLP classifier on features extracted with an RNN autoencoder | 64.30% | freitag2017 | :scroll: |
| Classifying environmental sounds using image recognition networks | 32 kHz sampling rate, AlexNet on spectrograms (30 ms frame length) | 63.20% | boddapati2017 | :scroll: |
| Classifying environmental sounds using image recognition networks | CRNN | 60.30% | boddapati2017 | :scroll: |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 3-layer CNN with vertical filters on wideband mel-STFT (median accuracy) | 56.37% | huzaifah2017 | |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 3-layer CNN with square filters on wideband mel-STFT (median accuracy) | 54.00% | huzaifah2017 | |
| SoundNet: Learning sound representations from unlabeled video | 8-layer CNN trained on raw audio of ESC-50 only | 51.10% | aytar2016 | :scroll: |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 5-layer CNN with square filters on wideband mel-STFT (median accuracy) | 50.87% | huzaifah2017 | |
| Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks | 5-layer CNN with vertical filters on wideband mel-STFT (median accuracy) | 46.25% | huzaifah2017 | |
| :bar_chart: Baseline - random forest | Baseline ML approach (MFCC & ZCR + random forest) | 44.30% | piczak2015a | :scroll: |
| SoundNet: Learning sound representations from unlabeled video | Convolutional autoencoder trained on unlabeled videos | 39.90% | aytar2016 | :scroll: |
| :bar_chart: Baseline - SVM | Baseline ML approach (MFCC & ZCR + SVM) | 39.60% | piczak2015a | :scroll: |
| :bar_chart: Baseline - k-NN | Baseline ML approach (MFCC & ZCR + k-NN) | 32.20% | piczak2015a | :scroll: |