dcase2019_task3
dcase2019_task3 copied to clipboard
DCASE 2019 Task 3 Sound Event Localization and Detection
DCASE 2019 Task3 Sound Event Localization and Detection is a task to jointly localize and recognize individual sound events and their respective temporal onset and offset times. More description of this task can be found in http://dcase.community/challenge2019/task-sound-event-localization-and-detection.
DATASET
The dataset can be downloaded from http://dcase.community/challenge2019/task-sound-event-localization-and-detection. The dataset contains 400 audio recordings, one minute long recordings sampled at 48 kHz. Two formats of audio, First-Order Ambisonic (FOA) and microphone array (MIC) are provided for each audio recording. Both of FOA and MIC are 4 channels. Each one minute recording contains 11 synthetic polyphonic sound events.
The statistic of the data is shown below:
| Attributes | Dev. recordings | Eva. recordings | |
|---|---|---|---|
| Data | FOA & MIC, 48 kHz | 400 | - |
The log mel spectrogram of the scenes are shown below:
Run the code
0. Prepare data
Download and upzip the data, the data looks like:
dataset_root ├── metadata_dev (400 files) │ ├── split1_ir0_ov1_10.csv │ └── ... ├── foa_dev (400 files) │ ├── split1_ir0_ov1_10.wav │ └── ... ├── mic_dev (400 files) │ ├── split1_ir0_ov1_10.wav │ └── ... └── ...
1. Requirements
python 3.6 + pytorch 1.0
2. Then simply run:
$ Run the bash script ./runme.sh
Or run the commands in runme.sh line by line. The commands includes:
(1) Modify the paths of dataset and your workspace
(2) Extract features
(3) Train model
(4) Inference
Model
We apply convolutional neural networks using the log mel spectrogram of 4 channels audio as input. The targets are onset and offset times, elevation and azimuth of sound events. To train a CNN with 9 layers and a mini-batch size of 32, the training takes approximately 200 ms / iteration on a single card GTX Titan Xp GPU. The model is trained for 5000 iterations. The training looks like:
Load data time: 90.292 s
Training audio num: 300
Validation audio num: 100
------------------------------------
...
------------------------------------
iteration: 5000
train statistics: total_loss: 0.076, event_loss: 0.007, position_loss: 0.069
Total 10 files written to /vol/vssp/msos/qk/workspaces/dcase2019_task3/_temp/submissions/main/Cnn_9layers_foa_dev_logmel_64frames_64melbins
sed_error_rate : 0.057
sed_f1_score : 0.971
doa_error : 8.902
doa_frame_recall : 0.966
seld_score : 0.042
validate statistics: total_loss: 0.449, event_loss: 0.039, position_loss: 0.409
Total 10 files written to /vol/vssp/msos/qk/workspaces/dcase2019_task3/_temp/submissions/main/Cnn_9layers_foa_dev_logmel_64frames_64melbins
sed_error_rate : 0.206
sed_f1_score : 0.875
doa_error : 33.374
doa_frame_recall : 0.894
seld_score : 0.156
train time: 20.135 s, validate time: 7.023 s
Model saved to /vol/vssp/msos/qk/workspaces/dcase2019_task3/models/main/Cnn_9layers_foa_dev_logmel_64frames_64melbins/holdout_fold=1/md_5000_iters.pth
------------------------------------
...
Results
Validation result on 400 audio files
The 9-layer CNN achieves slightly better results than other CNNs. The baseline system result is from [2], which applies phase information as extra input and obtains better DOA result. Our system only use log mel spectrogram magnitue as input, without using phase as input.
Plot results over different iterations
The 5-layer and 9-layer CNN achieve similar results. The 13-layer CNN tends to overfit.
Visualization the prediction
We are able to predict the DOA only using the log mel spectrogram magnitude as input.
Summary
This codebase provides a convolutional neural network (CNN) for DCASE 2019 challenge Task 3 Sound Event Localization and Detection.
Citation
If this codebase is helpful, please feel free to cite the following paper:
[1] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yong Xu, Wenwu Wang, Mark D. Plumbley. Cross-task learning for audio tagging, sound event detection and spatial localization: DCASE 2019 baseline systems. arXiv preprint arXiv:1904.03476 (2019).
FAQ
If you met running out of GPU memory error, then try to reduce batch_size.
License
File evaluation_tools/cls_feature_class.py is under TUT_LICENSE.
All other files except utils/cls_feature_class.py is under MIT_LICENSE.
External link
[2] https://github.com/sharathadavanne/seld-dcase2019
[3] http://dcase.community/challenge2019/task-audio-tagging