AFEC
AFEC copied to clipboard
Cross platform audio feature extraction and sound classification tool
AFEC
Audio Feature Extraction
AFEC is a cross platform audio feature extraction and sound classification CLI tool written in C++. It analyzes audio files and saves a set of musically interesting audio-features into a sqlite database, which can then be used for other tasks - e.g. to organize sample libraries or to ease finding sounds with specific audio features (key, BPM, sound classes, RMS and so on).
It's a CLI tool only to generate data - it does NOT provide a GUI to view the analyzed audio features or to preview audio files. There are a few basic dash based GUIs available in the AFEC-Visualizers repository though, to debug AFEC's results.
The AFEC Crawler was initially created for the Sononym project. This open sourced version got forked off from the initial release of Sononym at version 1.0. It's not compatible with Sononym's internal sample crawler and will not try to be in future. AFEC was released as an open source project, in the hope to be useful for other audio projects. The original authors of this project are no longer part of the Sononym project.
Sound Classification with Machine Learning
ACEC is using a subset of the analyzed low-level audio-features to evaluate a pretrained, bootstrap aggregated gradient boosted machine (LightGBM) classification model.
There are other experimental classification models in the source tree such as a simple ANN, RBM, SVM, KNN, Naive Bayes and Random Forest Tree implemented in Shark C++, but only the LightGBM model is used in production as pretrained model. So AFEC also can be used to experiment with audio classification.
See ./Scripts/ModelCreator for scripts that train and test various classification models.
There are also various keras / tensorflow experiments in the AFEC-Classifiers repository, which are using the same data than the internal C++ tools.
Download
Prebuilt binaries can be downloaded here.
Usage
Crawler
Usage:
Crawler[.exe] [options] <paths...>
Synopsis:
Recursively search for audio files in the path(s) and write high or low-level
audio features into the given sqlite database.
Options:
-h [ --help ] Show help message.
-v [ --version ] Show version, build number and other infos.
-l [ --level ] arg (=high) Create a 'high' or 'low' level database.
-m [ --model ] arg Specify the 'Classifiers' and 'OneShot-Categories'
model files that should be used for level='high'.
When not specified, the default models from the
crawler's resource dir are used. Set to 'none' to
explicitely avoid loading the a default model -
e.g. --model "None" --model "None" will disable
both.
-o [ --out ] arg Set destination directory/db_name.db or just a
directory. When only a directory is specified, the
database filename will be: 'afec-ll.db' or
'afec.db', depending on the level. When no
directory or file is specified, the database will
be written into the current working dir.
--paths arg One or more paths to a folder or single audio file
which should be analyzed. Can also be passed as
last (positional) argument.
When all given paths are sub paths of the 'out' db
path, all file paths within the database will be
relative to the out dir, else absolute paths.
ModelTester
Usage:
ModelTester[.exe] [options] <input.db>
Synopsis:
Train and evaluate various classification models to see how they perform against
the given input. Input is an AFEC low-level descriptor sqlite database which is
used as train and test set.
Options:
-h [ --help ] Show help message.
-a [ --all ] arg (=0) When enabled, test all models instead of just the
the default model.
-r [ --repeat ] arg (=10) Number of times the test should be repeated.
-s [ --seed ] arg (=-1) Set random seed, if any, in order to replicate tests.
-b [ --bagging ] arg (=0) When enabled, test bagging ensemble models instead
of 'raw' ones.
-i [ --src_database ] arg The low-level descriptor db file to create the
train and test data from. Can also be passed as
last (positional) argument.
ModelCreator
Usage:
ModelCreator[.exe] [options] <input.db>
Synopsis:
Train and evaluate the default classification model which is defined in
`Source/Crawler/Export/DefaultClassificationModel.h` and create an ensemble
model from the best performing ones.
Options:
-h [ --help ] Show help message.
-r [ --repeat ] arg (=8) Number of times to repeat the model creation with
different training set folds, to choose the best
one along all runs.
-s [ --seed ] arg (=-1) Random seed, if any, in order to replicate runs.
-o [ --dest_model ] arg Destination name and path of the resulting model
file.
When not specified, the model file will be written
into the crawler's resource directory.
-i [ --src_database ] arg The low-level descriptor db file to create the
train and test data from. Can also be passed as
last (positional) argument.```
Supported Audio File Formats
AFEC can read and thus analyze the following audio file-formats:
- Waveform Audio File (.wav): Windows, OSX, Linux
- Audio Interchange (.aif, .aiff, .aifc): Windows, OSX, Linux
- Free Lossless Audio Codec (.fla, .flac): Windows, OSX, Linux
- OGG Vorbis (.ogg): Windows, OSX, Linux
- MPEG-1 Audio Layer 2 (.mp2): Windows, OSX, Linux
- MPEG-1 Audio Layer 3 (.mp3): Windows, OSX, Linux
- MPEG-4 Part 14 (.mp4, .mp4a, .m4a): Windows & OSX only
- Core Audio Format (.caf): OSX only
- Windows Media Audio (.wma): Windows only
- NeXT/Sun Audio (.au): Windows & OSX only
- Advanced Audio Coding (.aac): Windows & OSX only
- Apple SouND (.snd): Windows & OSX only
Extracted Audio-Features
Internal analyzation sample rate currently is hardcoded to 44100 Hz.
The FFT Frame Size is 2048 samples.
The FFT Hop Size is 1024 samples.
High-Level Features
High-level features are written in a sqlite database which uses the following column names and types.
The column name ending specifies the data type (except for the first 3 columns):
- S: String
- R: Real number or integer
- VR: Vector of real numbers in JSON format
- VVR: Vector of a vector of real numbers in JSON format
- ...
Filepath and analyzation status
-
filename
(TEXT):
Absolute or relative path from the database path and name of the analyzed file. -
modtime
(INTEGER):
File modification date in time_t units (unix timestamp). -
status
(TEXT):
"succeeded" or some human readable error message, in case the file could not be opened or read.
Filetype info
-
file_type_S
(TEXT):
The file's normalized file extension. -
file_size_R
(INTEGER):
The file's original raw size in bytes. -
file_length_R
(REAL):
Audio stream's total length in seconds. -
file_sample_rate_R
(INTEGER):
Sampling rate in HZ. -
file_channel_count_R
(INTEGER):
Number of audio channels in the file. -
file_bit_depth_R
(INTEGER):
Audio file-format bit depth.
Classification results
-
class_signature_VR
(TEXT):
JSON array of real numbers. Prediction result of the classification model.
Can be used instad of the normalizedclass_strengths_VR
to find similar sounds with a similar class signature. -
classes_VS
(TEXT: JSON_STRING_ARRAY):
JSON array of strings. Name of "strong" predicted classes - strongest ones first. -
class_strengths_VR
(TEXT: JSON_TEXT_ARRAY):
JSON array of real numbers. Normalized, clamped prediction result of the "strong" predicted classes - strongest ones first.
Categorization results
-
category_signature_VR
(TEXT: JSON_NUMBER_ARRAY):
JSON array of real numbers. Prediction result of the categorization model.
Can be used instad of the normalizedcategory_strengths_VR
to find similar sounds with a similar category signature. -
categories_VS
(TEXT: JSON_STRING_ARRAY):
JSON array of strings. Name of "strong" predicted categories - strongest ones first. -
category_strengths_VR
(TEXT: JSON_NUMBER_ARRAY):
JSON array of real numbers. Normalized, clamped prediction result of the "strong" predicted categories - strongest ones first.
Pitch, Peak and BPM
-
base_note_R
(REAL):
Most dominant key note (if any) in the entire file. Should be used in combination withbase_note_confidence_R
only. -
base_note_confidence_R
(REAL):
Normalized detection confidence value of the base note. -
peak_db_R
(REAL):
Peak value accross all channels in dB. -
rms_db_R
(REAL):
RMS value accross all channels in dB. -
bpm_R
(REAL):
Most dominant BPM (if any) in the entire file accross all channels. Should be used in combination withbpm_confidence_R
to be useful. -
bpm_confidence_R
(REAL):
Normalized detection confidence value of the BPM detection.
Sound Characteristics
-
brightness_R
(REAL):
Overal sound's perceived brightness, calculated from the spectral centroid and rolloff. -
noisiness_R
(REAL):
Overal sound's noisiness level, calculated from the spectral flatness. -
harmonicity_R
(REAL):
Overal sound's harmonicity level, calculated from the auto correlation, pitch confidence and spectral flatness.
Spectral Features
-
spectral_flatness_R
(REAL):
Mean of audible low level spectral_flatness_VR (spectral flatness) -
spectral_flux_R
(REAL):
Mean of audible low level spectral_flux_VR (spectral flux) -
spectral_complexity_R
(REAL):
Mean of audible low level spectral_complexity_VR (spectral complexity measure based on a sharpened spectrum) -
spectral_contrast_R
(REAL):
Mean of audible low level spectral_contrast_VR (spectral contrast) -
spectral_inharmonicity_R
(REAL):
Mean of audible low level spectral_inharmonicity_VR (inharmonicity based on a sharpened spectrum)
Spectrum Band Array
-
spectrum_signature_VVR
(TEXT: JSON_NUMBER_ARRAY_ARRAY):
JSON array of an array of real numbers. 14 bands for 64 time frames (resampled), which can be used an iconic signature alike view of the entire audio file's spectrum.
The 14 spectral bands end at 50, 100, 200, 400, 630, 920, 1270, 1720, 2320, 3150, 4400, 6400.0, 9500, 15500 HZ
Pitch Array
-
pitch_VR
(TEXT: JSON_NUMBER_ARRAY):
JSON array of real numbers. Cleaned pitch note values for for each fft time frame. -
pitch_confidence_R
(REAL):
Mean value of all pitch note value detection confidences.
Peak Array
-
peak_VR
(TEXT: JSON_NUMBER_ARRAY):
JSON array of real numbers. Peak value in dB for for each fft time frame.
Low-Level Features
Low-level features are written in a sqlite database which uses the following column names and types.
Just like for high-level features, the column name ending specifies the data type.
_VVR columns in the database are saved as binary mspack blobs, to save disk space.
Note: All vector features contain the following additional statistical features as well:
min
, max
, median
, mean
, gmean
(geographic mean), variance
, centroid
, spread
, skewness
, kurtosis
, flatness
, dmean
, dvariance
(1st deviation)
Filepath and analyzation status
-
filename
(absolute or relative path to the analyzed file) -
modtime
(file modification date in unix timestamps) -
status
("succeeded" or some human readable error message)
Filetype info
-
file_type_S
(normalized file extension) -
file_size_R
(bytes) -
file_length_R
(seconds) -
file_sample_rate_R
(Hz) -
file_channel_count_R
-
file_bit_depth_R
Effective Length (skipping leading & trailing silence)
-
effectve_length_48dB_R
(gate > 48dB) -
effectve_length_24dB_R
-
effectve_length_12dB_R
Amplitude
-
amplitude_silence_VR
(1 for silence, 0 for non silence) -
amplitude_peak_VR
-
amplitude_rms_VR
-
amplitude_envelope_VR
Spectral Features
-
spectral_rms_VR
(spectral rms) -
spectral_centroid_VR
(spectral centroid) -
spectral_spread_VR
(spectral spread) -
spectral_skewness_VR
(spectral skewness) -
spectral_kurtosis_VR
(spectral kurtosis) -
spectral_flatness_VR
(spectral flatness) -
spectral_rolloff_VR
(spectral rolloff) -
spectral_flux_VR
(spectral flux) -
spectral_inharmonicity_VR
(inharmonicity based on a sharpened spectrum) -
spectral_complexity_VR
(spectral complexity measure based on a sharpened spectrum) -
spectral_contrast_VR
(spectral contrast)
Fundamental Frequency
-
f0_VR
(in Hz for each FFT frame) -
f0_confidence_VR
(0-1 for each F0) -
failsafe_f0_VR
(falling back to last stable F0)
Tristimulus
-
tristimulus1_VR
(mixture of harmonics, timbre based on the F0 detection) -
tristimulus2_VR
-
tristimulus3_VR
auto correlation
-
auto_correlation_VR
Onsets (tuned for mixed/tonal sounds)
-
rhythm_complex_onsets_VR
(onset value for each fft frame) -
rhythm_complex_onset_count_R
(number of detected onsets) -
rhythm_complex_onset_contrast_R
-
rhythm_complex_onset_frequency_mean_R
-
rhythm_complex_onset_strength_R
(overall strength)
Onsets (tuned for percussive sounds)
-
rhythm_percussive_onsets_VR
(see rhythm_complex) -
rhythm_percussive_onset_count_R
-
rhythm_percussive_onset_contrast_R
-
rhythm_percussive_onset_frequency_mean_R
-
rhythm_percussive_onset_strength_R
Tempo
-
rhythm_complex_tempo_R
(BPM) -
rhythm_complex_tempo_confidence_R
(0-1) -
rhythm_percussive_tempo_R
-
rhythm_percussive_tempo_confidence_R
-
rhythm_final_tempo_R
-
rhythm_final_tempo_confidence_R
Spectrum band features (14 bands)
-
spectral_rms_bands_VVR
(14 RMS values for every band - see also Spectral Features) -
spectral_flatness_bands_VVR
-
spectral_flux_bands_VVR
-
spectral_complexity_bands_VVR
-
spectral_contrast_bands_VVR
Spectrum (28 bands)
-
frequency_bands_VVR
(50, 100, 150, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500, 19000, 22050 Hz)
Cepstrum (14 bands)
-
cepstrum_bands_VVR
(MFCC values)
Build
Dependencies
Windows
- cmake 2.8 or later
- git-lfs (download at https://git-lfs.github.com/)
- VisualStudio 2015 or later with C++ support (C++14)
OSX
- cmake 2.8 or later
- git-lfs (download at https://git-lfs.github.com/)
- OSX 10.11 or later
- XCode 7 or later with OSX 10.11 SDK and command line tools installed
Linux
- cmake 2.8 or later (
apt-get install cmake
on Ubuntu) - ninja build system (
apt-get install ninja-build
on Ubuntu) - git-lfs (
apt-get install git-lfs
on Ubuntu) - gcc-7.4 (ubuntu 18.04's default compiler,
apt-get install build-essentials
on Ubuntu). - PkgConfig and Threads (usually already installed)
- libmpg123 headers and library (
apt-get install libmpg123-dev
on Ubuntu)
Third-Party Libraries
AFEC uses the following third-party libraries, which are bundled in the 3rdParty
folder, including precompiled static libraries for Windows (Visual C++) OSX (Clang) and Linux (GCC). Note: if you're trying to build AFEC on Linux with gcc-8 or later, you may get linker errors and then need to recompile a few of the C++ third party libraries. There are build scripts in the Linux/
sub folders in each third party library to do so.
Sound Classification:
- SharkC++: Used for various classification test models and for the model ensemble generation.
- LightGBM: The default classification model.
- TinyDNN: DNN experiments (should be removed).
Audio Feature Extraction:
- Aubio: For pitch/key detection and partly for BPM detection.
- LibXtract: To calculate Mel Frequency Cepstral Coefficients.
Audio Files:
- Resample: Normalize sample rates of analyzed audio files.
- Flac: Flac file decoding.
- OGG & OGG Vorbis: OGG file decoding.
- Mpg123: Mpg file decoding on Linux.
Various:
- Boost: Dependency of SharkC++ and used in various places internally.
- Sqlite: SQLite database support.
- ZLib: Dependency of Sqlite.
- OpenBLAS: Used on Windows as dependency of SharkC++
- CTPL: Enabled multi-processing in the crawler via a thread pool
- Msgpack: Optionally packing of JSON in sqlite database (disabled).
- Iconv: Unicode string UTF8 and platform encoding.
How to Build
The precompiled 3rd party libraries are stored via git lfs, so please ensure the lfs files are checked out:
git lfs pull
then go to ./Build
and run:
./Build/build.sh|bat
The resulting Visual Studio (Windows), XCode (OSX) or Makefiles (Linux) files can then be found at ./Build/Out
.
After building, the produced binaries can be found at ./Dist/[Debug|Release]
.
Authors
AFEC was originally created by Eduard Müller and Ingolf Wagner
License
GNU General Public License v3.0 or later See COPYING to see the full text.
The bundled third-party libraries may use different licenses. Please have a look at the 3rdParty
folder to see which ones.
Contributing
Patches are welcome: please fork the latest git repository and create a feature branch.