A how-to is needed for working with Loudness (and other algos)
Could someone pls suggest 'the right way' of how Loudness is supposed to work?
I've tried 4 different combos of EasyLoader with Loudness and none of them work, same with MonoLoader.
# --- from essentia.standard import EasyLoader
# --- from essentia.streaming import Loudness
# OR
# --- from essentia.standard import EasyLoader, Loudness
[ INFO ] MusicExtractorSVM: no classifier models were configured by default
Traceback (most recent call last):
File "loudness.py", line 19, in <module>
loader.audio >> loudness.signal
AttributeError: 'Algo' object has no attribute 'audio'
------------------------------------------------------------------------------------------
# --- from essentia.streaming import EasyLoader
# --- from essentia.standard import Loudness
Traceback (most recent call last):
File "loudness.py", line 20, in <module>
loader.audio >> loudness.signal
AttributeError: 'Algo' object has no attribute 'signal'
------------------------------------------------------------------------------------------
# --- from essentia.streaming import EasyLoader, Loudness
Traceback (most recent call last):
File "loudness.py", line 19, in <module>
loader.audio >> loudness.signal
File "/home/airflow/.local/lib/python3.7/site-packages/essentia/streaming.py", line 60, in __rshift__
right.input_algo, right.name)
TypeError: While connecting EasyLoader::audio to Loudness::signal:
Error when checking types. Expected: std::vector<Real>, received: Real
The code is super simple, just along the lines of:
import sys
import essentia
from essentia.streaming import EasyLoader, Loudness
if len(sys.argv) == 2:
    infile = sys.argv[1]
else:
    print("usage: %s <input audio file>" % sys.argv[0])
    sys.exit()
# initialize algorithms we will use
loader = EasyLoader(filename=infile)
loudness = Loudness()
# use pool to store data
pool = essentia.Pool()
loader.audio >> loudness.signal
loudness.loudness >> (pool, "loudness")
# network is ready, run it
essentia.run(loader)
print("loudness : " + pool["loudness"])
I'm running essentia==2.1b6.dev858. The input WAV file is attached. Thanks.
Hi @dgoldenberg-audiomack,
The problem is that Loudness expects a stream of vector_real instead of real, which is the output of EasyLoader. You can use RealAccumulator to compute the loudness of the entire signal at the end of the stream.
import sys
import essentia
from essentia.streaming import EasyLoader, Loudness, RealAccumulator
if len(sys.argv) == 2:
    infile = sys.argv[1]
else:
    print("usage: %s <input audio file>" % sys.argv[0])
    sys.exit()
# initialize algorithms we will use
loader = EasyLoader(filename=infile)
loudness = Loudness()
accumulator = RealAccumulator()
# use pool to store data
pool = essentia.Pool()
loader.audio >> accumulator.data
accumulator.array >> loudness.signal
loudness.loudness >> (pool, "loudness")
# network is ready, run it
essentia.run(loader)
print("loudness : ", pool["loudness"])
Alternatively, you can check our Python example using LoudnessEBUR128, which provides a loudness estimation that is more correlated with human perception and is widely used in the audio/music industry.
Hi @palonso,
Thanks for your quick, comprehensive response. A noob to Essentia here :)
The problem is that Loudness expects a stream of vector_real instead of real, which is the output of EasyLoader.
Understood; had similar issues with MonoLoader. As a novice, I would just make a suggestion, which is that the framework and its doc set are rather vast, and finding the relevant usable sample code is not always easy and apparent.
For example, if you're just looking at https://essentia.upf.edu/reference/std_Loudness.html, it doesn't have a link to a quick useful example of the kind that you just provided. The same seems true of other doc pages under https://essentia.upf.edu/reference/ as well. So I'd venture a proposal to add coding snippets on all reference doc pages.
check our Python example using LoudnessEBUR128 which provides a loudness estimation that is more correlated with human perception and is widely used in the audio/music industry.
Thank you for that reference. Looking at the outputs of that algo:
- momentaryLoudness (vector_real) - momentary loudness (over 400ms) (LUFS)
- shortTermLoudness (vector_real) - short-term loudness (over 3 seconds) (LUFS)
- integratedLoudness (real) - integrated loudness (overall) (LUFS)
- loudnessRange (real) - loudness range over an arbitrary long time interval [3] (dB, LU)
If I wanted to come up with a single metric of how loud a music sample is, would you recommend that I pick one of these metrics? I.e., how does one tell, by looking at these outputs, whether something is loud, quiet, or in between? E.g., a metric from 0 to 10, with 10 being 'definitely loud'?
Thanks!
@dgoldenberg-audiomack thank you very much for the feedback!
Regarding your question, integratedLoudness is a single value suitable for your purpose.
Assuming that your music is normalized to full scale (i.e., -1/1 range), values lower than -16/-20 LUFS could be considered as quiet, and values higher than -7/-6 are definitely very loud.
Note that if gain normalization is applied to your music before the loudness calculation (such as done by EasyLoader) the loudness estimations are not reliable.
Thanks @palonso,
Assuming that your music is normalized to full scale (i.e., -1/1 range)
Could you provide an example or point me at a snippet which performs this type of normalization?
Note that if gain normalization is applied to your music before the loudness calculation (such as done by EasyLoader) the loudness estimations are not reliable.
Currently we're not yet applying any normalizations. If normalization is not applied, would EasyLoader's estimations become more reliable?
Do I understand correctly that your general recommendation is that we use LoudnessEBUR128? This algo sounds like a much stronger approach, IIUC.
Could you provide an example or point me at a snippet which performs this type of normalization?
Sure, using numpy:
normalized_audio = audio / np.max(np.abs(audio))
For additional context, this is sometimes referred as peak normalization.
Currently we're not yet applying any normalizations. If normalization is not applied, would EasyLoader's estimations become more reliable?
Generally yes, but this depends a bit on your source of audio. If you are working with professionally mastered music, not applying any normalization should be fine. If you also consider processing music that is not professionally mastered, or that you suspect that its gain could have been trimmed, I would recommend applying peak normalization before estimating loudness.
Do I understand correctly that your general recommendation is that we use LoudnessEBUR128?
Right, especially if your goal is to make a perceptual estimation of loudness.
Thanks, @palonso.
When processing data in bulk, is it possible to reuse objects such as the loaders and the algos? Are they thread-safe?
Yes, you can reuse the algorithms. Following the example:
files = ["file_1", "file_2"]
# initialize algorithms we will use
loader = EasyLoader()
loudness = Loudness()
accumulator = RealAccumulator()
# use pool to store data
pool = essentia.Pool()
loader.audio >> accumulator.data
accumulator.array >> loudness.signal
loudness.loudness >> (pool, "loudness")
# network is ready, run it
for infile in files:
    pool.clear()
    essentia.reset(loader)
    loader.configure(filename=infile)
    essentia.run(loader)
    print("loudness : ", pool["loudness"])
However, Essentia is not thread-safe, so you should use separate processes to parallelize your bulk analysis.
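For example, a minimal sketch of process-based parallelism (the worker builds its own network per call; the file names and process count are placeholders):

import multiprocessing
import essentia
from essentia.streaming import EasyLoader, Loudness, RealAccumulator

def compute_loudness(infile):
    # each process builds its own network; algorithms are not shared across processes
    loader = EasyLoader(filename=infile)
    loudness = Loudness()
    accumulator = RealAccumulator()
    pool = essentia.Pool()
    loader.audio >> accumulator.data
    accumulator.array >> loudness.signal
    loudness.loudness >> (pool, "loudness")
    essentia.run(loader)
    return pool["loudness"]

if __name__ == "__main__":
    files = ["file_1.wav", "file_2.wav"]  # placeholder file names
    with multiprocessing.Pool(processes=2) as workers:
        print(workers.map(compute_loudness, files))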
Perfect, thank you, @palonso!
Hi @palonso,
I'm looking for similar sample snippets for the following:
Dissonance: Inputs
- frequencies (vector_real) - the frequencies of the spectral peaks (must be sorted by frequency)
- magnitudes (vector_real) - the magnitudes of the spectral peaks (must be sorted by frequency)
Where do the frequencies come from?
And how do I get the magnitudes? Looking here, the algorithm requires an input: complex (vector_complex) - the input vector of complex numbers. Where can those come from?
BPM https://essentia.upf.edu/reference/streaming_RhythmExtractor2013.html Inputs
- signal (real) - input signal
Is it OK to just do a
loader.audio >> bpm.signal?
Key
This takes pcp (vector_real) - the input pitch class profile
Would we need to use HPCP? That one also needs frequencies and magnitudes; I'd like some clarity on how to get those (similar to the case with Dissonance).
I would appreciate your help.
Dissonance
Dissonance expects frequencies and magnitudes as output from SpectralPeaks.
This requires computing the spectrum in a frame-wise manner and extracting the peaks from each frame.
The algorithm chain would be: EasyLoader/MonoLoader >> FrameCutter >> Windowing >> Spectrum >> SpectralPeaks >> Dissonance
You can find tested parametrizations of the algorithms in the unit tests, for example this one.
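A minimal streaming-mode sketch of that chain could look like the following (the file path is a placeholder and the parameter values are only illustrative, not a recommendation):

import essentia
from essentia.streaming import (EasyLoader, FrameCutter, Windowing, Spectrum,
                                SpectralPeaks, Dissonance)

loader = EasyLoader(filename="audio.wav")  # placeholder path
framecutter = FrameCutter(frameSize=4096, hopSize=512)
windowing = Windowing(type="blackmanharris62")
spectrum = Spectrum(size=4096)
peaks = SpectralPeaks(orderBy="frequency")  # Dissonance expects peaks sorted by frequency
dissonance = Dissonance()
pool = essentia.Pool()

loader.audio >> framecutter.signal
framecutter.frame >> windowing.frame
windowing.frame >> spectrum.frame
spectrum.spectrum >> peaks.spectrum
peaks.frequencies >> dissonance.frequencies
peaks.magnitudes >> dissonance.magnitudes
dissonance.dissonance >> (pool, "dissonance")

essentia.run(loader)
print("mean dissonance:", pool["dissonance"].mean())  # per-frame values, averaged over frames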
BPM
Is it OK to just do a loader.audio >> bpm.signal ?
Yes. Just remember to keep the sample rate at 44100 (default in MonoLoader/AudioLoader).
Key
Have a look at KeyExtractor. This is a wrapper for Key that takes audio as input and does all the required steps.
loader.audio >> key_extractor.audio
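A minimal streaming sketch, assuming the streaming-mode KeyExtractor and a placeholder file path:

import essentia
from essentia.streaming import MonoLoader, KeyExtractor

loader = MonoLoader(filename="audio.wav")  # placeholder path
key_extractor = KeyExtractor()
pool = essentia.Pool()

loader.audio >> key_extractor.audio
key_extractor.key >> (pool, "key")
key_extractor.scale >> (pool, "scale")
key_extractor.strength >> (pool, "strength")

essentia.run(loader)
print(pool["key"], pool["scale"], pool["strength"])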
Thanks much for these pointers, @palonso. A side note on these algos: I'm noticing that some algos have multiple outputs, such as RhythmExtractor2013, for example. If I'm only interested in the bpm value, I'm still 'forced' to connect/extract the rest of the outputs, otherwise I get something like this:
RuntimeError: RhythmExtractor2013::ticks is not connected to any sink...
I wonder if it may be of benefit to allow the caller to not connect some of the outputs to sinks?
Hi @palonso
You can find tested parametrizations of the algorithms in the unit tests, for example this one.
If I want to extract dissonance for a wide variety of audio files that I might not know much about upfront, would the parameters used in that test work reasonably well across the board? I mean all the params here:
fc = FrameCutter(frameSize=4096, hopSize=512)
windower = Windowing(type='blackmanharris62')
specAlg = Spectrum(size=4096)
sPeaksAlg = SpectralPeaks(sampleRate=sampleRate,
                          maxFrequency=sampleRate / 2,
                          minFrequency=0,
                          orderBy='frequency')
Also, the algo doc says that I can grab the dissonance as the output of Dissonance. The test computes "the average dissonance over all frames of audio". I'm wondering if for my purposes I can just stick with Dissonance.dissonance? The average seems to be computed just to make sure the output is in the ballpark, correct?
Elsewhere I see samples where some of these algos are used with defaults, e.g.
framecutter = FrameCutter()
windowing = Windowing(type="blackmanharris62")
spectrum = Spectrum()
spectralpeaks = SpectralPeaks(
    orderBy="magnitude", magnitudeThreshold=1e-05, minFrequency=40, maxFrequency=5000, maxPeaks=10000
)
Here, the FrameCutter uses its defaults, and the SpectralPeaks parameters are set differently.
I'm looking for a way to make this very generic because, as I mentioned, I don't know much about the files upfront. However, if there is a way to optimize the parameters first, based on the file, then I'd love to add that, if you have any recommendations. Although simpler seems better for now.
I wonder if it may be of benefit to allow the caller to not connect some of the outputs to sinks?
You can discard an algorithm's output like this: algorithm.output >> None
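For instance, a sketch of the RhythmExtractor2013 case mentioned above, keeping only the BPM and discarding the other outputs (the file path is a placeholder):

import essentia
from essentia.streaming import MonoLoader, RhythmExtractor2013

loader = MonoLoader(filename="audio.wav")  # placeholder; keep the default 44100 Hz sample rate
rhythm = RhythmExtractor2013()
pool = essentia.Pool()

loader.audio >> rhythm.signal
rhythm.bpm >> (pool, "bpm")
rhythm.ticks >> None
rhythm.confidence >> None
rhythm.estimates >> None
rhythm.bpmIntervals >> None

essentia.run(loader)
print("bpm:", pool["bpm"])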
I'm wondering if for my purposes I can just stick with Dissonance.dissonance? The average seems to be computed just to make sure the output is in the ballpark, correct?
This is entirely up to your use case. In many cases it makes sense to work with track-averaged values, for example, to compare values between songs with different lengths.
I haven't experimented with this algorithm personally, so I can't recommend you a set of parameters. You could optimize the algorithm's parameters by annotating a small dataset with expected dissonance values yourself and trying different combinations of parameters to see which one correlates better with your annotations.
Hi @palonso, thanks for your reply. I'm experimenting with LoudnessEBUR128.
Here's what I've got so far:
loader = AudioLoader(filename=infile)
loudness_e = LoudnessEBUR128(startAtZero=True)
pool = essentia.Pool()
loader.audio >> loudness_e.signal
loader.sampleRate >> None
loader.numberChannels >> None
loader.md5 >> None
loader.bit_rate >> None
loader.codec >> None
loudness_e.integratedLoudness >> (pool, "integrated_loudness")
loudness_e.momentaryLoudness >> None
loudness_e.shortTermLoudness >> None
loudness_e.loudnessRange >> None
essentia.run(loader)
Questions:
- I noticed that AudioLoader outputs a sampleRate and LoudnessEBUR128 has sampleRate as one of its parameters. With the way the code is written so far, would the sample rate automatically propagate into LoudnessEBUR128? If not, how could I pass it from the loader to the extractor algo?
- Would you recommend the default value 0.1 for the hop size?
- For peak normalization, we've discussed the following:
normalized_audio = audio / np.max(np.abs(audio))
How would I wire this into the 'network'? This transform as is seems to just cause me runtime errors.
Also, as far as the "perceptual loudness" assessment:
values lower than -16/-20 LUFS could be considered as quiet, and values higher than -7/-6 are definitely very loud.
So if we were to use a "T-shirt" approach to discern very quiet, quiet, regular loudness, loud, and very loud, what would a good mapping be?
| loudness type | value range start | value range end |
|---|---|---|
| very quiet | ? | ? |
| quiet | ? | -16 |
| normal | ? | ? |
| loud | ? | ? |
| very loud | -6 | ? |
Hi @palonso sorry to bombard you with questions :)
I'm looking into a few tensorflow-based algos. I've done a pip install essentia-tensorflow but keep getting errors such as this one:
Traceback (most recent call last):
File "danceability.py", line 2, in <module>
from essentia.standard import MonoLoader, TensorflowPredictMusiCNN
ImportError: cannot import name 'TensorflowPredictMusiCNN' from 'essentia.standard' (/home/.local/lib/python3.7/site-packages/essentia/standard.py)
Code:
import sys
from essentia.standard import MonoLoader, TensorflowPredictMusiCNN
if len(sys.argv) == 2:
infile = sys.argv[1]
else:
print("usage: %s <input audio file>" % sys.argv[0])
sys.exit()
audio = MonoLoader(filename=infile, sampleRate=16000)()
model = TensorflowPredictMusiCNN(graphFilename="danceability-musicnn-msd-2.pb")
predictions = model(audio)
Versions:
essentia 2.1b6.dev1034
essentia-tensorflow 2.1b6.dev1034
Any ideas as to what might be missing? Thanks
Questions:
I noticed that AudioLoader outputs a sampleRate and LoudnessEBUR128 has sampleRate as one of its parameters. With the way the code is written so far, would the sample rate automatically propagate into LoudnessEBUR128? If not, how could I pass it from the loader to the extractor algo?
Please take a look at my answer below.
Would you recommend the default value 0.1 for the hop size?
Yes, we normally compute loudness with the default hop size.
How would I wire this into the 'network'? This transform as is seems to just cause me runtime errors.
To peak-normalize the audio you need to have access to the entire signal before processing. Thus, streaming mode is not the most suitable paradigm. Alternatively, you could:
- Define all algorithms in standard mode.
- Load audio and read the sample rate.
- Normalize the audio using numpy as mentioned.
- Reconfigure LoudnessEBUR128 according to the audio sample rate (algorithm.configure(sampleRate=sr)).
- Compute loudness (see the sketch below).
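A minimal standard-mode sketch along those lines (the file path is a placeholder, and the unused loader outputs are simply ignored):

import numpy as np
import essentia.standard as es

# load stereo audio and its sample rate in standard mode
audio, sr, _, _, _, _ = es.AudioLoader(filename="audio.wav")()
# peak normalization, as discussed above
audio = audio / np.max(np.abs(audio))
# reconfigure LoudnessEBUR128 with the loader's sample rate, then compute
loudness = es.LoudnessEBUR128(sampleRate=sr, startAtZero=True)
momentary, short_term, integrated, loudness_range = loudness(audio)
print("integrated loudness (LUFS):", integrated)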
So if we were to use a "T-shirt" approach to discern very quiet, quiet, regular loudness, loud, and very loud, what would a good mapping be?
I can't provide an answer since I'm not an expert on loudness metering. However, since the EBU R128 standard is widely used in the industry, you should be able to find several resources related to your question online.
I'm looking into a few tensorflow-based algos. I've done a pip install essentia-tensorflow but keep getting errors such as this one:
According to your error, Python is loading essentia and not essentia-tensorflow. Since the packages are not complementary you should make sure that essentia-tensorflow is the one being loaded. You can achieve this by removing essentia and reinstalling essentia-tensorflow:
pip uninstall essentia
pip uninstall essentia-tensorflow
pip install essentia-tensorflow
That makes sense, @palonso, thank you. I'll look into EBU R128.
Working with essentia-tensorflow, I've run into a few issues.
1. Model file names. There are discrepancies between the model file names in the doc vs. the files the package apparently supports, going by this location: https://essentia.upf.edu/models/. For example, for Danceability, one of the samples references danceability-musicnn-msd-2.pb but the actual file name appears to be danceability-msd-musicnn-1.pb (?)
2. Model file handling. At runtime, the model files are not found. I get the following type of error: RuntimeError: Error while configuring TensorflowPredictMusiCNN: TensorflowPredict: could not open the Tensorflow graph file. Is this intentional in the package, to make the user cherry-pick the model files they need from the model file repository rather than bloat by downloading all of them? However, if we're to pluck them out, we'd need to maintain them in our codebase somewhere, which would bloat the codebase size, and if there are updates, we may or may not get them in time. What's the intended usage pattern here for the model files?
3. Adding to item 2: going by your comment in https://github.com/MTG/essentia/issues/1313, "Can you download the model, place it in the same folder as your script...", placing the model next to the py file fixes the model-not-found problem. Ideally, I'd like to avoid keeping the model files in the codebase. Any recommendation? If this is the only way, then do we need both the .pb file and the .onnx and .json, if any?
4. model/Sigmoid. Sample code:
audio = MonoLoader(filename=infile, sampleRate=16000)()
model = TensorflowPredictMusiCNN(graphFilename="danceability-msd-musicnn-1.pb")
predictions = model(audio)
This and a few other cases yield the below error:
Traceback (most recent call last):
File "[danceability.py](http://danceability.py/)", line 12, in <module>
model = TensorflowPredictMusiCNN(graphFilename="danceability-msd-musicnn-1.pb")
File "/home/airflow/.local/lib/python3.7/site-packages/essentia/[standard.py](http://standard.py/)", line 44, in __init__
self.configure(**kwargs)
File "/home/airflow/.local/lib/python3.7/site-packages/essentia/[standard.py](http://standard.py/)", line 64, in configure
self.__configure__(**kwargs)
RuntimeError: Error while configuring TensorflowPredictMusiCNN: TensorflowPredict: 'model/Sigmoid' is not a valid node name of this graph.
TensorflowPredict: Available node names are:
model/Placeholder, dense/kernel, dense/kernel/read, dense/bias, dense/bias/read, model/dense/MatMul, model/dense/BiasAdd, model/dense/Relu, dense_1/kernel, dense_1/kernel/read, dense_1/bias, dense_1/bias/read, model/dense_1/MatMul, model/dense_1/BiasAdd, model/Softmax.
Reconfigure this algorithm with valid node names as inputs and outputs before starting the processing.
Any recommendation as to how to fix this?
5. Lastly, a minor issue. Any recommendation on how to suppress the verbose output/warnings from TF? e.g. Could not load dynamic library 'libcudart.so.11.0' etc. I was thinking something like what's described here on SOF - ?
1. By default, use the latest version available at https://essentia.upf.edu/models/. In this case, confusion may appear from the difference between the danceability classifiers (v1 and v2) and classification heads (v1). Both options should work but, currently, we recommend using the classification heads (these are the ones with examples on our site). Note that this is especially convenient in order to reuse the embeddings for multiple classifiers.
2. Yes, for now it is the responsibility of the user to download a set and indicate the path to the models.
3. You don't need to have the models in your codebase. Just set graphFilename to the /path/to/your/model.pb. You don't need the onnx files, but the json contains information such as the names of the output classes and the names of the input/output nodes (layers) of the model.
4. These models have a Softmax instead of a sigmoid output layer; set output=model/Softmax in TensorflowPredictMusiCNN for these cases (see the sketch below).
5. Set TF_CPP_MIN_LOG_LEVEL=3, e.g.:
TF_CPP_MIN_LOG_LEVEL=3 python my_script.py
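For point 4, a minimal sketch (the model path is a placeholder):

from essentia.standard import MonoLoader, TensorflowPredictMusiCNN

audio = MonoLoader(filename="audio.wav", sampleRate=16000)()
# override the default output node (model/Sigmoid) with the Softmax node this graph exposes
model = TensorflowPredictMusiCNN(
    graphFilename="danceability-msd-musicnn-1.pb",
    output="model/Softmax",
)
predictions = model(audio)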
Thank you for your fast reply @palonso; very helpful!
- So the .json files are informative, but are they required by the library? Sounds like not?
- For the predictions and embeddings that we get from the various models, what's the general strategy for their use? What I mean is, if we want to get a classifier-type value for Danceability, the doc says "Music danceability (2 classes): danceable, not_danceable". How does one map the emitted predictions and/or embeddings to the class values such as danceable vs. not danceable? Either as a boolean or, preferably, as float-type qualifiers.
So the .json files are informative; but are they required by the library? sounds like not?
Correct, they are not needed by the library.
For the predictions and embeddings that we get from the various models, what's the general strategy for their use? What I mean is, if we want to get a classifier-type value for Danceability, the doc says "Music danceability (2 classes): danceable, not_danceable". How does one map the emitted predictions and/or embeddings to the class values such as danceable vs. not danceable? Either as a boolean or, preferably, as float-type qualifiers.
The embeddings are an intermediate representation to get the predictions through the classification heads. The predictions are a 2D matrix [timestamps, classes] since models operate on windows of 2 seconds. A common way to process the predictions is to average the temporal dimension (first axis), which gives you a vector of overall probabilities for each class.
Hi @palonso, sorry, could you elaborate?
audio = MonoLoader(filename=infile, sampleRate=16000)()
model = TensorflowPredictMusiCNN(graphFilename="danceability-musicnn-msd-2.pb")
predictions = model(audio)
The 2D matrix looks like this, for example:
[0.3348741 0.6392199]
[0.30579954 0.66699344]
[0.32352865 0.65015364]
[0.3717062 0.6356408]
[0.3720143 0.6452822]
[0.39193273 0.6317995 ]
....
[timestamps, classes]
Do I understand this correctly, in that, using the example values above, the probability for "is danceable" would be the average of [ 0.3348741, 0.30579954, 0.32352865, ..., 0.39193273 ] (the first column) and the probability for "is not danceable" would be the average of [ 0.6392199, 0.66699344, 0.65015364, ..., 0.6317995 ] (the second column)?
The doc just says, "danceable, not_danceable" for the classes; I just want to make sure I'm processing the output predictions correctly.
set TF_CPP_MIN_LOG_LEVEL=3
To set this in code, I had to add os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3" before the essentia/tf imports, otherwise it didn't have an effect.
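In other words, something along these lines:

import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # must be set before the essentia / TF imports

from essentia.standard import MonoLoader, TensorflowPredictMusiCNN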
Also, could you outline (or point me at) any steps required to get essentia-tensorflow to run on a GPU (or GPUs)?
Actually, for that, we're wondering how useful running on a GPU would be in terms of performance gains. Any ballpark gauge? Thanks.
Do I understand this correctly, in that, using the example values above, the probability for "is danceable" would be the average of [ 0.3348741, 0.30579954, 0.32352865, ..., 0.39193273 ] (the first column) and the probability for "is not danceable" would be the average of [ 0.6392199, 0.66699344, 0.65015364, ..., 0.6317995 ] (the second column)?
Correct.
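As a small sketch, assuming the documented class order (danceable, not_danceable):

danceable, not_danceable = predictions.mean(axis=0)
print("danceable: %.3f, not danceable: %.3f" % (danceable, not_danceable))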
Also, could you outline (or point me at) any steps required to get essentia-tensorflow to run on a GPU (or GPUs)?
You need to install the CUDA and CuDNN libraries. An option is to use a package manager such as conda as explained on TensorFlow's installation guide. In our case, we need CUDA==11.2 and CUDNN=8.1: conda install -c conda-forge -y cudatoolkit=11.2 cudnn=8.1.
You can expect speed improvements up to 2X, since the extraction of the input mel spectrograms for the models still happens in the CPU and can not be accelerated.
Thanks, @palonso.
Curious about the Music style classification algo, discogs-effnet. The doc there refers to "400 styles from the Discogs taxonomy"; I'm counting about 388 in that list. Actually, the labels.py file has 400. Am I concluding correctly that the idea is to use that labels.py as the ultimate 'source of truth' for the list of predicted genres?
More interestingly, the output of this:
audio = MonoLoader(filename=infile, sampleRate=16000)()
model = TensorflowPredictEffnetDiscogs(graphFilename="discogs-effnet-bs64-1.pb")
predictions = model(audio)
For my sample file, I got a list of 167 numpy.ndarrays, each of which tends to be close to 400 in length but generally isn't exactly 400; it's 396, 388, etc. It's not clear to me how these predictions can be mapped to the labels.
Could you describe the process of mapping of these predictions to the actual genre labels?
Also, how could we manage the updates to the genre list? If more genres are added going forward, I'd like to structure our code to be sensitive/flexible to such changes.
For danceability, this works:
loader = MonoLoader(filename=infile, sampleRate=16000)()
model = TensorflowPredictMusiCNN(graphFilename="./models/danceability-musicnn-msd-2.pb")
predictions = model(loader)
Is it possible to express this using the >> operator?
pool = essentia.Pool()
loader.audio >> model.signal
model.predictions >> (pool, "danc_predictions")
essentia.run(loader)
This yields:
Traceback (most recent call last):
File "danceability.py", line 51, in <module>
loader.audio >> model.signal
AttributeError: 'numpy.ndarray' object has no attribute 'audio'
Another question on the loader reuse. I'm using EasyLoader for extracting things like loudness and BPM; I'm using MonoLoader with the sampleRate of 16K for danceability and mood. Is it possible to set thing up so as to use a single loader instance, e.g. the MonoLoader with the 16K sampleRate? Or might that negatively affect other extractions such as e.g. loudness or BPM?
Hi @palonso,
I got a list of 167 numpy.ndarray's each of which tends to be close to 400 in length but generally not; it's 396, 388, etc.
I've seen this issue once and have not seen it since, so far.
Question on the approachability model. The doc states that
The models output either two (approachability_2c) or three (approachability_3c) levels of approachability, or continuous values (approachability_regression).
I assume that the 2c model outputs the approachable/non_approachable classes. What about the 3c? What does the third class signify? Which of the 3 models would you recommend as the most accurate?
For example, for the same file, I'm seeing:
- with 2c: 0.1091, 0.8909
- with 3c: 0.0451, 0.5422, 0.4127
- with regression: 0.584
Similar questions on the engagement model, too.
For arousal/valence, would you recommend the DEAM or the Muse model for general processing of files? Thanks.
Hi @palonso ,
Question on the MTG-Jamendo genre algo. The doc states that it yields 87 classes. However, I'm getting 167, not 87. Any idea as to what the other 80 classes are?
audio = MonoLoader(filename=infile, sampleRate=16000, resampleQuality=4)()
embedding_model = TensorflowPredictEffnetDiscogs(
    graphFilename="./models/discogs-effnet-bs64-1.pb", output="PartitionedCall:1"
)
embeddings = embedding_model(audio)
model = TensorflowPredict2D(graphFilename="./models/mtg_jamendo_genre-discogs-effnet-1.pb")
predictions = model(embeddings)
print(">> num preds: {}".format(len(predictions)))
Thank you, @palonso. Could you explain the set of 3 predictions? How does it work, since I presume one is for the "approachable" class and one for "non_approachable"; what's the third class for?
I assume that the 2c model outputs the approachable/non_approachable classes. What about the 3c? What does the third class signify? Which of the 3 models would you recommend as the most accurate?
2c's output is low-approachability, high-approachability. 3c's output is low-approachability, medium-approachability, high-approachability.
The regression model outputs continuous values from 0 to 1 from low to high and performed the best in our internal evaluation.
The same applies to the engagement model.
For arousal/valence, would you recommend the DEAM or the Muse model for general processing of files?
What do you mean by general processing of files? Both datasets contain music data only, so the resulting models shouldn't be expected to perform well with other types of signals (e.g., solo instruments, speech), although we never assessed the performance of the models in these scenarios.
According to our study, models based on emoMusic obtained the best performance. Between DEAM and Muse, DEAM is better for arousal, and Muse is better for valence.
Question on the MTG-Jamendo genre algo. The doc states that it yields 87 classes. However, I'm getting 167, not 87. Any idea as to what the other 80 classes are?
Already explained above
Thank you @palonso , as always, very helpful.
MTG-Jamendo genre algo. The doc states that it yields 87 classes. However, I'm getting 167, not 87. Already explained above...
model = TensorflowPredict2D(graphFilename="./models/mtg_jamendo_genre-discogs-effnet-1.pb")
predictions = model(embeddings)
Sorry for the repeat, but your explanation seems to be related to a different case. I'm still not grokking how to map 167 predictions to 87 classes - ?
The first dimension is not the number of classes but the number of timestamps. This number will vary with the length of the input audio since our models generate predictions every 3 seconds.
For most applications, it's enough to generate a single overall value by taking the mean:
overall_results = np.mean(predictions, axis=0)
Hi @palonso. Sorry, not following. If you generate the mean of predictions, that's a single value. How does it map to the 87 classes? i.e. how does it indicate the predicted styles?
(Actually, never mind, this yields a numpy.ndarray of 87 values :) :) )
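For reference, a sketch of pairing the averaged activations with the class names, continuing from the snippet above (it assumes labels is the 87-entry class list taken from the model's metadata .json / labels file):

import numpy as np

overall = np.mean(predictions, axis=0)  # shape: (87,)
ranked = sorted(zip(labels, overall), key=lambda kv: kv[1], reverse=True)
print(ranked[:5])  # top-5 genre activations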