
Apply Viterbi algorithm to predict the voiced/unvoiced state of every frame based on the confidence array

Open sannawag opened this issue 5 years ago • 5 comments

This feature delimits regions of activation and silence (in monophonic recordings). I am submitting a pull request in case it would be useful for others as well, and I am very open to feedback.

The modification was added as a function in core: "predict_voicing". The function returns a sequence of 0s and 1s indicating the predicted voicing state of each frame, according to a Gaussian HMM.

Some more details about the function and API modification: "predict_voicing" can be called independently after calling "predict", as described in the update to the documentation. It is also possible to set the "apply-voicing" flag when calling crepe from the command line. This causes the program to call "predict_voicing", multiply the result with the "frequency" array (setting unvoiced frames to zero), and save the new array, "voiced_frequency", to disk.
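For context, a minimal sketch of the proposed workflow, assuming "predict_voicing" is exposed at the package level (the file-reading step is illustrative, and the exact signature may differ from the PR):

import crepe
from scipy.io import wavfile

# Standard CREPE prediction on a monophonic recording
sr, audio = wavfile.read('vocals.wav')
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)

# Proposed addition: collapse the confidence array into 0/1 voicing states
voicing = crepe.predict_voicing(confidence)

# What the "apply-voicing" flag would do: zero out unvoiced frames
voiced_frequency = frequency * voicing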

sannawag avatar Aug 30 '18 14:08 sannawag

More information about this method can be found in Section 2.2 of the pYIN paper: https://www.eecs.qmul.ac.uk/~simond/pub/2014/MauchDixon-PYIN-ICASSP2014.pdf

0b01 avatar May 20 '20 01:05 0b01

@sannawag @0b01 CREPE already supports viterbi decoding: crepe.predict(audio, sr, viterbi=True). For voicing activation we've found that a simple threshold on the returned voicing confidence values works well (where the confidence value is given by the maximum activation value in the activation matrix for each frame).
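For illustration, a minimal sketch of such thresholding (the 0.5 cutoff is an arbitrary example, not a recommended value):

import crepe
from scipy.io import wavfile

sr, audio = wavfile.read('vocals.wav')
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)

# A frame is considered voiced if its confidence exceeds the threshold
voiced = confidence > 0.5
frequency[~voiced] = 0.0  # zero out unvoiced frames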

@sannawag could you perhaps elaborate a little more on why this feature is needed and how it differs from what's already supported?

Thanks!

justinsalamon avatar Jun 02 '20 19:06 justinsalamon

Thanks for your response! @justinsalamon @0b01 My goal is to use the output of the program to determine when the singer is active versus silent at the perceptual level, which should change on the scale of seconds, not milliseconds. If I set a hard threshold on the confidence, I get rapid alternation between the two states, visible as the thick vertical lines in the plots below (generated with the --viterbi flag set). That's what I hope to smooth out using Viterbi; a sketch of the idea follows the plot code below.

(Two screenshots: plots of the thresholded voicing sequence over time, showing rapid alternation between voiced and unvoiced states as thick vertical lines.)

Code for this plot:

import csv
import matplotlib.pyplot as plt
import numpy as np

# Parse the CREPE output CSV (columns: time, frequency, confidence)
f0 = []
conf = []
thresh = 0.5

with open('MUSDB18HQ/train/Music Delta - Hendrix/vocals.f0.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            # Header row
            print(f'Column names are {", ".join(row)}')
            line_count += 1
        else:
            f0.append(float(row[1]))    # frequency
            conf.append(float(row[2]))  # confidence
            line_count += 1
    print(f'Processed {line_count} lines.')

# Hard threshold: a frame is voiced iff its confidence exceeds thresh
voiced = [1 if c > thresh else 0 for c in conf]
# plt.plot(np.array(f0) * np.array(voiced))
plt.plot(np.array(voiced))
plt.show()
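And a minimal sketch of the kind of smoothing I have in mind, using hmmlearn's GaussianHMM (the parameters are illustrative, not those of the PR's "predict_voicing"; conf and plt come from the code above):

import numpy as np
from hmmlearn import hmm

def smooth_voicing(confidence, self_transition=0.99):
    # Two-state Gaussian HMM over the confidence values:
    # state 0 = unvoiced (low confidence), state 1 = voiced (high confidence).
    # A high self-transition probability penalizes rapid state switching.
    model = hmm.GaussianHMM(n_components=2, covariance_type='diag')
    model.startprob_ = np.array([0.5, 0.5])
    model.transmat_ = np.array([[self_transition, 1.0 - self_transition],
                                [1.0 - self_transition, self_transition]])
    model.means_ = np.array([[0.1], [0.9]])     # illustrative emission means
    model.covars_ = np.array([[0.05], [0.05]])  # illustrative variances
    # Viterbi decoding of the most likely 0/1 state sequence
    return model.predict(np.asarray(confidence).reshape(-1, 1))

plt.plot(smooth_voicing(conf))
plt.show()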

sannawag avatar Jun 03 '20 00:06 sannawag

Thanks @sannawag, I'll have to give this a closer look, so it might take some time before I can give more feedback.

As a general note, it's helpful to first post an issue to discuss the problem, solution, and implementation details before making a PR, so we can reach consensus on those things prior to implementation. It's our fault for not providing contribution guidelines (to be amended via #58).

Could you please open an issue, explain what the problem is (as you have here via these plots and code), describe your proposed solution, and cross-reference this PR?

Thanks!

justinsalamon avatar Jun 03 '20 19:06 justinsalamon

Thanks for the feedback, @justinsalamon! I've submitted an issue.

sannawag avatar Jun 08 '20 02:06 sannawag