mycroft-precise
Benchmark with Picovoice
Have you considered adding your engine as a comparison in: https://github.com/Picovoice/wake-word-benchmark ?
Hey there, it's not a priority for us at the moment but if anyone wanted to train Precise models on their wake word data and give it a go, I'd be interested to see the results.
I only had a quick skim through the page; however, it seems like it would be fairly easy to game the results by overfitting your model to the limited number of samples used in the test. Hopefully they're using a broader test set that just hasn't been published (or I might have missed it).
Do you think it is a good evaluation of a wake-word model, assuming the model is not trained on any test samples? For example, the Alexa test is 25 hours with 329 wake words sprinkled throughout, used only once.
If Precise needs Alexa training data, then there could be a 50/50 split of speakers for train/test.
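Something like this is what I have in mind for the split, a rough sketch that assumes each clip's path starts with a speaker ID directory (as in LibriSpeech); the function name and seed are just placeholders, and the globbing would need adapting to however the Alexa data is actually laid out:

```python
import glob
import os
import random

def speaker_split(root, seed=0):
    """Split by speaker (not by clip) so no test speaker appears in training."""
    by_speaker = {}
    for path in glob.glob(os.path.join(root, '**', '*.flac'), recursive=True):
        speaker = os.path.relpath(path, root).split(os.sep)[0]
        by_speaker.setdefault(speaker, []).append(path)
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    half = len(speakers) // 2
    train = [p for s in speakers[:half] for p in by_speaker[s]]
    test = [p for s in speakers[half:] for p in by_speaker[s]]
    return train, test
```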
No, it's not a good benchmarking comparison.
- It's done by an involved party.
- Testing is not completed by an independent testing body.
- It's not clear what parameters were used when testing the other tools, or whether those were tuned for this test the way their own tool's were.
- It's not clear that the methodology is viable. What were the other options? What metrics are most useful? Are those being calculated in a correct manner?
In short, they have provided a marketing repo, which may be hiding something useful.
Can you suggest a benchmarking methodology? I am pretty close to an independent party; I am looking to compare established wake-word detectors to the one from this paper. My motivation is basically to train a detector (similar architecture to Mycroft's) that performs at nearly the same level, which I then make robust to adversarial attacks.
I am mostly asking about how the testing data is created. Miss rate (MR) at a low false-alarm rate (like 1 FA per hour) seems to be pretty relevant to real-life use.
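To pin down what I mean by those two numbers (the counts below are purely hypothetical, only the 329 wake words / 25 hours come from the Alexa test mentioned above):

```python
def miss_rate(n_missed_wakewords, n_wakewords_total):
    return n_missed_wakewords / n_wakewords_total

def false_alarms_per_hour(n_false_alarms, hours_of_negative_audio):
    return n_false_alarms / hours_of_negative_audio

# e.g. 12 missed out of 329 wake words -> MR ~= 0.036,
# and 3 false alarms over 25 hours -> 0.12 FA per hour
print(miss_rate(12, 329), false_alarms_per_hour(3, 25))
```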
When I have created test datasets in the past, I have used a 1:5 positive:negative example ratio by time, where at least 80% of the negative examples are speech by time, always with background noise like DEMAND mixed in at between 5 and 15 dB SNR. I have also sometimes modified the positive and negative speech by time dilation/frequency shifting for lack of data.
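For reference, this is roughly how I do the 5-15 dB mixing, a minimal sketch that assumes the speech and noise clips are already loaded as float32 arrays at the same sample rate (DEMAND resampled to 16 kHz in my case); the helper name and the per-example SNR draw are just how I happen to do it:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a noise clip into a speech clip at the requested SNR (in dB)."""
    # Loop the noise if it is shorter than the speech, then take a random crop.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    start = np.random.randint(0, len(noise) - len(speech) + 1)
    noise = noise[start:start + len(speech)]
    # Scale the noise so that 10*log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    mixed = speech + gain * noise
    # Normalise only if the mix clips.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed

# snr_db is drawn per example, e.g. np.random.uniform(5, 15)
```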
Their approach has a much higher ratio, about 1:91 positive:negative by time. Perhaps this is more relevant to real life, though, since most voice assistant users would only trigger it 4-10 times a day.
I'm looking for any tips that could improve my next paper, so feel free to share any advice!
For benchmarking?
- What's the most important metric for your usage? What's the second most important?
- How can you repeatably measure those metrics? What tests can you repeatably run to measure them?
- Can those tests be repeated by anyone independently?
- Is your testing and data only viable for a point in time, or will it be usable over a length of time?
- How do you ensure that your testing covers the relevant use cases?
- Does the data all come from one source? Does it need to be pre-processed? Does it need to be licensed?
- Does it cover the entire spectrum of usage? If not, why not?
(Edited to add: some discussion about this in chat: https://chat.mycroft.ai/community/pl/i4xmgfsgjfgxprazafgsaw1q7r)
Another option many people choose is to find an end goal and work to that, then say they've made their benchmarks.
As for the rest... the dataset creation ratio you have sounds similar to the recommended training ratios. The other parts I don't worry about much, since I'm using end-goal methods to improve my custom wake word.
The actual false-positives benchmark is good because it takes the LibriSpeech train-clean-100 dataset and runs it through the KWS, and that dataset is fixed and well defined. With a rough Python hack you can create something like this:
```python
import glob
import os

import numpy as np
import soundfile as sf
import tensorflow as tf


def softmax_stable(x):
    return np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum()


def kw_detect(rec, sample_rate, duration, reset_state):
    rec = np.reshape(rec, (1, int(sample_rate * duration)))
    # rec = np.multiply(rec, 8)
    if reset_state:
        for s in range(len(input_details1)):
            inputs1[s] = np.zeros(input_details1[s]['shape'], dtype=np.float32)
    # Make a prediction from the model.
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # Set input states (index 1...).
    for s in range(1, len(input_details1)):
        interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # Get the output states and set them back as input states,
    # which will be fed in on the next inference cycle.
    for s in range(1, len(input_details1)):
        # get_tensor() returns a copy of the tensor data;
        # use tensor() to get a pointer to the tensor instead.
        inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
    out_softmax = softmax_stable(output_data[0])
    return out_softmax[0]


# Parameters
duration = 0.020      # 20 ms frames (320 samples at 16 kHz)
sample_rate = 16000
num_channels = 1

# Load the streaming TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(
    model_path="../GoogleKWS/models2/crnn_state/quantize_opt_for_size_tflite_stream_state_external/stream_state_external.tflite",
    num_threads=2)
interpreter1.allocate_tensors()

# Get input and output tensor details (index 0 is the audio, the rest are states).
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []
for s in range(len(input_details1)):
    inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))

kw_hit_qty = 0
total_duration = 0.0
hit_txt = []
reset_state = True
kw_hit_rbuff = np.zeros(13, dtype=np.float32)

# Walk the LibriSpeech transcript files; each line names a .flac utterance.
for txtfile in glob.glob('/media/stuart/New Volume/Users/Stuart/Downloads/Noise/LibriSpeech/**/*.txt', recursive=True):
    dirtxt = os.path.dirname(txtfile)
    with open(txtfile) as f:
        lines = f.readlines()
    for line in lines:
        frame = 0
        kw_hit = False
        content = line.split(" ", 1)
        flacfile = dirtxt + '/' + content[0] + '.flac'
        data, samplerate = sf.read(flacfile, dtype='float32')
        total_duration = total_duration + (len(data) / samplerate)
        # Step through the clip in 20 ms frames (capped at 100 frames here).
        while frame < 100:
            start = 320 * frame
            rec = data[start:start + 320]
            if len(rec) < 320:
                break
            kw_prob = kw_detect(rec, sample_rate, duration, reset_state)
            if kw_prob > 0.9993:
                kw_hit = True
                reset_state = True
                kw_hit_rbuff = np.zeros(13, dtype=np.float32)
                print(flacfile, kw_prob, frame)
            else:
                reset_state = False
            frame += 1
        if kw_hit:
            kw_hit_qty += 1
            hit_txt.append(flacfile)
    # Running total: false accepts so far and hours of audio processed.
    print(kw_hit_qty, total_duration / 3600)
print(kw_hit_qty, total_duration / 3600)
```
Benchmarks are only comparative and not that great; the above isn't exactly the same test, but it uses the same dataset so it should be comparable. They also state 1 false alarm per 10 hours, and since they supplied an integer it's quite likely it was rounded from just below 1.5. Then you get to adding noise at -10 dB: it's totally random which noise is added to which KW, to which part of a clip, and which noise dataset you use, which is why I just use the clean set. There is probably an already-noisy dataset that is set in stone which could be used, but the noise shouldn't change things that much compared to clean, as the -10 dB is presumably set against the RMS amplitude; otherwise short-duration noise would be off the scale. Hence, for something that is purely a comparison of a benchmark, this is perfectly applicable for false positives, but false negatives are much harder, as there is no dataset set in stone and "bring your own KW dataset" is obviously very subjective.
Probably the best way is to have a model zoo, firstly so models can be shared, but also so people can test them themselves and reuse them, especially if they're good. The TFLite model is here: https://drive.google.com/file/d/1bGf_b8imzPZJNYDUWR94mWuV0deSMVzM/view?usp=share_link and the dataset is here: https://www.openslr.org/resources/12/train-clean-100.tar.gz They are relatively tiny, like the above 'hey marvin' one I have done.
With `if kw_prob > 0.999:` it gives 12 false positives (over 100 hours) for 1.21% false negatives (over the 1400 KW samples), but thinking about it my 'test' set isn't that great: because ml-commons is full of dross I have to clean the dataset using previous models before training.
That bit is not of much interest though, as who cares what results someone else's voice gives for the KW; the Libri test is solid, but for false negatives just try the model yourself.
Whereas with `if kw_prob > 0.9999:` it gives 1 false positive (over 100 hours) for 3.78% false negatives (over the 1400 KW samples).
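If you dump the per-clip peak probability for both runs (the LibriSpeech negatives and the 1400 KW clips) you can sweep the threshold without re-running inference and read off trade-offs like the 0.999 vs 0.9999 ones above; a rough sketch, with the variable names just made up here:

```python
import numpy as np

def sweep(neg_peaks, pos_peaks, neg_hours, thresholds):
    """Trade off false accepts vs misses by sweeping the detection threshold."""
    neg_peaks = np.asarray(neg_peaks)  # peak prob per LibriSpeech clip (no KW present)
    pos_peaks = np.asarray(pos_peaks)  # peak prob per keyword clip
    rows = []
    for t in thresholds:
        false_pos = int(np.sum(neg_peaks > t))              # clips that fired with no KW
        miss_pct = 100.0 * float(np.mean(pos_peaks <= t))   # KW clips that never fired
        rows.append((t, false_pos, false_pos / neg_hours, miss_pct))
    return rows

# e.g. for t, fp, fp_per_hour, fn_pct in sweep(neg_peaks, pos_peaks, 100.0,
#                                              [0.999, 0.9993, 0.9999]):
#          print(t, fp, round(fp_per_hour, 3), round(fn_pct, 2))
```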
The 1400 KW files are here: https://drive.google.com/file/d/1dreV5fBIwzdcJnXEueYwc4NeWCyufdS-/view?usp=share_link but as said, it's your own voice (as long as the signal is good) that matters.
There should be a model zoo, and likely there should be some benchmarks, but maybe there just isn't the will or wish. You could have some benchmarks purely to give some info so someone can test a KW with their own voice dataset, 'hey marvin' in this case.