speech-recognition
Inference question
Thanks for the excellent blog post; I'm looking forward to training my own model when I get back from vacation.
In the meantime I went to test your pre-trained model on a 5-second audio clip from my local news broadcast.
I built and used your Dockerfile (thank you, so nice to see people using Docker), then ran the image using make bash. At that point I ran the very simple bit of Python below.
import librosa
import json

SAMPLING_RATE = 16000
wave, _ = librosa.load("cbs8.6am.2019.03.28.full_16kHz-3362.94+5.84.wav", sr=SAMPLING_RATE)
result = {
    "inputs": {
        "audio": wave.tolist(),
        "length": len(wave)
    }
}
with open("payload.json", "w") as f_out:
    f_out.write(json.dumps(result))
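As a quick sanity check on the payload format (a minimal sketch using a synthetic list of zeros in place of the librosa output, so it runs without the WAV file), the JSON round-trips with the length metadata intact:

```python
import json

# Synthetic stand-in for the librosa output: 5 seconds of silence at 16 kHz.
SAMPLING_RATE = 16000
wave = [0.0] * (5 * SAMPLING_RATE)

payload = {
    "inputs": {
        "audio": wave,
        "length": len(wave)
    }
}

with open("payload.json", "w") as f_out:
    json.dump(payload, f_out)

# Round-trip check: the serialized length matches the audio array.
with open("payload.json") as f_in:
    loaded = json.load(f_in)

assert loaded["inputs"]["length"] == len(loaded["inputs"]["audio"]) == 80000
print("payload OK:", loaded["inputs"]["length"], "samples")
```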
Test:
curl -d @payload.json -X POST http://{my_ip_address}:8501/v1/models/speech:predict > output.json
The output
{
"outputs": {
"text": [
[
"s",
"c",
"l",
"o",
"a",
"i",
"e",
"w",
"e",
"r",
"e",
" ",
"l",
"a",
"h",
" ",
"d",
"l",
"r",
"e",
" ",
"i",
" ",
"a",
"t",
" ",
"-",
"r",
"i",
"s",
"p",
"d",
" ",
"o",
"t",
" ",
"o",
" ",
"t",
"o",
"u",
"n",
"u",
" ",
"s",
"a",
"m",
" ",
"s",
"o",
" ",
"r",
" ",
"v",
"i",
"e",
"c",
"t",
" ",
"d",
"n"
]
],
"logits": []
}
}
Text: scloaiewere lah dlre i at -rispd ot o tounu sam so r viect dn
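For reference, the Text line above comes from joining the per-character list in the response; a minimal sketch using a trimmed stand-in for output.json (the real file has the full character list):

```python
import json

# Trimmed stand-in for output.json: same structure, first five characters only.
response_json = '{"outputs": {"text": [["s", "c", "l", "o", "a"]], "logits": []}}'
response = json.loads(response_json)

# The model returns one character list per batch item; join the first item.
text = "".join(response["outputs"]["text"][0])
print(text)  # -> scloa
```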
Actual Transcript:
A STUDY FROM GLASS DOOR ECONOMIC RESEARCH FOUND WOMEN ON AVERAGE EARN 79 CENTS FOR EVERY DOLLAR
Obviously I'm doing something wrong passing the WAV to tensorflow/serving.
Input file: https://www.dropbox.com/s/ioncjg5dcbd08p3/cbs8.6am.2019.03.28.full_16kHz-3362.94%2B5.84.wav?dl=0