speech-recognition
Inference question
Thanks for the excellent blog post; I'm looking forward to training my own model when I get back from vacation.
In the meantime I went to test your pre-trained model on a 5-second audio clip from my local news broadcast.
I built and used your Dockerfile (thank you, so nice to see people using Docker), then ran the image using make bash. At that point I ran the very simple bit of Python below.
import librosa
import json

SAMPLING_RATE = 16000
wave, _ = librosa.load("cbs8.6am.2019.03.28.full_16kHz-3362.94+5.84.wav", sr=SAMPLING_RATE)
result = {
    "inputs": {
        "audio": wave.tolist(),
        "length": len(wave)
    }
}
with open("payload.json", "w") as f_out:
    f_out.write(json.dumps(result))
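As a quick sanity check on the payload format (a minimal sketch using a synthetic list of zeros in place of the librosa output, so it runs without the WAV file), the JSON round-trips with the length metadata intact:

```python
import json

# Synthetic stand-in for the librosa output: 5 seconds of silence at 16 kHz.
SAMPLING_RATE = 16000
wave = [0.0] * (5 * SAMPLING_RATE)

payload = {
    "inputs": {
        "audio": wave,
        "length": len(wave)
    }
}

with open("payload.json", "w") as f_out:
    json.dump(payload, f_out)

# Round-trip check: the serialized length matches the audio array.
with open("payload.json") as f_in:
    loaded = json.load(f_in)

assert loaded["inputs"]["length"] == len(loaded["inputs"]["audio"]) == 80000
print("payload OK:", loaded["inputs"]["length"], "samples")
```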
Test:
curl -d @payload.json -X POST http://{my_ip_address}:8501/v1/models/speech:predict > output.json
The output
{
"outputs": {
"text": [
[
"s",
"c",
"l",
"o",
"a",
"i",
"e",
"w",
"e",
"r",
"e",
" ",
"l",
"a",
"h",
" ",
"d",
"l",
"r",
"e",
" ",
"i",
" ",
"a",
"t",
" ",
"-",
"r",
"i",
"s",
"p",
"d",
" ",
"o",
"t",
" ",
"o",
" ",
"t",
"o",
"u",
"n",
"u",
" ",
"s",
"a",
"m",
" ",
"s",
"o",
" ",
"r",
" ",
"v",
"i",
"e",
"c",
"t",
" ",
"d",
"n"
]
],
"logits": []
}
}
Text: scloaiewere lah dlre i at -rispd ot o tounu sam so r viect dn
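For reference, the Text line above comes from joining the per-character list in the response; a minimal sketch using a trimmed stand-in for output.json (the real file has the full character list):

```python
import json

# Trimmed stand-in for output.json: same structure, first five characters only.
response_json = '{"outputs": {"text": [["s", "c", "l", "o", "a"]], "logits": []}}'
response = json.loads(response_json)

# The model returns one character list per batch item; join the first item.
text = "".join(response["outputs"]["text"][0])
print(text)  # -> scloa
```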
Actual Transcript:
A STUDY FROM GLASS DOOR ECONOMIC RESEARCH FOUND WOMEN ON AVERAGE EARN 79 CENTS FOR EVERY DOLLAR
Obviously I'm doing something wrong passing the WAV to tensorflow/serving.
Input file: https://www.dropbox.com/s/ioncjg5dcbd08p3/cbs8.6am.2019.03.28.full_16kHz-3362.94%2B5.84.wav?dl=0