
batching during inferencing

Open · abuvaneswari opened this issue Sep 18 '17 · 21 comments

Hello, does native_client support inference on more than one audio file at the same time? I am looking to use my GPU for inference and to optimize its utilization by batching requests from multiple audio files.

abuvaneswari avatar Sep 18 '17 18:09 abuvaneswari

My program to test a whole CSV file: https://pastebin.mozilla.org/9032686 (it exports the elapsed time and the number of inferences done).

Capture: https://pastebin.mozilla.org/9032687

elpimous avatar Sep 18 '17 22:09 elpimous

Inference on a 15-second WAV file on a GTX 1080 GPU takes 7 seconds. Is that expected? It seems like a long time to me.

abuvaneswari avatar Sep 20 '17 17:09 abuvaneswari

Do you use the existing binaries, or did you compile native_client yourself (with the CUDA option)?

elpimous avatar Sep 20 '17 18:09 elpimous

Yes, I compiled native_client with the CUDA option.

abuvaneswari avatar Sep 20 '17 18:09 abuvaneswari

Well, on my small TX2, without overclocking, I just ran a test on a batch of 101 WAV files (average 3 s per WAV): 101 inferences in 48.903 s, so inference takes roughly a third of the audio duration. You just have to compare GPU boards in an online benchmark.

elpimous avatar Sep 20 '17 19:09 elpimous

My program to test a whole CSV file: https://pastebin.mozilla.org/9032686 (it exports the elapsed time and the number of inferences done).

Capture: https://pastebin.mozilla.org/9032687

Hi @elpimous, thanks for the answer. The links are broken now, could you please share them again? Best regards

nicolaspanel avatar Nov 28 '18 22:11 nicolaspanel

@elpimous @kdavis-mozilla It would be great to have this feature. I can work on the PR, but since it will take some time to develop, could you first confirm that it is something you are interested in? Best regards

nicolaspanel avatar Dec 05 '18 14:12 nicolaspanel

@nicolaspanel I find it interesting. @reuben, what's your take?

kdavis-mozilla avatar Dec 05 '18 21:12 kdavis-mozilla

@nicolaspanel I'm definitely interested in having this feature. Do you have an idea of what the batch API would look like?

reuben avatar Dec 05 '18 21:12 reuben

It would be great to have this feature in the DeepSpeech pre-built binary as well: inference on more than one audio file at the same time. Currently, I've written a Python script and pass the audio file names one by one.

nullbyte91 avatar Dec 06 '18 05:12 nullbyte91
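For illustration, a minimal sketch of this one-file-at-a-time approach, assuming the pre-0.6 stt(audio, fs) signature used in the client snippet below; the helper name and file handling are just illustrative, and ds is an already-constructed deepspeech Model:

import wave

import numpy as np

def transcribe_files(ds, wav_paths):
    # ds: an already-constructed deepspeech Model (see the client snippet below)
    transcripts = []
    for path in wav_paths:
        with wave.open(path, 'rb') as fin:
            fs = fin.getframerate()  # expected to match the model's sample rate, e.g. 16000
            audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
        transcripts.append(ds.stt(audio, fs))  # one independent inference per file, no batching
    return transcripts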

@nicolaspanel I'm definitely interested in having this feature. Do you have an idea of what the batch API would look like?

Right now, the Python client.py flow looks like this:

import wave
import numpy as np
from deepspeech import Model

ds = Model(...)
fin = wave.open(args.audio, 'rb')
fs = fin.getframerate()  # 16000
audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)  # audio.shape => (n_frames,)
transcript = ds.stt(audio, fs)

We could just add a Model#stt_batch -> List[str] method expecting a (BATCH_SIZE, max(n_frames)) array and a frame rate as inputs:

ds = Model(...)
fin = wave.open(args.audio, 'rb')
fs = fin.getframerate()  # 16000
audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)  # audio.shape => (n_frames,)
audios = audio.reshape((1, audio.shape[0]))  # audios.shape => (BATCH_SIZE, n_frames)
transcripts = ds.stt_batch(audios, fs)  # -> List[str]

The tricky part is, of course, the underlying DeepSpeech/native_client/deepspeech.cc code.

nicolaspanel avatar Dec 06 '18 13:12 nicolaspanel
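For illustration, a sketch of how a caller might assemble the proposed (BATCH_SIZE, max(n_frames)) input from clips of different lengths, assuming a hypothetical stt_batch like the one sketched above existed; zero-padding the shorter clips is an assumption here, since the proposal does not say how length differences would be handled:

import numpy as np

def pad_batch(audios):
    # Stack 1-D int16 clips into a zero-padded (BATCH_SIZE, max(n_frames)) array.
    max_frames = max(a.shape[0] for a in audios)
    batch = np.zeros((len(audios), max_frames), dtype=np.int16)
    for i, a in enumerate(audios):
        batch[i, :a.shape[0]] = a
    return batch

# Hypothetical usage, following the proposal above:
# transcripts = ds.stt_batch(pad_batch([audio_a, audio_b]), fs)  # -> List[str]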

(Caveat: we're currently running on 0.2.0-alpha.7, but my findings seem to be consistent with everyone else's.)

I was just about to put through a feature request rather similar to this.

In our work we've noticed an inference rate of about 0.3 s for every 1 s of audio uploaded, but we can also see that both the CPU and GPU are underutilised. With CPU-only inference on my 4-core (8-thread) laptop, the CPU hovers around 25% during inference and only achieves slower-than-real-time performance.

With a GPU we achieve the previously mentioned (by @elpimous) 0.3 s of inference for every 1 second of audio, but even there we can see by inspecting nvidia-smi that the GPU is pretty underutilised. At the moment, for our 300 hours of audio, GPU inference still takes about 5-6 days to run completely, which is disappointingly slow.

We can spin up more instances, or more expensive ones, but if we could achieve a significant enough speedup through batching, it would be a big win for evaluating and optimising new models.

All of this is to say, thanks @nicolaspanel for looking into this. I'll also be keen to hear how much of an impact this makes on local inference times.

mathematiguy avatar Dec 30 '18 11:12 mathematiguy

At the moment, for our 300 hours of audio, GPU inference still takes about 5-6 days to run completely, which is disappointingly slow.

It might help if you could explain your use case.

lissyx avatar Dec 30 '18 11:12 lissyx

I’m just transcribing audio to evaluate our model performance. We want evaluations to be fast so we can try out lots of model parameters.

I could sample instead, but since it’s the Christmas break I didn’t mind leaving a long running job.

It seems to me that batching jobs would also help speed up our current transcription API, which can take audio files over an hour long. It also sits on an AWS EC2 instance at the moment, so faster inference means we could possibly reduce costs.

mathematiguy avatar Dec 31 '18 00:12 mathematiguy

For evaluating model performance I'd strongly encourage you to use evaluate.py rather than the clients, as it's optimized for throughput rather than latency. It takes the same arguments as DeepSpeech.py but only does evaluation, so it'll only look at test_files, test_batch_size, etc. You'll need to point it at a checkpoint (with --checkpoint_dir) rather than at a frozen model.

reuben avatar Dec 31 '18 00:12 reuben
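For reference, a hedged sketch of driving evaluate.py with the flags mentioned above, wrapped in a Python subprocess call; the paths and batch size are placeholders, and depending on the DeepSpeech version additional flags (for example, alphabet or language-model settings) may also be required:

import subprocess

# Placeholder paths; evaluate.py accepts the same flags as DeepSpeech.py but
# only runs evaluation against the test set.
subprocess.run([
    "python", "evaluate.py",
    "--test_files", "/path/to/test.csv",
    "--test_batch_size", "32",
    "--checkpoint_dir", "/path/to/checkpoint_dir",
], check=True)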

@nicolaspanel I'm definitely interested in having this feature. Do you have an idea of what the batch API would look like?

@reuben @kdavis-mozilla Since everyone here seems interested in such a feature, maybe we could include it in an upcoming release (https://github.com/mozilla/DeepSpeech/projects). What do you think?

Like I said, I can contribute if needed.

nicolaspanel avatar Feb 12 '19 12:02 nicolaspanel

Chiming in again: I didn't have access to evaluate.py since we were on an old 0.2.0 alpha release, but having updated to 0.4.1, I've been very pleased with evaluate.py's performance. Five hours to transcribe 300 hours of audio on one GPU machine, bloody awesome.

So consider me happy, where this is concerned.

mathematiguy avatar Feb 12 '19 14:02 mathematiguy

@nicolaspanel Are you working on this feature? I am also interested in such a feature.

CP-4 avatar Nov 05 '19 04:11 CP-4

@reuben Has someone started working on that feature?

phtephanx avatar Jan 16 '20 17:01 phtephanx

It would be great to have this feature in the DeepSpeech pre-built binary as well: inference on more than one audio file at the same time. Currently, I've written a Python script and pass the audio file names one by one.

@nullbyte91 I'm trying to run audio files one by one through a Python script with the DeepSpeech 0.6.1 model. Could you please help me out?

rakeshku93 avatar Apr 27 '20 04:04 rakeshku93

For evaluating model performance I'd strongly encourage you to use evaluate.py rather than the clients, as it's optimized for throughput rather than latency. It takes the same arguments as DeepSpeech.py but only does evaluation, so it'll only look at test_files, test_batch_size, etc. You'll need to point it at a checkpoint (with --checkpoint_dir) rather than at a frozen model.

It is also possible to load a frozen graph like this (posting it just in case someone else needs it). A model_path flag was added to FLAGS.

# This runs inside DeepSpeech's evaluate.py, where FLAGS, session and
# load_graph_for_evaluation are already defined.
import tensorflow.compat.v1 as tfv1

if FLAGS.model_path:
    with tfv1.gfile.FastGFile(FLAGS.model_path, 'rb') as fin:
        graph_def = tfv1.GraphDef()
        graph_def.ParseFromString(fin.read())

    # import the frozen graph and pull out the tensors that correspond to the
    # training graph's trainable variables
    var_names = [v.name for v in tfv1.trainable_variables()]
    var_tensors = tfv1.import_graph_def(graph_def, return_elements=var_names)

    # build a { var_name: var_tensor } dict
    var_tensors = dict(zip(var_names, var_tensors))

    training_graph = tfv1.get_default_graph()

    # assign each restored tensor onto the matching variable in the training graph
    assign_ops = []
    for name, restored_tensor in var_tensors.items():
        training_tensor = training_graph.get_tensor_by_name(name)
        assign_ops.append(tfv1.assign(training_tensor, restored_tensor))

    init_from_frozen_model_op = tfv1.group(*assign_ops)
    session.run(init_from_frozen_model_op)
else:
    load_graph_for_evaluation(session)

vidklopcic avatar Nov 17 '20 17:11 vidklopcic