Batch parallel CPU decoding.
With CPU inference I can control the number of PyTorch threads for conformer inference simply by calling torch.set_num_threads(desired_num), and it works: I observe an almost linear speedup of the conformer inference as the number of threads increases.
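For reference, a minimal sketch of the setup I mean (`model` and `features` are placeholders for my conformer and its input):

```python
import torch

torch.set_num_threads(8)  # placeholder thread count

# `model` and `features` are placeholders; the output shape matches
# the (batch_size, seq_len, num_bpe) log_probs described below.
with torch.no_grad():
    log_probs = model(features)
```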
During HLG decoding, however, it always seems to run single-threaded, and there appears to be no way to change this. I wonder whether it is possible to implement parallelization over the utterances in a batch. That seems like a natural way to parallelize this task, and by doing it at the C++ level one could avoid any Python overhead and keep the main Python pipeline as simple "single-threaded" inference. I know that in Sherpa you prefer to handle threads in Python and release the GIL when necessary, but for my use case it would be nice to have the option to execute _k2.intersect_dense_pruned in parallel, with the number of threads provided as a parameter.
Will it be difficult to implement?
OK, I looked through the code, and I believe k2.OnlineDenseIntersecter should do the trick: it has a num_streams parameter, which, as far as I understand, is the number of concurrent streams and should equal the batch size. I tried this parameter, but unfortunately I don't observe any speedup; the decoding process is still single-threaded. Am I doing anything wrong?
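In case it helps, this is roughly how I construct it (the argument names are from my reading of the code, and the beam/state values are placeholders, so I may well be misusing it):

```python
import k2

# `HLG` is my decoding graph; beam and active-state values are placeholders.
intersecter = k2.OnlineDenseIntersecter(
    decoding_graph=HLG,
    num_streams=batch_size,  # my understanding: one stream per utterance
    search_beam=20.0,
    output_beam=8.0,
    min_active_states=30,
    max_active_states=10000,
)
```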
In general, I just want to solve the following task: I have log_probs of shape (batch_size, seq_len, num_bpe) obtained from the neural net, and all I want is to decode with multiple concurrent threads, parallelizing over the samples in the batch.
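For concreteness, my current single-threaded decoding looks roughly like this (beam and active-state values are placeholders; `HLG` is the decoding graph):

```python
import torch
import k2

# supervision_segments: (num_utts, 3) int32 rows of
# [utterance_index, start_frame, num_frames]
supervision_segments = torch.tensor(
    [[i, 0, seq_len] for i in range(batch_size)], dtype=torch.int32
)
dense_fsa_vec = k2.DenseFsaVec(log_probs, supervision_segments)

lattice = k2.intersect_dense_pruned(
    HLG,
    dense_fsa_vec,
    search_beam=20.0,
    output_beam=8.0,
    min_active_states=30,
    max_active_states=10000,
)
best_path = k2.shortest_path(lattice, use_double_scores=True)
```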
@csukuangfj @pkufool Can you please advise me on how to do it properly?
k2 is not optimized for CPU. Can you start as many threads as num_batch to process the data?
Note: if you are using Python, you may need to change the Python binding code in k2 to release Python's GIL.
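A minimal sketch of that suggestion: split the batch into one-utterance pieces and decode each in its own thread (beam values are placeholders; note that without the GIL-release change mentioned above, the threads will still mostly run one at a time):

```python
from concurrent.futures import ThreadPoolExecutor

import torch
import k2

def decode_one(i: int):
    # One-utterance DenseFsaVec: [index_within_this_vec, start, duration]
    segment = torch.tensor([[0, 0, seq_len]], dtype=torch.int32)
    dense = k2.DenseFsaVec(log_probs[i : i + 1], segment)
    lattice = k2.intersect_dense_pruned(
        HLG,
        dense,
        search_beam=20.0,
        output_beam=8.0,
        min_active_states=30,
        max_active_states=10000,
    )
    return k2.shortest_path(lattice, use_double_scores=True)

with ThreadPoolExecutor(max_workers=batch_size) as pool:
    best_paths = list(pool.map(decode_one, range(batch_size)))
```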