kaldi Faster Cuda Decoder

Several issues were recently discovered in the CUDA decoder, in both offline and online modes.

With my fixes, I achieve 7800 RTFx throughput in offline mode on LibriSpeech test-clean with the model https://kaldi-asr.org/models/m13 on an A100-80GB PCIe card. Previously, because of some unnoticed software regressions, this number was as low as 4000 RTFx, which admittedly isn't bad.
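For readers unfamiliar with the metric: RTFx is the inverse real-time factor, i.e. seconds of audio decoded per second of wall-clock time. A minimal sketch of the arithmetic follows. The function name `rtfx` and the example numbers are illustrative assumptions, not measurements from this PR.

```shell
# rtfx AUDIO_SECONDS WALL_SECONDS
# Prints seconds of audio decoded per second of wall-clock time,
# rounded to the nearest integer. (Hypothetical helper, not part of Kaldi.)
rtfx() {
    awk -v audio="$1" -v wall="$2" 'BEGIN { printf "%.0f\n", audio / wall }'
}

# Hypothetical example: one hour of audio decoded in half a second.
rtfx 3600 0.5   # prints 7200
```

At 7800 RTFx, one hour of audio takes a bit under half a second of GPU wall-clock time.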

Latency is more complicated, but here is a preliminary result with this model https://kaldi-asr.org/models/m13 on librispeech test-clean:

This was achieved via the following hyperparameter sweep:

for chunk_size in 21 30 40 50; do
    for num_streaming_channels in 1000 2000 3000 4000 5000 6000; do
        # Cap the batch size at 4000 even when more channels are streaming.
        max_batch_size=$((num_streaming_channels > 4000 ? 4000 : num_streaming_channels))
        # Kaldi logs to stderr; capture it so the latency statistics can be parsed below.
        /home/dgalvez/scratch/code/asr/kaldi-a100-perf//src/cudadecoderbin/batched-wav-nnet3-cuda-online \
            --num-channels=$((num_streaming_channels * 2)) --cuda-use-tensor-cores=true \
            --main-q-capacity=30000 --aux-q-capacity=400000 --cuda-memory-proportion=0.5 \
            --max-batch-size=$max_batch_size --cuda-worker-threads=12 --file-limit=-1 \
            --cuda-decoder-copy-threads=4 --batching-copy-threads=8 \
            --frame-subsampling-factor=3 --frames-per-chunk=$chunk_size --max-mem=100000000 \
            --beam=10 --lattice-beam=7 --acoustic-scale=1.0 --determinize-lattice=true \
            --max-active=10000 --iterations=10 \
            --config=/home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//models/LibriSpeech//conf/online.conf \
            --num-parallel-streaming-channels=$num_streaming_channels \
            --word-symbol-table=/home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//models/LibriSpeech//words.txt \
            /home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//models/LibriSpeech//final.mdl \
            /home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//models/LibriSpeech//HCLG.fst \
            scp:/home/dgalvez/scratch/code/asr/kaldi-a100-perf/workspace//datasets/LibriSpeech/test_clean//wav_conv.scp \
            'ark:|gzip -c > /tmp/results/LibriSpeech/52/0/lat.gz' 2> output.log
        # Append the four latency fields from the line after "Latencies", then the sweep parameters, as one CSV row.
        grep -A 1 "Latencies" output.log | grep -v "Latencies" | awk 'BEGIN { OFS = ","; ORS = ""} {print $3,$4,$5,$6}' >> "$result_file"
        echo ",${chunk_size},${num_streaming_channels},${max_batch_size}" >> "$result_file"
    done
done

Note that setting the maximum batch size lower than the number of channels sometimes gives better results. Average latency is, of course, much smaller. This means users can decode 3000-4000 audio streams concurrently in real time.

This is the "compute" latency. It doesn't include the time spent waiting for the right hand context (21 frames, or 210 ms in this case). The point is that it is incredibly fast.
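To make the distinction concrete, here is a minimal sketch of the arithmetic. The 10 ms frame shift is inferred from "21 frames, or 210 ms"; the helper names and the 15 ms compute latency are hypothetical values used only for illustration.

```shell
frame_shift_ms=10   # assumption: inferred from "21 frames, or 210 ms"

# frames_to_ms FRAMES: convert a frame count to milliseconds.
frames_to_ms() {
    echo $(( $1 * frame_shift_ms ))
}

# total_latency_ms RIGHT_CONTEXT_FRAMES COMPUTE_MS
# Time spent waiting for the right context plus the compute latency.
total_latency_ms() {
    echo $(( $(frames_to_ms "$1") + $2 ))
}

frames_to_ms 21          # prints 210
total_latency_ms 21 15   # prints 225 (15 ms is a made-up compute latency)
```

The numbers reported in this PR correspond only to the second term of that sum.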

Dec 13 '22 17:12 galv

FYI, CI is failing with:

extras/check_dependencies.sh: python2.7 is not installed
extras/check_dependencies.sh: Some prerequisites are missing; install them using the command:
  sudo apt-get install python2.7
make: *** [Makefile:39: check_required_programs] Error 1

Dec 13 '22 18:12 galv

Fixed CI (so far)

Dec 13 '22 19:12 galv

FYI @danpovey you might find these very low latencies exciting. I'm going to be incorporating this into https://github.com/nvidia-riva/riva-asrlib-decoder (via the kaldi submodule within that project) so that CTC models (and hopefully something like your FSA-based RNN-T decoder) can benefit as well.

Dec 13 '22 22:12 galv

@galv thread-pool-light.h is deleted from the latest commit, but batched-threaded-nnet3-cuda-pipeline.h still calls it.

Dec 14 '22 07:12 ravi-shanker-m

@ravi-shanker-m that's been deprecated for a few years now: https://github.com/kaldi-asr/kaldi/blob/be22248e3a166d9ec52c78dac945f471e7c3a8aa/src/cudadecoder/batched-threaded-nnet3-cuda-pipeline.h#L35-L36

I'm happy to go ahead and remove that code.

Were you using it for some reason?

Dec 14 '22 16:12 galv

To answer that: I tried both v1 and v2 of the CUDA decoder, and v1 gave a better RTF in my case, so my old service uses that implementation. Maybe the NVIDIA-provided Kaldi Docker image has parameters optimized for their computing resources, and I have not experimented enough.

Dec 14 '22 16:12 trunglebka

@trunglebka, I'm happy to provide advice if you give more detail. I would sincerely doubt that you reach anywhere near 8000 RTFx on the v1 cuda decoder on an A100 (or whatever GPU you are using).

The nvidia kaldi container is not anything special. It's just a pre-built kaldi from open source with some CI to make sure that nothing has broken. You can reproduce my work by running the librispeech model I linked in the first comment on librispeech test-clean, using the command line flags I specify.

Dec 14 '22 16:12 galv

I've left my old company. In my case, after some experiments tuning parameters with the NVIDIA Kaldi Docker image on a T4, v1 gave me about 500 RTFx but v2 only about 350 RTFx. Because of a deadline I did not have time to experiment more, so I just picked v1. I think it may be a problem with my choice of parameters.

Dec 14 '22 17:12 trunglebka

@trunglebka Okay. I found several performance problems with the v2 decoder during my work on making this PR and this is very close to the "speed of light", so I'm not concerned about the v1 decoder being any better than this one.

Dec 14 '22 17:12 galv

Yeah, I just wanted to give you context on where v1 is being used.

Dec 14 '22 17:12 trunglebka

> FYI @danpovey you might find these very low latencies exciting. I'm going to be incorporating this into https://github.com/nvidia-riva/riva-asrlib-decoder (via the kaldi submodule within that project) so that CTC models (and hopefully something like your FSA-based RNN-T decoder) can benefit as well.

Yes, that's cool! Thanks!

Dec 15 '22 14:12 danpovey

@galv plz merge once you feel it's complete

Feb 13 '23 21:02 jtrmal

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

Apr 26 '23 01:04 stale[bot]

@galv good to merge?

Apr 26 '23 08:04 jtrmal

Hi, could you tell me whether ivectors were used for the 7800 RTFx result? No config file for ivectors is passed in the script above. Also, with chunk size = 30, batched-wav-nnet3-cuda-online fails with: Assertion failed: ("Please set --frames-per-chunk at least as large as the neural net " "right context" && input_frames_per_chunk_ >= total_nnet_right_context_)
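The condition in that assertion can be checked before launching a sweep. This is a minimal sketch: `check_chunk_size` is a hypothetical helper, and the right-context value is supplied by the caller as an assumption, since the decoder computes its own total right context for the model and that value may differ from the nominal one.

```shell
# check_chunk_size FRAMES_PER_CHUNK TOTAL_RIGHT_CONTEXT_FRAMES
# Mirrors the asserted condition: frames-per-chunk >= total right context.
# Returns non-zero and prints a message to stderr when it is violated.
check_chunk_size() {
    if [ "$1" -lt "$2" ]; then
        echo "--frames-per-chunk=$1 is smaller than the right context ($2 frames)" >&2
        return 1
    fi
}

# Hypothetical values, for illustration only.
check_chunk_size 50 21 && echo "ok to launch"   # prints "ok to launch"
```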

May 30 '23 08:05 zulkarneev

@zulkarneev does the same issue happen with the previous version of the decoder?

May 30 '23 10:05 danpovey

Dan, what version do you mean?

May 30 '23 13:05 zulkarneev

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

Aug 10 '23 04:08 stale[bot]
