
Is there any way to control how long after the last word is spoken before Vosk closes the session?

dv8inpp opened this issue 2 years ago • 16 comments

I am using the Python implementation and would like to limit how long the system will wait before closing the session.

Are there any parameter files I can create?

python3 ./asr_server.py /opt/vosk-model-en/model

dv8inpp avatar Apr 06 '22 03:04 dv8inpp

Hi,

Same question here. Could you please confirm which parameter sets the maximum silence threshold? Currently it looks like it is very short.

muyousif avatar Apr 25 '22 04:04 muyousif

Also looking at this.

Goddard avatar Aug 23 '22 14:08 Goddard

You can change the following params in model.conf:

--endpoint.rule2.min-trailing-silence=0.5
--endpoint.rule3.min-trailing-silence=1.0
--endpoint.rule4.min-trailing-silence=2.0

You can scale them all up proportionally.

nshmyrev avatar Aug 23 '22 14:08 nshmyrev
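For instance, doubling each threshold in the model's conf/model.conf makes the recognizer wait twice as long before treating trailing silence as the end of an utterance. A sketch with illustrative values (keep whatever other options your model ships with):

--endpoint.rule2.min-trailing-silence=1.0
--endpoint.rule3.min-trailing-silence=2.0
--endpoint.rule4.min-trailing-silence=4.0

These are Kaldi online-endpointing rules, so scaling all three by the same factor preserves their relative behavior. The file is read when the model is loaded, so restart the server after editing it.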

Thanks for your response. Will using this keep the audio stream from the Asterisk server to my websocket server from ending before the call ends?

Goddard avatar Aug 23 '22 15:08 Goddard

Thanks for your response. Will using this keep the audio stream from the Asterisk server to my websocket server from ending before the call ends?

No, the current module stops the stream after every result. Unfortunately, that is how the Asterisk speech module works. It would be nice to have a long-transcription mode, though.

nshmyrev avatar Aug 23 '22 15:08 nshmyrev

I see. Because of this limitation, I am also working on an alternative approach.

This plugin: https://github.com/nadirhamid/asterisk-audiofork

It provides a continuous audio stream, but of course it doesn't work with Vosk out of the box. What do you think would be needed to adapt the code to use this binary audio stream?

Goddard avatar Aug 23 '22 15:08 Goddard
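For context, the audiofork module is invoked from the Asterisk dialplan. A minimal, illustrative sketch assuming a websocket server listening on port 2700 (the extension pattern and port are placeholders; see the plugin's README for the exact application options):

exten => _X.,1,Answer()
 same => n,AudioFork(ws://127.0.0.1:2700/)
 same => n,Dial(PJSIP/${EXTEN})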

What do you think would be needed to adapt the code to use this binary audio stream?

You can just adapt the backend server, https://github.com/nadirhamid/audiofork-transcribe-demo. There is no need to update the audiofork module itself; it should work the same way.

nshmyrev avatar Aug 23 '22 20:08 nshmyrev

The audiofork transcribe demo uses Google's closed-source transcription.

How would I adapt it, especially if I wanted to use an open-source option?

Goddard avatar Aug 23 '22 20:08 Goddard
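A minimal sketch of such an adaptation: keep the audiofork module and its websocket transport, and swap the Google streaming client in the demo backend for a Vosk recognizer. The model path, port, and the assumption that AudioFork delivers raw 16-bit signed linear PCM (slin) at 8 kHz as binary frames are all illustrative; match SAMPLE_RATE to your channel format.

# Sketch of an open-source audiofork backend: Vosk in place of Google.
import asyncio
import json

import websockets
from vosk import Model, KaldiRecognizer

SAMPLE_RATE = 8000  # assumed slin @ 8 kHz; match your Asterisk channel format
model = Model("/opt/vosk-model-en/model")

# Note: older versions of the websockets library pass a second `path` argument.
async def recognize(websocket):
    rec = KaldiRecognizer(model, SAMPLE_RATE)  # one recognizer per forked call
    async for frame in websocket:
        if not isinstance(frame, bytes):
            continue  # skip any text/control frames
        if rec.AcceptWaveform(frame):
            print(json.loads(rec.Result()))        # final result for a phrase
        else:
            print(json.loads(rec.PartialResult())) # running partial hypothesis
    print(json.loads(rec.FinalResult()))           # flush remaining audio at hangup

async def main():
    async with websockets.serve(recognize, "0.0.0.0", 2700):
        await asyncio.Future()  # serve until cancelled

asyncio.run(main())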

OK, I made a script, but I am getting significant slowdowns. I've tried configuring the beam and other things you have suggested, but the results still lag behind. This is on a CPU. Is there anything I can try to improve the speed to near real time?

https://gist.github.com/Goddard/b86c0469c42e1f4c415f37354a5f30db

Goddard avatar Oct 04 '22 16:10 Goddard

Is there anything I can try to improve the speed to near real time?

What is your hardware, and how many streams are you trying to process?

nshmyrev avatar Oct 04 '22 16:10 nshmyrev

In my tests I am only doing one stream. The results are okay, but the transcription reports processing times of about 0.2 to 0.9 ms. The time it takes to actually show the results in the terminal is more like one to three seconds, though. This seems to compound over time.

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
Stepping: 4
CPU MHz: 2600.095
BogoMIPS: 5200.00
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp fsgsbase smep erms xsaveopt md_clear flush_l1d

Goddard avatar Oct 04 '22 16:10 Goddard

The results are okay, but the transcription reports processing times of about 0.2 to 0.9 ms.

That is a very small delay.

How much memory do you have?

nshmyrev avatar Oct 04 '22 20:10 nshmyrev

64 gigs

That is what the timer reports, but I was expecting the processing to be asynchronous between transcriptions, so the delays wouldn't build up to longer than the person has been speaking.

Unless I have an issue with my script, I don't see a way to increase the speed, because that is just the transcription time for each partial, and there can be several partials. If you have 20 partials each taking 0.2 to 0.9 seconds, that sometimes adds up to a 10-second delay before you get the full transcription.

Does vosk-api use a VAD as well? Do you think this would speed it up?

Goddard avatar Oct 04 '22 21:10 Goddard
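One thing worth checking in a script like this: if AcceptWaveform runs directly inside the asyncio receive loop, recognition blocks the event loop while audio keeps arriving, which produces exactly this kind of compounding delay. The stock vosk-server avoids that by pushing the blocking call onto a thread pool. A minimal sketch of the pattern (model path, sample rate, and pool size illustrative):

# Sketch: keep the websocket receive loop responsive by running Vosk's
# blocking AcceptWaveform in a worker thread, as the stock vosk-server does.
import asyncio
import concurrent.futures

from vosk import Model, KaldiRecognizer

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)
model = Model("/opt/vosk-model-en/model")

def process_chunk(rec, chunk):
    # Runs in a worker thread; returns the JSON result string.
    if rec.AcceptWaveform(chunk):
        return rec.Result()
    return rec.PartialResult()

async def recognize(websocket):
    loop = asyncio.get_running_loop()
    rec = KaldiRecognizer(model, 8000)
    async for chunk in websocket:
        # Awaiting here serializes chunks per connection, so the recognizer
        # is never used concurrently, while other connections keep running.
        result = await loop.run_in_executor(pool, process_chunk, rec, chunk)
        print(result)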

Do you see this delay with the asterisk-audiofork module or with vosk-asterisk?

nshmyrev avatar Oct 04 '22 21:10 nshmyrev

Even when running everything locally I get poor results, for example with https://github.com/alphacep/vosk-server/tree/master/websocket-microphone

Python will claim the transcription process only took milliseconds, but it really takes a few seconds for the data to print to the screen.

Sometimes it takes 4 seconds for the text to be printed to the terminal. I don't think it is a case of Python being slow, because even the websocket-cpp Boost.Beast example appears to lag behind considerably.

The vosk-asterisk plugin appears to be a bit faster, but the transcription ends before the call does, so it isn't very useful.

I installed using a Python virtual environment and pip requirements.txt on Ubuntu 22.04.

My local machine is a newer Intel CPU with 64 gigs as well:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
CPU family: 6
Model: 141
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 1
CPU max MHz: 4600.0000
CPU min MHz: 800.0000
BogoMIPS: 4608.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md_clear flush_l1d arch_capabilities
Virtualization: VT-x
Caches (sum of all):
  L1d: 384 KiB (8 instances)
  L1i: 256 KiB (8 instances)
  L2: 10 MiB (8 instances)
  L3: 24 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerabilities:
  Itlb multihit: Not affected
  L1tf: Not affected
  Mds: Not affected
  Meltdown: Not affected
  Mmio stale data: Not affected
  Retbleed: Not affected
  Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
  Srbds: Not affected
  Tsx async abort: Not affected

The only thing I see is:

WARNING (VoskAPI:CheckMemoryUsage():determinize-lattice-pruned.cc:316) Did not reach requested beam in determinize-lattice: size exceeds maximum 50000000 bytes; (repo,arcs,elems) = (25158432,1108448,23744520), after rebuilding, repo size was 21053120, effective beam was 5.49789 vs. requested beam 6
WARNING (VoskAPI:CheckMemoryUsage():determinize-lattice-pruned.cc:316) Did not reach requested beam in determinize-lattice: size exceeds maximum 50000000 bytes; (repo,arcs,elems) = (27757600,743808,21513912), after rebuilding, repo size was 24994144, effective beam was 4.20504 vs. requested beam 6

Goddard avatar Oct 04 '22 23:10 Goddard
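Those warnings mean lattice determinization is hitting its memory cap, which is itself a sign the decoder is exploring a large search space. If the model's conf/model.conf exposes the usual Kaldi decoding options, lowering them trades a little accuracy for speed; illustrative values (whether these lines already exist is model-specific):

--max-active=3000
--beam=10.0
--lattice-beam=4.0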

For example, using the Boost.Beast websocket example provided, it takes approximately 4 seconds for the speech recognition output to print.

I used the websocket microphone example connected to a remote Boost.Beast websocket server:

INFO:root:{ "text" : "testing testing one two three" }

But even locally I experience the same thing. Would a GPU be faster than that?

Goddard avatar Oct 05 '22 12:10 Goddard