pocketsphinx_continuous is unable to convert speech to text
Hi all, I recently compiled pocketsphinx for a MIPS architecture, recorded a 16-bit, 16000 Hz mono audio file, and tried to run pocketsphinx_continuous with the following command:
pocketsphinx_continuous -hmm en-us/ -lm TAR9897/9897.lm -dict TAR9897/9897.dic -infile input.wav
I get no recognized text from the audio on the MIPS device, but the same command works on my laptop. Please help.
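Before comparing the device and the laptop, it can help to confirm that the recording really has the format described (mono, 16-bit, 16000 Hz). The following is a minimal sketch using Python's standard `wave` module; the helper names `describe_wav` and `matches_expected` are made up for illustration and are not part of pocketsphinx.

```python
import sys
import wave


def describe_wav(path):
    """Return (channels, sample_width_bytes, sample_rate_hz, n_frames) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return (wav.getnchannels(), wav.getsampwidth(),
                wav.getframerate(), wav.getnframes())


def matches_expected(path, channels=1, sample_width=2, rate=16000):
    """True if the file is mono, 16-bit (2 bytes per sample), 16 kHz by default."""
    ch, width, sr, _ = describe_wav(path)
    return (ch, width, sr) == (channels, sample_width, rate)


if __name__ == "__main__":
    # Usage: python check_wav.py input.wav
    for name in sys.argv[1:]:
        status = "OK" if matches_expected(name) else "MISMATCH"
        print(name, describe_wav(name), status)
```

Running it on the same `input.wav` on both machines rules out a format difference before looking at the decoder itself.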
I am sharing the logs below.
INFO: pocketsphinx.c(153): Parsed model-specific feature parameters from en-us//feat.params
Current configuration:
-agc none none
-agcthresh 2.0 2.000000e+00
-allphone_ci yes yes
-alpha 0.97 9.700000e-01
-ascale 20.0 2.000000e+01
-aw 1 1
-backtrace no no
-beam 1e-48 1.000000e-48
-bestpath yes yes
-bestpathlw 9.5 9.500000e+00
-ceplen 13 13
-cmn live batch
-cmninit 40,3,-1 41.00,-5.29,-0.12,5.09,2.48,-4.07,-1.37,-1.78,-5.08,-2.05,-6.45,-1.42,1.17
-compallsen no no
-dict TAR9897/9897.dic
-dictcase no no
-dither no no
-doublebw no no
-ds 1 1
-feat 1s_c_d_dd 1s_c_d_dd
-fillprob 1e-8 1.000000e-08
-frate 100 100
-fsgusealtpron yes yes
-fsgusefiller yes yes
-fwdflat yes yes
-fwdflatbeam 1e-64 1.000000e-64
-fwdflatefwid 4 4
-fwdflatlw 8.5 8.500000e+00
-fwdflatsfwin 25 25
-fwdflatwbeam 7e-29 7.000000e-29
-fwdtree yes yes
-hmm en-us/
-input_endian little little
-kws_delay 10 10
-kws_plp 1e-1 1.000000e-01
-kws_threshold 1e-30 1.000000e-30
-latsize 5000 5000
-ldadim 0 0
-lifter 0 22
-lm TAR9897/9897.lm
-logbase 1.0001 1.000100e+00
-logspec no no
-lowerf 133.33334 1.300000e+02
-lpbeam 1e-40 1.000000e-40
-lponlybeam 7e-29 7.000000e-29
-lw 6.5 6.500000e+00
-maxhmmpf 30000 30000
-maxwpf -1 -1
-min_endfr 0 0
-mixwfloor 0.0000001 1.000000e-07
-mmap yes yes
-ncep 13 13
-nfft 512 512
-nfilt 40 25
-nwpen 1.0 1.000000e+00
-pbeam 1e-48 1.000000e-48
-pip 1.0 1.000000e+00
-pl_beam 1e-10 1.000000e-10
-pl_pbeam 1e-10 1.000000e-10
-pl_pip 1.0 1.000000e+00
-pl_weight 3.0 3.000000e+00
-pl_window 5 5
-remove_dc no no
-remove_noise yes yes
-remove_silence yes yes
-round_filters yes yes
-samprate 16000 1.600000e+04
-seed -1 -1
-silprob 0.005 5.000000e-03
-smoothspec no no
-svspec 0-12/13-25/26-38
-tmatfloor 0.0001 1.000000e-04
-topn 4 4
-topn_beam 0 0
-transform legacy dct
-unit_area yes yes
-upperf 6855.4976 6.800000e+03
-uw 1.0 1.000000e+00
-vad_postspeech 50 50
-vad_prespeech 20 20
-vad_startspeech 10 10
-vad_threshold 3.0 3.000000e+00
-varfloor 0.0001 1.000000e-04
-varnorm no no
-verbose no no
-warp_type inverse_linear inverse_linear
-wbeam 7e-29 7.000000e-29
-wip 0.65 6.500000e-01
-wlen 0.025625 2.562500e-02
INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='batch', VARNORM='no', AGC='none'
INFO: acmod.c(162): Using subvector specification 0-12/13-25/26-38
INFO: mdef.c(518): Reading model definition: en-us//mdef
INFO: mdef.c(531): Found byte-order mark BMDF, assuming this is a binary mdef file
INFO: bin_mdef.c(337): Reading binary model definition: en-us//mdef
INFO: bin_mdef.c(517): 42 CI-phone, 137053 CD-phone, 3 emitstate/phone, 126 CI-sen, 5126 Sen, 29324 Sen-Seq
INFO: tmat.c(149): Reading HMM transition probability matrices: en-us//transition_matrices
INFO: acmod.c(113): Attempting to use PTM computation module
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: en-us//means
INFO: ms_gauden.c(242): 42 codebook, 3 feature, size:
INFO: ms_gauden.c(244): 128x13
INFO: ms_gauden.c(244): 128x13
INFO: ms_gauden.c(244): 128x13
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: en-us//variances
INFO: ms_gauden.c(242): 42 codebook, 3 feature, size:
INFO: ms_gauden.c(244): 128x13
INFO: ms_gauden.c(244): 128x13
INFO: ms_gauden.c(244): 128x13
INFO: ms_gauden.c(304): 222 variance values floored
INFO: ptm_mgau.c(475): Loading senones from dump file en-us//sendump
INFO: ptm_mgau.c(562): Rows: 128, Columns: 5126
INFO: ptm_mgau.c(594): Using memory-mapped I/O for senones
INFO: ptm_mgau.c(837): Maximum top-N: 4
INFO: phone_loop_search.c(114): State beam -225 Phone exit beam -225 Insertion penalty 0
INFO: dict.c(320): Allocating 4130 * 20 bytes (80 KiB) for word entries
INFO: dict.c(333): Reading main dictionary: TAR9897/9897.dic
INFO: dict.c(213): Dictionary size 29, allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(336): 29 words read
INFO: dict.c(358): Reading filler dictionary: en-us//noisedict
INFO: dict.c(213): Dictionary size 34, allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(361): 5 words read
INFO: dict2pid.c(396): Building PID tables for dictionary
INFO: dict2pid.c(406): Allocating 42^3 * 2 bytes (144 KiB) for word-initial triphones
INFO: dict2pid.c(132): Allocated 21336 bytes (20 KiB) for word-final triphones
INFO: dict2pid.c(196): Allocated 21336 bytes (20 KiB) for single-phone word triphones
INFO: ngram_model_trie.c(354): Trying to read LM in trie binary format
INFO: ngram_model_trie.c(365): Header doesn't match
INFO: ngram_model_trie.c(177): Trying to read LM in arpa format
INFO: ngram_model_trie.c(193): LM of order 3
INFO: ngram_model_trie.c(195): #1-grams: 26
INFO: ngram_model_trie.c(195): #2-grams: 41
INFO: ngram_model_trie.c(195): #3-grams: 36
INFO: lm_trie.c(474): Training quantizer
INFO: lm_trie.c(482): Building LM trie
INFO: ngram_search_fwdtree.c(74): Initializing search tree
INFO: ngram_search_fwdtree.c(101): 26 unique initial diphones
INFO: ngram_search_fwdtree.c(186): Creating search channels
INFO: ngram_search_fwdtree.c(323): Max nonroot chan increased to 202
INFO: ngram_search_fwdtree.c(333): Created 26 root, 74 non-root channels, 5 single-phone words
INFO: ngram_search_fwdflat.c(157): fwdflat: min_ef_width = 4, max_sf_win = 25
INFO: continuous.c(307): pocketsphinx_continuous COMPILED ON: Mar 10 2020, AT: 16:01:07
INFO: cmn_live.c(120): Update from < 41.00 -5.29 -0.12 5.09 2.48 -4.07 -1.37 -1.78 -5.08 -2.05 -6.45 -1.42 1.17 >
INFO: cmn_live.c(138): Update to < 43.50 12.27 5.78 3.19 -3.29 0.34 -5.26 -11.33 4.52 0.05 -2.95 11.94 -3.49 >
INFO: ngram_search_fwdtree.c(1550): 525 words recognized (6/fr)
INFO: ngram_search_fwdtree.c(1552): 27719 senones evaluated (292/fr)
INFO: ngram_search_fwdtree.c(1556): 15927 channels searched (167/fr), 2366 1st, 9410 last
INFO: ngram_search_fwdtree.c(1559): 724 words for which last channels evaluated (7/fr)
INFO: ngram_search_fwdtree.c(1561): 876 candidate words for entering last phone (9/fr)
INFO: ngram_search_fwdtree.c(1564): fwdtree 1.34 CPU 1.406 xRT
INFO: ngram_search_fwdtree.c(1567): fwdtree 1.37 wall 1.446 xRT
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 10 words
INFO: ngram_search_fwdflat.c(948): 547 words recognized (6/fr)
INFO: ngram_search_fwdflat.c(950): 13371 senones evaluated (141/fr)
INFO: ngram_search_fwdflat.c(952): 11350 channels searched (119/fr)
INFO: ngram_search_fwdflat.c(954): 879 words searched (9/fr)
INFO: ngram_search_fwdflat.c(957): 312 word transitions (3/fr)
INFO: ngram_search_fwdflat.c(960): fwdflat 0.45 CPU 0.474 xRT
INFO: ngram_search_fwdflat.c(963): fwdflat 0.48 wall 0.501 xRT
INFO: ngram_search.c(1250): lattice start node <s>.0 end node </s>.88
INFO: ngram_search.c(1276): Eliminated 0 nodes before end node
INFO: ngram_search.c(1381): Lattice has 211 nodes, 433 links
INFO: ps_lattice.c(1376): Bestpath score: -1763
INFO: ps_lattice.c(1380): Normalizer P(O) = alpha(</s>:88:93) = -105689
INFO: ps_lattice.c(1437): Joint P(O,S) = -110043 P(S|O) = -4354
INFO: ngram_search.c(872): bestpath 0.01 CPU 0.006 xRT
INFO: ngram_search.c(875): bestpath 0.01 wall 0.008 xRT
INFO: ngram_search_fwdtree.c(429): TOTAL fwdtree 1.34 CPU 1.421 xRT
INFO: ngram_search_fwdtree.c(432): TOTAL fwdtree 1.37 wall 1.461 xRT
INFO: ngram_search_fwdflat.c(176): TOTAL fwdflat 0.45 CPU 0.479 xRT
INFO: ngram_search_fwdflat.c(179): TOTAL fwdflat 0.48 wall 0.507 xRT
INFO: ngram_search.c(303): TOTAL bestpath 0.01 CPU 0.006 xRT
INFO: ngram_search.c(306): TOTAL bestpath 0.01 wall 0.008 xRT
Is your MIPS device big-endian or little-endian?
Hi @nshmyrev, this is my other GitHub account. The MIPS architecture on our device is little-endian. When I tried to run another command,
pocketsphinx_continuous -hmm en-us/ -lm TAR9897/9897.lm -dict TAR9897/9897.dic -adcdev sysdefault -inmic Yes
I got the following logs:
[Configuration dump and model-loading messages identical to the first log above]
INFO: continuous.c(252): Ready....
Input overrun, read calls are too rare (non-fatal)
Input overrun, read calls are too rare (non-fatal)
Input overrun, read calls are too rare (non-fatal)
Input overrun, read calls are too rare (non-fatal)
Input overrun, read calls are too rare (non-fatal)
Try recognizing a file with a JSGF grammar.
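As an illustration of that suggestion, here is a small sketch that generates a JSGF grammar accepting exactly one of a fixed set of command phrases. The phrases and the grammar name are hypothetical; every word used must also appear in the dictionary (`TAR9897/9897.dic` in this thread), with matching case.

```python
def build_jsgf(name, phrases):
    """Return JSGF text whose single public rule accepts any one of the phrases."""
    alternatives = " | ".join(f"( {p} )" for p in phrases)
    return (
        "#JSGF V1.0;\n"
        f"grammar {name};\n"
        f"public <command> = {alternatives} ;\n"
    )


# Hypothetical command phrases; replace with words from your own dictionary.
PHRASES = ["LIGHT ON", "LIGHT OFF", "VOLUME UP", "VOLUME DOWN"]

if __name__ == "__main__":
    print(build_jsgf("commands", PHRASES), end="")
```

Saving the output as, say, `commands.gram` and passing it with `-jsgf commands.gram` in place of `-lm TAR9897/9897.lm` should restrict the search to those phrases; the file name is an assumption made for this example.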
As for the second log, decoding is too slow on your device. The CPU is probably very small, and the model needs optimization.
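The xRT figures in the first log give a rough way to check this. xRT is processing time divided by audio duration, so a value above 1 means the decoder is slower than real time, which would be consistent with the "Input overrun" messages during live capture. A minimal sketch that sums the `TOTAL ... CPU ... xRT` lines copied from the first log, assuming the passes run one after another so that their factors add up:

```python
import re

# TOTAL lines copied from the first log in this thread.
LOG = """\
INFO: ngram_search_fwdtree.c(429): TOTAL fwdtree 1.34 CPU 1.421 xRT
INFO: ngram_search_fwdflat.c(176): TOTAL fwdflat 0.45 CPU 0.479 xRT
INFO: ngram_search.c(303): TOTAL bestpath 0.01 CPU 0.006 xRT
"""

PATTERN = re.compile(r"TOTAL (\w+) ([\d.]+) CPU ([\d.]+) xRT")


def total_cpu_xrt(log_text):
    """Sum the per-pass CPU xRT factors reported on the TOTAL lines."""
    return sum(float(m.group(3)) for m in PATTERN.finditer(log_text))


if __name__ == "__main__":
    total = total_cpu_xrt(LOG)
    print(f"combined CPU xRT: {total:.3f}")
    print("slower than real time" if total > 1.0 else "real time or faster")
```

The fwdtree pass alone is already above 1 on this device. The estimate comes from a single utterance of about one second, so treat it as indicative only.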
Thanks for the prompt response. Can you give me a little more detail on optimizing this?
@WilliamVJacob sure, please describe in more detail the application you want to build and the hardware you have in mind.
I want to run pocketsphinx on an embedded device with 2 USB mics, 256 RAM, and 128 MB flash. The processor is a single-core ALI M3627-AL. I want to build an application where the user says some commands and it converts them to text. I have already prepared the dictionary and language model and tested them on a laptop, but it is not working on the embedded device.
Hi, pocketsphinx_continuous no longer exists, but it seems we still have some issues with the mipsel architecture. If you are able to try the current code (with pocketsphinx_batch), I would be interested to see the results.
Why is pocketsphinx_continuous not available anymore?
No more audio library. If there's a reasonable use case for pocketsphinx_continuous we could reimplement it for specific platforms, and there will certainly be a test (there is already, I think) and example code for the continuous listening API. Were people using pocketsphinx_continuous to implement actual applications?
Also, you'll note that the original reporter is using pocketsphinx_continuous to do batch mode recognition... which should be done with pocketsphinx_batch... which will get a bit of a makeover to be more user-friendly.
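For reference, a sketch of what the equivalent batch run could look like. It assumes the option names of the pocketsphinx_batch that shipped alongside pocketsphinx_continuous (`-adcin`, `-cepdir`, `-cepext`, `-ctl`, `-hyp`); the current code may differ after the makeover mentioned above. The helper `batch_command` and the file names `test.fileids` and `test.hyp` are made up for this example. The code only builds the control-file text and the argument list; it does not run anything.

```python
def batch_command(utt_ids, hmm, lm, dic, wav_dir=".",
                  ctl="test.fileids", hyp="test.hyp"):
    """Return (control_file_text, argv) for decoding WAV files in batch mode.

    utt_ids are file names without the .wav extension, relative to wav_dir.
    """
    ctl_text = "".join(f"{utt}\n" for utt in utt_ids)
    argv = [
        "pocketsphinx_batch",
        "-adcin", "yes",      # input is audio, not precomputed features
        "-cepdir", wav_dir,   # directory containing the audio files
        "-cepext", ".wav",    # extension appended to each control-file entry
        "-ctl", ctl,          # control file listing utterance IDs
        "-hmm", hmm,
        "-lm", lm,
        "-dict", dic,
        "-hyp", hyp,          # recognition results are written here
    ]
    return ctl_text, argv


if __name__ == "__main__":
    text, argv = batch_command(["input"], "en-us/",
                               "TAR9897/9897.lm", "TAR9897/9897.dic")
    print(text, end="")
    print(" ".join(argv))
```

To use it, write the returned control-file text to `test.fileids` and run the printed command in the directory that contains `input.wav`.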
Hi, I would very much like to fix this as well as #252 but I don't have access to a mipsel machine for testing. The best I can do will be to validate it with QEMU, which is hopefully a faithful replication of the architecture.