
pocketsphinx_continuous is unable to convert speech to text

Open ghost opened this issue 4 years ago • 11 comments

Hi all, I recently compiled PocketSphinx for a MIPS architecture, recorded 16-bit, 16000 Hz mono audio, and then tried to run pocketsphinx_continuous with the following command:

pocketsphinx_continuous -hmm en-us/ -lm TAR9897/9897.lm -dict TAR9897/9897.dic -infile input.wav

However, I get no recognized text from the audio, although the same command works on my laptop. Please help.
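Before comparing the two machines, it can help to confirm that input.wav really has the format described above. The following is a minimal sketch using Python's standard wave module; the function name check_wav is made up for illustration, and it assumes Python is available somewhere the file can be read (for example on the laptop, against the same file that was copied to the device).

```python
import wave


def check_wav(path, rate=16000, channels=1, sample_bytes=2):
    """Return a list of mismatches between the WAV header and the expected format."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {rate}")
        if wav.getnchannels() != channels:
            problems.append(f"{wav.getnchannels()} channels, expected {channels}")
        if wav.getsampwidth() != sample_bytes:
            problems.append(
                f"{8 * wav.getsampwidth()}-bit samples, expected {8 * sample_bytes}"
            )
        if wav.getnframes() == 0:
            problems.append("file contains no audio frames")
    return problems
```

Calling check_wav("input.wav") returns an empty list when the header matches 16-bit, 16000 Hz mono; anything else is listed as a human-readable mismatch.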

I am sharing the logs below

INFO: pocketsphinx.c(153): Parsed model-specific feature parameters from en-us//feat.params
Current configuration:
[NAME]			[DEFLT]		[VALUE]
-agc			none		none
-agcthresh		2.0		2.000000e+00
-allphone				
-allphone_ci		yes		yes
-alpha			0.97		9.700000e-01
-ascale			20.0		2.000000e+01
-aw			1		1
-backtrace		no		no
-beam			1e-48		1.000000e-48
-bestpath		yes		yes
-bestpathlw		9.5		9.500000e+00
-ceplen			13		13
-cmn			live		batch
-cmninit		40,3,-1		41.00,-5.29,-0.12,5.09,2.48,-4.07,-1.37,-1.78,-5.08,-2.05,-6.45,-1.42,1.17
-compallsen		no		no
-dict					TAR9897/9897.dic
-dictcase		no		no
-dither			no		no
-doublebw		no		no
-ds			1		1
-fdict					
-feat			1s_c_d_dd	1s_c_d_dd
-featparams				
-fillprob		1e-8		1.000000e-08
-frate			100		100
-fsg					
-fsgusealtpron		yes		yes
-fsgusefiller		yes		yes
-fwdflat		yes		yes
-fwdflatbeam		1e-64		1.000000e-64
-fwdflatefwid		4		4
-fwdflatlw		8.5		8.500000e+00
-fwdflatsfwin		25		25
-fwdflatwbeam		7e-29		7.000000e-29
-fwdtree		yes		yes
-hmm					en-us/
-input_endian		little		little
-jsgf					
-keyphrase				
-kws					
-kws_delay		10		10
-kws_plp		1e-1		1.000000e-01
-kws_threshold		1e-30		1.000000e-30
-latsize		5000		5000
-lda					
-ldadim			0		0
-lifter			0		22
-lm					TAR9897/9897.lm
-lmctl					
-lmname					
-logbase		1.0001		1.000100e+00
-logfn					
-logspec		no		no
-lowerf			133.33334	1.300000e+02
-lpbeam			1e-40		1.000000e-40
-lponlybeam		7e-29		7.000000e-29
-lw			6.5		6.500000e+00
-maxhmmpf		30000		30000
-maxwpf			-1		-1
-mdef					
-mean					
-mfclogdir				
-min_endfr		0		0
-mixw					
-mixwfloor		0.0000001	1.000000e-07
-mllr					
-mmap			yes		yes
-ncep			13		13
-nfft			512		512
-nfilt			40		25
-nwpen			1.0		1.000000e+00
-pbeam			1e-48		1.000000e-48
-pip			1.0		1.000000e+00
-pl_beam		1e-10		1.000000e-10
-pl_pbeam		1e-10		1.000000e-10
-pl_pip			1.0		1.000000e+00
-pl_weight		3.0		3.000000e+00
-pl_window		5		5
-rawlogdir				
-remove_dc		no		no
-remove_noise		yes		yes
-remove_silence		yes		yes
-round_filters		yes		yes
-samprate		16000		1.600000e+04
-seed			-1		-1
-sendump				
-senlogdir				
-senmgau				
-silprob		0.005		5.000000e-03
-smoothspec		no		no
-svspec					0-12/13-25/26-38
-tmat					
-tmatfloor		0.0001		1.000000e-04
-topn			4		4
-topn_beam		0		0
-toprule				
-transform		legacy		dct
-unit_area		yes		yes
-upperf			6855.4976	6.800000e+03
-uw			1.0		1.000000e+00
-vad_postspeech		50		50
-vad_prespeech		20		20
-vad_startspeech	10		10
-vad_threshold		3.0		3.000000e+00
-var					
-varfloor		0.0001		1.000000e-04
-varnorm		no		no
-verbose		no		no
-warp_params				
-warp_type		inverse_linear	inverse_linear
-wbeam			7e-29		7.000000e-29
-wip			0.65		6.500000e-01
-wlen			0.025625	2.562500e-02

INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='batch', VARNORM='no', AGC='none'
INFO: acmod.c(162): Using subvector specification 0-12/13-25/26-38
INFO: mdef.c(518): Reading model definition: en-us//mdef
INFO: mdef.c(531): Found byte-order mark BMDF, assuming this is a binary mdef file
INFO: bin_mdef.c(337): Reading binary model definition: en-us//mdef
INFO: bin_mdef.c(517): 42 CI-phone, 137053 CD-phone, 3 emitstate/phone, 126 CI-sen, 5126 Sen, 29324 Sen-Seq
INFO: tmat.c(149): Reading HMM transition probability matrices: en-us//transition_matrices
INFO: acmod.c(113): Attempting to use PTM computation module
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: en-us//means
INFO: ms_gauden.c(242): 42 codebook, 3 feature, size: 
INFO: ms_gauden.c(244):  128x13
INFO: ms_gauden.c(244):  128x13
INFO: ms_gauden.c(244):  128x13
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: en-us//variances
INFO: ms_gauden.c(242): 42 codebook, 3 feature, size: 
INFO: ms_gauden.c(244):  128x13
INFO: ms_gauden.c(244):  128x13
INFO: ms_gauden.c(244):  128x13
INFO: ms_gauden.c(304): 222 variance values floored
INFO: ptm_mgau.c(475): Loading senones from dump file en-us//sendump
INFO: ptm_mgau.c(499): BEGIN FILE FORMAT DESCRIPTION
INFO: ptm_mgau.c(562): Rows: 128, Columns: 5126
INFO: ptm_mgau.c(594): Using memory-mapped I/O for senones
INFO: ptm_mgau.c(837): Maximum top-N: 4
INFO: phone_loop_search.c(114): State beam -225 Phone exit beam -225 Insertion penalty 0
INFO: dict.c(320): Allocating 4130 * 20 bytes (80 KiB) for word entries
INFO: dict.c(333): Reading main dictionary: TAR9897/9897.dic
INFO: dict.c(213): Dictionary size 29, allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(336): 29 words read
INFO: dict.c(358): Reading filler dictionary: en-us//noisedict
INFO: dict.c(213): Dictionary size 34, allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(361): 5 words read
INFO: dict2pid.c(396): Building PID tables for dictionary
INFO: dict2pid.c(406): Allocating 42^3 * 2 bytes (144 KiB) for word-initial triphones
INFO: dict2pid.c(132): Allocated 21336 bytes (20 KiB) for word-final triphones
INFO: dict2pid.c(196): Allocated 21336 bytes (20 KiB) for single-phone word triphones
INFO: ngram_model_trie.c(354): Trying to read LM in trie binary format
INFO: ngram_model_trie.c(365): Header doesn't match
INFO: ngram_model_trie.c(177): Trying to read LM in arpa format
INFO: ngram_model_trie.c(193): LM of order 3
INFO: ngram_model_trie.c(195): #1-grams: 26
INFO: ngram_model_trie.c(195): #2-grams: 41
INFO: ngram_model_trie.c(195): #3-grams: 36
INFO: lm_trie.c(474): Training quantizer
INFO: lm_trie.c(482): Building LM trie
INFO: ngram_search_fwdtree.c(74): Initializing search tree
INFO: ngram_search_fwdtree.c(101): 26 unique initial diphones
INFO: ngram_search_fwdtree.c(186): Creating search channels
INFO: ngram_search_fwdtree.c(323): Max nonroot chan increased to 202
INFO: ngram_search_fwdtree.c(333): Created 26 root, 74 non-root channels, 5 single-phone words
INFO: ngram_search_fwdflat.c(157): fwdflat: min_ef_width = 4, max_sf_win = 25
INFO: continuous.c(307): pocketsphinx_continuous COMPILED ON: Mar 10 2020, AT: 16:01:07

INFO: cmn_live.c(120): Update from < 41.00 -5.29 -0.12  5.09  2.48 -4.07 -1.37 -1.78 -5.08 -2.05 -6.45 -1.42  1.17 >
INFO: cmn_live.c(138): Update to   < 43.50 12.27  5.78  3.19 -3.29  0.34 -5.26 -11.33  4.52  0.05 -2.95 11.94 -3.49 >
INFO: ngram_search_fwdtree.c(1550):      525 words recognized (6/fr)
INFO: ngram_search_fwdtree.c(1552):    27719 senones evaluated (292/fr)
INFO: ngram_search_fwdtree.c(1556):    15927 channels searched (167/fr), 2366 1st, 9410 last
INFO: ngram_search_fwdtree.c(1559):      724 words for which last channels evaluated (7/fr)
INFO: ngram_search_fwdtree.c(1561):      876 candidate words for entering last phone (9/fr)
INFO: ngram_search_fwdtree.c(1564): fwdtree 1.34 CPU 1.406 xRT
INFO: ngram_search_fwdtree.c(1567): fwdtree 1.37 wall 1.446 xRT
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 10 words
INFO: ngram_search_fwdflat.c(948):      547 words recognized (6/fr)
INFO: ngram_search_fwdflat.c(950):    13371 senones evaluated (141/fr)
INFO: ngram_search_fwdflat.c(952):    11350 channels searched (119/fr)
INFO: ngram_search_fwdflat.c(954):      879 words searched (9/fr)
INFO: ngram_search_fwdflat.c(957):      312 word transitions (3/fr)
INFO: ngram_search_fwdflat.c(960): fwdflat 0.45 CPU 0.474 xRT
INFO: ngram_search_fwdflat.c(963): fwdflat 0.48 wall 0.501 xRT
INFO: ngram_search.c(1250): lattice start node <s>.0 end node </s>.88
INFO: ngram_search.c(1276): Eliminated 0 nodes before end node
INFO: ngram_search.c(1381): Lattice has 211 nodes, 433 links
INFO: ps_lattice.c(1376): Bestpath score: -1763
INFO: ps_lattice.c(1380): Normalizer P(O) = alpha(</s>:88:93) = -105689
INFO: ps_lattice.c(1437): Joint P(O,S) = -110043 P(S|O) = -4354
INFO: ngram_search.c(872): bestpath 0.01 CPU 0.006 xRT
INFO: ngram_search.c(875): bestpath 0.01 wall 0.008 xRT

INFO: ngram_search_fwdtree.c(429): TOTAL fwdtree 1.34 CPU 1.421 xRT
INFO: ngram_search_fwdtree.c(432): TOTAL fwdtree 1.37 wall 1.461 xRT
INFO: ngram_search_fwdflat.c(176): TOTAL fwdflat 0.45 CPU 0.479 xRT
INFO: ngram_search_fwdflat.c(179): TOTAL fwdflat 0.48 wall 0.507 xRT
INFO: ngram_search.c(303): TOTAL bestpath 0.01 CPU 0.006 xRT
INFO: ngram_search.c(306): TOTAL bestpath 0.01 wall 0.008 xRT
 

ghost • Mar 12 '20 07:03

Is the MIPS target big-endian or little-endian?
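One way to answer this, assuming a Python interpreter is available on the target, is to print the host byte order and look at the container magic of the audio file. This is only a sketch: the function name container_byte_order is hypothetical, and it relies on the convention that a RIFF file is little-endian while the rarer RIFX variant is big-endian.

```python
import sys


def container_byte_order(path):
    """Classify a WAV-like file by its 4-byte magic: RIFF is little-endian, RIFX big-endian."""
    with open(path, "rb") as f:
        magic = f.read(4)
    if magic == b"RIFF":
        return "little"
    if magic == b"RIFX":
        return "big"
    return "unknown"


def host_byte_order():
    """Return 'little' or 'big' for the machine running this script."""
    return sys.byteorder
```

If host_byte_order() and container_byte_order("input.wav") both report "little", the audio container itself is unlikely to be the source of a byte-order mismatch.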

nshmyrev • Mar 12 '20 07:03

Hi @nshmyrev, this is my other GitHub account. The MIPS architecture on our device is little-endian. I also tried running another command:

pocketsphinx_continuous  -hmm en-us/ -lm TAR9897/9897.lm  -dict TAR9897/9897.dic  -adcdev sysdefault  -inmic Yes


I got the following logs

INFO: pocketsphinx.c(153): Parsed model-specific feature parameters from en-us//feat.params
Current configuration:
[NAME]			[DEFLT]		[VALUE]
-agc			none		none
-agcthresh		2.0		2.000000e+00
-allphone				
-allphone_ci		yes		yes
-alpha			0.97		9.700000e-01
-ascale			20.0		2.000000e+01
-aw			1		1
-backtrace		no		no
-beam			1e-48		1.000000e-48
-bestpath		yes		yes
-bestpathlw		9.5		9.500000e+00
-ceplen			13		13
-cmn			live		batch
-cmninit		40,3,-1		41.00,-5.29,-0.12,5.09,2.48,-4.07,-1.37,-1.78,-5.08,-2.05,-6.45,-1.42,1.17
-compallsen		no		no
-dict					TAR9897/9897.dic
-dictcase		no		no
-dither			no		no
-doublebw		no		no
-ds			1		1
-fdict					
-feat			1s_c_d_dd	1s_c_d_dd
-featparams				
-fillprob		1e-8		1.000000e-08
-frate			100		100
-fsg					
-fsgusealtpron		yes		yes
-fsgusefiller		yes		yes
-fwdflat		yes		yes
-fwdflatbeam		1e-64		1.000000e-64
-fwdflatefwid		4		4
-fwdflatlw		8.5		8.500000e+00
-fwdflatsfwin		25		25
-fwdflatwbeam		7e-29		7.000000e-29
-fwdtree		yes		yes
-hmm					en-us/
-input_endian		little		little
-jsgf					
-keyphrase				
-kws					
-kws_delay		10		10
-kws_plp		1e-1		1.000000e-01
-kws_threshold		1e-30		1.000000e-30
-latsize		5000		5000
-lda					
-ldadim			0		0
-lifter			0		22
-lm					TAR9897/9897.lm
-lmctl					
-lmname					
-logbase		1.0001		1.000100e+00
-logfn					
-logspec		no		no
-lowerf			133.33334	1.300000e+02
-lpbeam			1e-40		1.000000e-40
-lponlybeam		7e-29		7.000000e-29
-lw			6.5		6.500000e+00
-maxhmmpf		30000		30000
-maxwpf			-1		-1
-mdef					
-mean					
-mfclogdir				
-min_endfr		0		0
-mixw					
-mixwfloor		0.0000001	1.000000e-07
-mllr					
-mmap			yes		yes
-ncep			13		13
-nfft			512		512
-nfilt			40		25
-nwpen			1.0		1.000000e+00
-pbeam			1e-48		1.000000e-48
-pip			1.0		1.000000e+00
-pl_beam		1e-10		1.000000e-10
-pl_pbeam		1e-10		1.000000e-10
-pl_pip			1.0		1.000000e+00
-pl_weight		3.0		3.000000e+00
-pl_window		5		5
-rawlogdir				
-remove_dc		no		no
-remove_noise		yes		yes
-remove_silence		yes		yes
-round_filters		yes		yes
-samprate		16000		1.600000e+04
-seed			-1		-1
-sendump				
-senlogdir				
-senmgau				
-silprob		0.005		5.000000e-03
-smoothspec		no		no
-svspec					0-12/13-25/26-38
-tmat					
-tmatfloor		0.0001		1.000000e-04
-topn			4		4
-topn_beam		0		0
-toprule				
-transform		legacy		dct
-unit_area		yes		yes
-upperf			6855.4976	6.800000e+03
-uw			1.0		1.000000e+00
-vad_postspeech		50		50
-vad_prespeech		20		20
-vad_startspeech	10		10
-vad_threshold		3.0		3.000000e+00
-var					
-varfloor		0.0001		1.000000e-04
-varnorm		no		no
-verbose		no		no
-warp_params				
-warp_type		inverse_linear	inverse_linear
-wbeam			7e-29		7.000000e-29
-wip			0.65		6.500000e-01
-wlen			0.025625	2.562500e-02

INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='batch', VARNORM='no', AGC='none'
INFO: acmod.c(162): Using subvector specification 0-12/13-25/26-38
INFO: mdef.c(518): Reading model definition: en-us//mdef
INFO: mdef.c(531): Found byte-order mark BMDF, assuming this is a binary mdef file
INFO: bin_mdef.c(337): Reading binary model definition: en-us//mdef
INFO: bin_mdef.c(517): 42 CI-phone, 137053 CD-phone, 3 emitstate/phone, 126 CI-sen, 5126 Sen, 29324 Sen-Seq
INFO: tmat.c(149): Reading HMM transition probability matrices: en-us//transition_matrices
INFO: acmod.c(113): Attempting to use PTM computation module
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: en-us//means
INFO: ms_gauden.c(242): 42 codebook, 3 feature, size: 
INFO: ms_gauden.c(244):  128x13
INFO: ms_gauden.c(244):  128x13
INFO: ms_gauden.c(244):  128x13
INFO: ms_gauden.c(127): Reading mixture gaussian parameter: en-us//variances
INFO: ms_gauden.c(242): 42 codebook, 3 feature, size: 
INFO: ms_gauden.c(244):  128x13
INFO: ms_gauden.c(244):  128x13
INFO: ms_gauden.c(244):  128x13
INFO: ms_gauden.c(304): 222 variance values floored
INFO: ptm_mgau.c(475): Loading senones from dump file en-us//sendump
INFO: ptm_mgau.c(499): BEGIN FILE FORMAT DESCRIPTION
INFO: ptm_mgau.c(562): Rows: 128, Columns: 5126
INFO: ptm_mgau.c(594): Using memory-mapped I/O for senones
INFO: ptm_mgau.c(837): Maximum top-N: 4
INFO: phone_loop_search.c(114): State beam -225 Phone exit beam -225 Insertion penalty 0
INFO: dict.c(320): Allocating 4130 * 20 bytes (80 KiB) for word entries
INFO: dict.c(333): Reading main dictionary: TAR9897/9897.dic
INFO: dict.c(213): Dictionary size 29, allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(336): 29 words read
INFO: dict.c(358): Reading filler dictionary: en-us//noisedict
INFO: dict.c(213): Dictionary size 34, allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(361): 5 words read
INFO: dict2pid.c(396): Building PID tables for dictionary
INFO: dict2pid.c(406): Allocating 42^3 * 2 bytes (144 KiB) for word-initial triphones
INFO: dict2pid.c(132): Allocated 21336 bytes (20 KiB) for word-final triphones
INFO: dict2pid.c(196): Allocated 21336 bytes (20 KiB) for single-phone word triphones
INFO: ngram_model_trie.c(354): Trying to read LM in trie binary format
INFO: ngram_model_trie.c(365): Header doesn't match
INFO: ngram_model_trie.c(177): Trying to read LM in arpa format
INFO: ngram_model_trie.c(193): LM of order 3
INFO: ngram_model_trie.c(195): #1-grams: 26
INFO: ngram_model_trie.c(195): #2-grams: 41
INFO: ngram_model_trie.c(195): #3-grams: 36
INFO: lm_trie.c(474): Training quantizer
INFO: lm_trie.c(482): Building LM trie
INFO: ngram_search_fwdtree.c(74): Initializing search tree
INFO: ngram_search_fwdtree.c(101): 26 unique initial diphones
INFO: ngram_search_fwdtree.c(186): Creating search channels
INFO: ngram_search_fwdtree.c(323): Max nonroot chan increased to 202
INFO: ngram_search_fwdtree.c(333): Created 26 root, 74 non-root channels, 5 single-phone words
INFO: ngram_search_fwdflat.c(157): fwdflat: min_ef_width = 4, max_sf_win = 25
INFO: continuous.c(307): pocketsphinx_continuous COMPILED ON: Mar 10 2020, AT: 16:01:07

INFO: continuous.c(252): Ready....
Input overrun, read calls are too rare (non-fatal)
Input overrun, read calls are too rare (non-fatal)
Input overrun, read calls are too rare (non-fatal)
Input overrun, read calls are too rare (non-fatal)
Input overrun, read calls are too rare (non-fatal)

WilliamVJacob • Mar 12 '20 12:03

Try recognizing a file with a JSGF grammar.
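For a small command vocabulary, such a grammar can be a single rule listing the allowed phrases. The sketch below generates one in Python; the function name build_jsgf, the rule name and the example phrases are all hypothetical, and every word used must also appear in the file passed with -dict.

```python
def build_jsgf(name, phrases):
    """Build a JSGF grammar whose single public rule matches any one of the phrases."""
    alternatives = " | ".join(phrases)
    return (
        "#JSGF V1.0;\n"
        f"grammar {name};\n"
        f"public <command> = {alternatives} ;\n"
    )


if __name__ == "__main__":
    # Hypothetical command phrases, not taken from the reporter's dictionary.
    print(build_jsgf("commands", ["turn on", "turn off", "volume up"]))
```

Saved to a file such as commands.gram, the grammar would be passed with the -jsgf option listed in the configuration dump in place of -lm, for example: pocketsphinx_continuous -hmm en-us/ -jsgf commands.gram -dict TAR9897/9897.dic -infile input.wav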

nshmyrev • Mar 12 '20 12:03

As for the second log, decoding is too slow on your device. It probably has a very small CPU, and the model needs optimization.
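The timing lines in the first log give a rough way to quantify this. Assuming xRT means CPU time divided by the duration of the audio, the three passes of that run add up to about 1.9 times real time (1.406 + 0.474 + 0.006), which would fit the "Input overrun" messages seen with the microphone. A minimal Python sketch that extracts these numbers from a saved log follows; the function name realtime_factor and the regular expression are illustrative and not part of PocketSphinx.

```python
import re

# Matches per-utterance lines such as "...(1564): fwdtree 1.34 CPU 1.406 xRT".
# Lines starting with "TOTAL" and the "wall" variants are deliberately skipped.
TIMING = re.compile(r"\): (fwdtree|fwdflat|bestpath) ([\d.]+) CPU ([\d.]+) xRT")

SAMPLE = """\
INFO: ngram_search_fwdtree.c(1564): fwdtree 1.34 CPU 1.406 xRT
INFO: ngram_search_fwdflat.c(960): fwdflat 0.45 CPU 0.474 xRT
INFO: ngram_search.c(872): bestpath 0.01 CPU 0.006 xRT
"""


def realtime_factor(log_text):
    """Sum the CPU seconds and xRT values of all search passes found in the log."""
    cpu = xrt = 0.0
    for _pass_name, cpu_seconds, factor in TIMING.findall(log_text):
        cpu += float(cpu_seconds)
        xrt += float(factor)
    return cpu, xrt


if __name__ == "__main__":
    cpu, xrt = realtime_factor(SAMPLE)
    print(f"{cpu:.2f} s CPU, {xrt:.2f}x real time")
```

A combined factor above 1.0 means decoding takes longer than the audio lasts, so live capture will fall behind unless the search is made cheaper.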

nshmyrev • Mar 12 '20 12:03

Thanks for the prompt response. Can you give me a little more detail on optimizing this code?

WilliamVJacob • Mar 13 '20 11:03

@WilliamVJacob sure, describe in more detail the application you want to build and the hardware you have in mind.

nshmyrev • Mar 13 '20 11:03

I want to run PocketSphinx on an embedded device with 2 USB mics, 256 RAM, and 128 MB flash. The processor is a single-core ALI M3627-AL. I want to build an application where the user says some commands and it converts them into text. I have already prepared the dictionary and language model and tested them on a laptop, but it's not working on the embedded device.

WilliamVJacob • Mar 16 '20 05:03

Hi, pocketsphinx_continuous no longer exists, but it seems we still have some issues with the mipsel architecture. If you are able to try the current code (with pocketsphinx_batch), I would be interested to see the results.

dhdaines • Jun 13 '22 11:06

Why is pocketsphinx_continuous not available anymore?

zavalyshyn • Jun 13 '22 23:06

No more audio library. If there's a reasonable use case for pocketsphinx_continuous we could reimplement it for specific platforms, and there will certainly be a test (there is already I think) and example code for the continuous listening API. Were people using pocketsphinx_continuous to implement actual applications?

dhdaines • Jun 13 '22 23:06

Also you'll note that the original reporter is using pocketsphinx_continuous to do batch mode recognition... which should be done with pocketsphinx_batch ... which will get a bit of a makeover to be more user-friendly.
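As a rough illustration of the batch route, the sketch below writes a control file listing the recordings in a directory and assembles a pocketsphinx_batch command line. The helper name batch_command is hypothetical, and the options (-adcin, -cepdir, -cepext, -ctl, -hyp) are the ones used in older CMUSphinx tutorials; they may differ in newer releases, so check the tool's help output before relying on them.

```python
import os


def batch_command(wav_dir, ctl_path, hyp_path, hmm, lm, dictionary):
    """Write a control file for every .wav in wav_dir and return a pocketsphinx_batch argv."""
    # Control-file entries are file names without the extension, one per line.
    ids = sorted(
        os.path.splitext(name)[0]
        for name in os.listdir(wav_dir)
        if name.endswith(".wav")
    )
    with open(ctl_path, "w") as ctl:
        ctl.write("".join(file_id + "\n" for file_id in ids))
    return [
        "pocketsphinx_batch",
        "-adcin", "yes",      # input is audio rather than precomputed features
        "-cepdir", wav_dir,   # directory that holds the audio files
        "-cepext", ".wav",    # extension appended to each control-file entry
        "-ctl", ctl_path,
        "-hmm", hmm,
        "-lm", lm,
        "-dict", dictionary,
        "-hyp", hyp_path,     # recognized text is written to this file
    ]
```

With the paths from the original report, the returned list could be passed to subprocess.run, for example batch_command("wav", "test.fileids", "test.hyp", "en-us/", "TAR9897/9897.lm", "TAR9897/9897.dic"), where the wav directory and the two output file names are placeholders.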

dhdaines • Jun 13 '22 23:06

Hi, I would very much like to fix this as well as #252 but I don't have access to a mipsel machine for testing. The best I can do will be to validate it with QEMU, which is hopefully a faithful replication of the architecture.

dhdaines • Sep 28 '22 12:09