Getting segmentation fault while generating lm.binary

Open jhachandan1994 opened this issue 5 years ago • 13 comments

Hi, I am getting the error below while generating the lm.binary file:

build/bin/build_binary -T working/training_materials -s trie working/language_models/words.arpa working/language_models/lm.binary
Reading working/language_models/words.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
*************************************************************************Segmentation fault (core dumped)

Dec 19 '19 03:12 jhachandan1994

Have you ever resolved this issue?

Jul 25 '21 19:07 erksch

I have the same problem:

subprocess.CalledProcessError: Command '['/DeepSpeech/native_client/kenlm/build/bin/build_binary', '-a', '255', '-q', '8', '-v', 'trie', '/opt/data/czech-scorer/lm_filtered.arpa', '/opt/data/czech-scorer/lm.binary']' died with <Signals.SIGSEGV: 11>.

Jul 29 '21 12:07 cubase

https://www.softwaretestinghelp.com/how-to-write-good-bug-report/

Jul 29 '21 12:07 kpu

I uploaded an ARPA file that reproduces this issue when calling build_binary on it with the trie option.

https://drive.google.com/file/d/1OVKLK42gzQn4cQUZcxpy2aZJliVEWiKy/view?usp=sharing

Solution for me was to use probing instead of trie.

Jul 29 '21 18:07 erksch

Unable to reproduce.

heafield@meili:~/kenlm/build$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
heafield@meili:~/kenlm/build$ uname -a
Linux meili 5.4.0-74-generic #83-Ubuntu SMP Sat May 8 02:35:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
git clone git@github.com:kpu/kenlm # revision bbf4fc511266c5d4515047055d7bdec659a6e158
cd kenlm/
mkdir build
cd build/
cmake ..
make -j8
bin/build_binary ~/lm_sigseg.arpa delme.bin
bin/build_binary trie ~/lm_sigseg.arpa delme.bin

Relevant output (probing):

Reading /home/heafield/lm_sigseg.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS

Relevant output (trie):

Reading /home/heafield/lm_sigseg.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS

There was previously a segfault on machines that have hugepages manually configured, but that was fixed in master a while ago; try building from this repo rather than the KenLM copy bundled with DeepSpeech.

Jul 29 '21 18:07 kpu

I tried it with KenLM built from source (master). I'll try again and give more information.

Jul 29 '21 18:07 erksch

Could you try this file: https://drive.google.com/file/d/1H6rBHUKQPypepjfqzdclLIqPhwPmhdAg/view?usp=sharing

$ bin/build_binary trie lm_sigseg_2.arpa out.bin

Reading .../lm_sigseg_2.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Segmentation fault (core dumped)
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04 LTS"
$ uname -a
... #1 SMP Fri Dec 27 22:27:31 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Jul 29 '21 19:07 erksch

Similar error for me as well. I created the arpa file from this text file: text.txt

wc text.txt
104 104 685 text.txt
head text.txt 
lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
fusce
non
tail text.txt
mauris
accumsan
auctor
in
ac
finibus
nisi
aenean
commodo
ultricies
git log -1
commit f01e12d83c7fd03ebe6656e0ad6d73a3e022bd50 (HEAD -> master, origin/master, origin/HEAD)
Merge: bbf4fc5 fbb6da7
Author: Kenneth Heafield <[email protected]>
Date:   Tue Nov 2 10:42:46 2021 +0000

    Merge pull request #360 from kkm000/patch-1
    
    Don't replace linker flags set by toolset and user
bin/lmplz --order 3 --text text.txt --arpa text.arpa --discount_fallback
=== 1/5 Counting and sorting n-grams ===
Reading text.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 104 types 80
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:960 2:4639081472 3:8698277888
Substituting fallback discounts for order 0: D1=0.5 D2=1 D3+=1.5
Statistics:
1 80 D1=0.5 D2=1 D3+=1.5
2 154 D1=0.85 D2=1.15 D3+=2.15
3 77 D1=0.710843 D2=1.28916 D3+=2.28916
Memory estimate for binary LM:
type       B
probing 7172 assuming -p 1.5
probing 8420 assuming -r models -p 1.5
trie    3839 without quantization
trie    5785 assuming -q 8 -b 8 quantization 
trie    3836 assuming -a 22 array pointer compression
trie    5782 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:960 2:2464 3:1540
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:960 2:2464 3:1540
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz	VmPeak:13188780 kB	VmRSS:6168 kB	RSSMax:3029460 kB	user:0.360779	sys:0.737594	CPU:1.09848	real:1.17481
bin/build_binary trie text.arpa text.binary
Reading text.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Segmentation fault (core dumped)

Interestingly, it works if I remove the last line/word from my text file - "ultricies" OR if I add one more line - "dapibus".

Nov 29 '21 18:11 locmene

Hi, I compiled with debug mode and got the following stack trace after running:

build_binary trie lm_filtered.arpa lm.binary

Program received signal SIGSEGV, Segmentation fault.
0x0000556b8d3833fe in util::FreePool::Allocate (this=0x7ffcc32233e0)
    at /opt/kenlm/lm/../util/pool.hh:95
95	        free_list_ = *reinterpret_cast<void**>(free_list_);
(gdb) bt
#0  0x0000556b8d3833fe in util::FreePool::Allocate (this=0x7ffcc32233e0)
    at /opt/kenlm/lm/../util/pool.hh:95
#1  0x0000556b8d383645 in util::ValueBlock::ValueBlock (this=0x7ffcc32228f0, from=...)
    at /opt/kenlm/lm/../util/sized_iterator.hh:62
#2  0x0000556b8d39c922 in std::__adjust_heap<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, util::ValueBlock, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., 
    __holeIndex=59, __len=62, __value=..., __comp=...) at /usr/include/c++/7/bits/stl_heap.h:237
#3  0x0000556b8d39c1c5 in std::__make_heap<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=..., __comp=...)
    at /usr/include/c++/7/bits/stl_heap.h:342
#4  0x0000556b8d39b8f6 in std::__heap_select<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __middle=..., __last=..., 
    __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1672
#5  0x0000556b8d39b421 in std::__partial_sort<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __middle=..., __last=..., 
    __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1933
#6  0x0000556b8d39b248 in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=..., 
    __depth_limit=0, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1948
#7  0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=..., 
    __depth_limit=1, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#8  0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=..., 
    __depth_limit=3, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#9  0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=..., 
    __depth_limit=5, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#10 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=..., 
    __depth_limit=7, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#11 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=..., 
    __depth_limit=9, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#12 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=..., 
    __depth_limit=11, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#13 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=..., 
    __depth_limit=13, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#14 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=..., 
    __depth_limit=14, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#15 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=..., 
    __depth_limit=15, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#16 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=..., 
    __depth_limit=16, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#17 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=..., 
    __depth_limit=17, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#18 0x0000556b8d39b0ed in std::__sort<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=..., __comp=...)
    at /usr/include/c++/7/bits/stl_algo.h:1968
#19 0x0000556b8d39a84e in std::sort<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > (__first=..., __last=..., __comp=...) at /usr/include/c++/7/bits/stl_algo.h:4868
#20 0x0000556b8d399195 in lm::ngram::trie::(anonymous namespace)::WriteContextFile (
    begin=0x556b8e971020 "\001", end=0x556b8e974ea0 "", temp_prefix="lrw_language_model/lm.binary", 
    entry_size=16, order=2 '\002') at /opt/kenlm/lm/trie_sort.cc:104
#21 0x0000556b8d39a4c9 in lm::ngram::trie::SortedFiles::ConvertToSorted (this=0x7ffcc3223740, f=..., 
    vocab=..., counts=std::vector of length 5, capacity 8 = {...}, 
    file_prefix="lrw_language_model/lm.binary", order=2 '\002', warn=..., mem=0x556b8e971020, 
    mem_size=16000) at /opt/kenlm/lm/trie_sort.cc:284
#22 0x0000556b8d399dc2 in lm::ngram::trie::SortedFiles::SortedFiles (this=0x7ffcc3223740, config=..., 
    f=..., counts=std::vector of length 5, capacity 8 = {...}, buffer=16000, 
    file_prefix="lrw_language_model/lm.binary", vocab=...) at /opt/kenlm/lm/trie_sort.cc:229
#23 0x0000556b8d385195 in lm::ngram::trie::TrieSearch<lm::ngram::SeparatelyQuantize, lm::ngram::trie::ArrayBhiksha>::InitializeFromARPA (this=0x7ffcc3224490, 
    file=0x7ffcc3224d46 "lrw_language_model/lm_filtered.arpa", f=..., 
    counts=std::vector of length 5, capacity 8 = {...}, config=..., vocab=..., backing=...)
    at /opt/kenlm/lm/search_trie.cc:585
#24 0x0000556b8d363656 in lm::ngram::detail::GenericModel<lm::ngram::trie::TrieSearch<lm::ngram::SeparatelyQuantize, lm::ngram::trie::ArrayBhiksha>, lm::ngram::SortedVocabulary>::InitializeFromARPA (
    this=0x7ffcc3224310, fd=3, file=0x7ffcc3224d46 "lrw_language_model/lm_filtered.arpa", config=...)
    at /opt/kenlm/lm/model.cc:118
#25 0x0000556b8d362204 in lm::ngram::detail::GenericModel<lm::ngram::trie::TrieSearch<lm::ngram::SeparatelyQuantize, lm::ngram::trie::ArrayBhiksha>, lm::ngram::SortedVocabulary>::GenericModel (
    this=0x7ffcc3224310, file=0x7ffcc3224d46 "lrw_language_model/lm_filtered.arpa", init_config=...)
    at /opt/kenlm/lm/model.cc:76
#26 0x0000556b8d3554f3 in lm::ngram::QuantArrayTrieModel::QuantArrayTrieModel (this=0x7ffcc3224310, 
    file=0x7ffcc3224d46 "lrw_language_model/lm_filtered.arpa", config=...)
    at /opt/kenlm/lm/model.hh:141
#27 0x0000556b8d354799 in main (argc=9, argv=0x7ffcc3224a78) at /opt/kenlm/lm/build_binary_main.cc:215

Dec 08 '21 14:12 Domhnall-Liopa

Just a hack that got it working for my use case: if you change the Allocate method at https://github.com/kpu/kenlm/blob/f01e12d83c7fd03ebe6656e0ad6d73a3e022bd50/util/pool.hh#L92 to

    void *Allocate() {
        return backing_.Allocate(element_size_);
    }

and recompile, it works.

Dec 09 '21 10:12 Domhnall-Liopa

Does 5cea457 fix this?

@Domhnall-Liopa Thanks for the debugging tip! That will cause a memory leak though so I'm hesitant to use that.
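
For readers following along, here is a minimal sketch of why skipping the free list makes memory grow. It is not KenLM's util::FreePool: the class SketchFreePool, the helper BackingAllocationsAfter, the reuse_freed switch, and the use of malloc as the backing store are all hypothetical, chosen only to illustrate the idea. With reuse_freed set to false, Allocate behaves like the hack above and always takes fresh memory, so freed blocks are never handed out again until the pool is destroyed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Illustrative fixed-size pool with an intrusive free list.
// This is NOT KenLM's util::FreePool; all names here are hypothetical.
class SketchFreePool {
 public:
  SketchFreePool(std::size_t element_size, bool reuse_freed)
      : element_size_(element_size), reuse_freed_(reuse_freed) {
    // A freed block must be large enough to hold the "next" pointer.
    assert(element_size_ >= sizeof(void*));
  }

  ~SketchFreePool() {
    for (void *block : blocks_) std::free(block);
  }

  SketchFreePool(const SketchFreePool&) = delete;
  SketchFreePool &operator=(const SketchFreePool&) = delete;

  void *Allocate() {
    if (reuse_freed_ && free_list_) {
      void *ret = free_list_;
      // Pop the head: the first bytes of a freed block store the next free block.
      free_list_ = *reinterpret_cast<void**>(free_list_);
      return ret;
    }
    // No reusable block (or reuse disabled): take fresh memory from the backing store.
    void *fresh = std::malloc(element_size_);
    if (!fresh) std::abort();
    blocks_.push_back(fresh);
    return fresh;
  }

  void Free(void *ptr) {
    // Push onto the free list by writing the old head into the freed block.
    *reinterpret_cast<void**>(ptr) = free_list_;
    free_list_ = ptr;
  }

  std::size_t BackingAllocations() const { return blocks_.size(); }

 private:
  void *free_list_ = nullptr;
  std::size_t element_size_;
  bool reuse_freed_;
  std::vector<void*> blocks_;  // every block ever taken from the backing store
};

// Allocate and immediately free one element `rounds` times, then report
// how many blocks had to come from the backing store.
std::size_t BackingAllocationsAfter(std::size_t rounds, bool reuse_freed) {
  SketchFreePool pool(16, reuse_freed);
  for (std::size_t i = 0; i < rounds; ++i) {
    void *p = pool.Allocate();
    pool.Free(p);
  }
  return pool.BackingAllocations();
}
```

In this sketch, BackingAllocationsAfter(1000, true) needs a single backing block because each freed block is reused, while BackingAllocationsAfter(1000, false) takes 1000 blocks. A sort that copies many temporary values through such a pool would see its memory use grow in the same way.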

Dec 09 '21 11:12 kpu

@kpu Yes, 5cea457 fixes it for me! Thanks!

Dec 09 '21 11:12 locmene

@kpu Yes, that fixes it. Thanks very much.

Dec 09 '21 11:12 Domhnall-Liopa