kenlm
Segmentation fault while generating lm.binary
Hi, I am getting the error below while generating the lm.binary file:
build/bin/build_binary -T working/training_materials -s trie working/language_models/words.arpa working/language_models/lm.binary
Reading working/language_models/words.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
*************************************************************************Segmentation fault (core dumped)
Have you resolved this issue?
I have the same problem:
subprocess.CalledProcessError: Command '['/DeepSpeech/native_client/kenlm/build/bin/build_binary', '-a', '255', '-q', '8', '-v', 'trie', '/opt/data/czech-scorer/lm_filtered.arpa', '/opt/data/czech-scorer/lm.binary']' died with <Signals.SIGSEGV: 11>.
https://www.softwaretestinghelp.com/how-to-write-good-bug-report/
I uploaded an ARPA file that reproduces this issue when calling build_binary on it with the trie option.
https://drive.google.com/file/d/1OVKLK42gzQn4cQUZcxpy2aZJliVEWiKy/view?usp=sharing
The solution for me was to use probing instead of trie.
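The workaround above can be scripted. What follows is a minimal sketch, not part of KenLM or DeepSpeech: the function names `build_command` and `build_with_fallback` are hypothetical, and the argument order (`build_binary <type> <input.arpa> <output>`) is taken from the commands quoted in this thread. It assumes a POSIX system, where Python's `subprocess` reports a child killed by a signal as a negative return code. The trie-specific options from the failing command (`-a`, `-q`) are deliberately left out.

```python
import signal
import subprocess
from typing import Callable, List


def build_command(binary: str, structure: str, arpa: str, out: str) -> List[str]:
    """Assemble a build_binary call in the form used in this thread."""
    return [binary, structure, arpa, out]


def build_with_fallback(
    binary: str,
    arpa: str,
    out: str,
    run: Callable[..., "subprocess.CompletedProcess"] = subprocess.run,
) -> str:
    """Try the trie structure first; retry with probing only after a SIGSEGV.

    Returns the name of the data structure that was built successfully.
    """
    for structure in ("trie", "probing"):
        result = run(build_command(binary, structure, arpa, out), check=False)
        if result.returncode == 0:
            return structure
        # On POSIX, -N means the child was terminated by signal N.
        if result.returncode != -signal.SIGSEGV:
            raise RuntimeError(
                f"build_binary ({structure}) failed with code {result.returncode}"
            )
    raise RuntimeError("build_binary crashed with both trie and probing")
```

Note that this only hides the crash. A probing model is typically larger than a trie model, so the fallback may not be acceptable where memory is tight.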
Unable to reproduce.
heafield@meili:~/kenlm/build$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
heafield@meili:~/kenlm/build$ uname -a
Linux meili 5.4.0-74-generic #83-Ubuntu SMP Sat May 8 02:35:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
git clone git@github.com:kpu/kenlm # revision bbf4fc511266c5d4515047055d7bdec659a6e158
cd kenlm/
mkdir build
cd build/
cmake ..
make -j8
bin/build_binary ~/lm_sigseg.arpa delme.bin
bin/build_binary trie ~/lm_sigseg.arpa delme.bin
Relevant output (probing):
Reading /home/heafield/lm_sigseg.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
Relevant output (trie):
Reading /home/heafield/lm_sigseg.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
There was previously a segfault on machines that have hugepages manually configured, but that was fixed in master a while ago; try this repo rather than the copy bundled with DeepSpeech.
I tried it with KenLM built from source (master). I'll try again and give more information.
Could you try this file: https://drive.google.com/file/d/1H6rBHUKQPypepjfqzdclLIqPhwPmhdAg/view?usp=sharing
$ bin/build_binary trie lm_sigseg_2.arpa out.bin
Reading .../lm_sigseg_2.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Segmentation fault (core dumped)
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04 LTS"
$ uname -a
... #1 SMP Fri Dec 27 22:27:31 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Similar error for me as well. I created the arpa file from this text file: text.txt
wc text.txt
104 104 685 text.txt
head text.txt
lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
fusce
non
tail text.txt
mauris
accumsan
auctor
in
ac
finibus
nisi
aenean
commodo
ultricies
git log -1
commit f01e12d83c7fd03ebe6656e0ad6d73a3e022bd50 (HEAD -> master, origin/master, origin/HEAD)
Merge: bbf4fc5 fbb6da7
Author: Kenneth Heafield <[email protected]>
Date: Tue Nov 2 10:42:46 2021 +0000
Merge pull request #360 from kkm000/patch-1
Don't replace linker flags set by toolset and user
bin/lmplz --order 3 --text text.txt --arpa text.arpa --discount_fallback
=== 1/5 Counting and sorting n-grams ===
Reading text.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 104 types 80
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:960 2:4639081472 3:8698277888
Substituting fallback discounts for order 0: D1=0.5 D2=1 D3+=1.5
Statistics:
1 80 D1=0.5 D2=1 D3+=1.5
2 154 D1=0.85 D2=1.15 D3+=2.15
3 77 D1=0.710843 D2=1.28916 D3+=2.28916
Memory estimate for binary LM:
type B
probing 7172 assuming -p 1.5
probing 8420 assuming -r models -p 1.5
trie 3839 without quantization
trie 5785 assuming -q 8 -b 8 quantization
trie 3836 assuming -a 22 array pointer compression
trie 5782 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:960 2:2464 3:1540
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:960 2:2464 3:1540
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz VmPeak:13188780 kB VmRSS:6168 kB RSSMax:3029460 kB user:0.360779 sys:0.737594 CPU:1.09848 real:1.17481
bin/build_binary trie text.arpa text.binary
Reading text.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Segmentation fault (core dumped)
Interestingly, it works if I remove the last line/word from my text file - "ultricies" OR if I add one more line - "dapibus".
Hi, I compiled with debug mode and got the following stack trace after running:
build_binary trie lm_filtered.arpa lm.binary
Program received signal SIGSEGV, Segmentation fault.
0x0000556b8d3833fe in util::FreePool::Allocate (this=0x7ffcc32233e0)
at /opt/kenlm/lm/../util/pool.hh:95
95 free_list_ = *reinterpret_cast<void**>(free_list_);
(gdb) bt
#0 0x0000556b8d3833fe in util::FreePool::Allocate (this=0x7ffcc32233e0)
at /opt/kenlm/lm/../util/pool.hh:95
#1 0x0000556b8d383645 in util::ValueBlock::ValueBlock (this=0x7ffcc32228f0, from=...)
at /opt/kenlm/lm/../util/sized_iterator.hh:62
#2 0x0000556b8d39c922 in std::__adjust_heap<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, util::ValueBlock, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=...,
__holeIndex=59, __len=62, __value=..., __comp=...) at /usr/include/c++/7/bits/stl_heap.h:237
#3 0x0000556b8d39c1c5 in std::__make_heap<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=..., __comp=...)
at /usr/include/c++/7/bits/stl_heap.h:342
#4 0x0000556b8d39b8f6 in std::__heap_select<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __middle=..., __last=...,
__comp=...) at /usr/include/c++/7/bits/stl_algo.h:1672
#5 0x0000556b8d39b421 in std::__partial_sort<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __middle=..., __last=...,
__comp=...) at /usr/include/c++/7/bits/stl_algo.h:1933
#6 0x0000556b8d39b248 in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=...,
__depth_limit=0, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1948
#7 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=...,
__depth_limit=1, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#8 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=...,
__depth_limit=3, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#9 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=...,
__depth_limit=5, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#10 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=...,
__depth_limit=7, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#11 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=...,
__depth_limit=9, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#12 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=...,
__depth_limit=11, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#13 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=...,
__depth_limit=13, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#14 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=...,
__depth_limit=14, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#15 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=...,
__depth_limit=15, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#16 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=...,
__depth_limit=16, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#17 0x0000556b8d39b2ab in std::__introsort_loop<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, long, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=...,
__depth_limit=17, __comp=...) at /usr/include/c++/7/bits/stl_algo.h:1954
#18 0x0000556b8d39b0ed in std::__sort<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, __gnu_cxx::__ops::_Iter_comp_iter<util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > > (__first=..., __last=..., __comp=...)
at /usr/include/c++/7/bits/stl_algo.h:1968
#19 0x0000556b8d39a84e in std::sort<util::ProxyIterator<lm::ngram::trie::(anonymous namespace)::PartialViewProxy>, util::SizedCompare<lm::ngram::trie::EntryCompare, lm::ngram::trie::(anonymous namespace)::PartialViewProxy> > (__first=..., __last=..., __comp=...) at /usr/include/c++/7/bits/stl_algo.h:4868
#20 0x0000556b8d399195 in lm::ngram::trie::(anonymous namespace)::WriteContextFile (
begin=0x556b8e971020 "\001", end=0x556b8e974ea0 "", temp_prefix="lrw_language_model/lm.binary",
entry_size=16, order=2 '\002') at /opt/kenlm/lm/trie_sort.cc:104
#21 0x0000556b8d39a4c9 in lm::ngram::trie::SortedFiles::ConvertToSorted (this=0x7ffcc3223740, f=...,
vocab=..., counts=std::vector of length 5, capacity 8 = {...},
file_prefix="lrw_language_model/lm.binary", order=2 '\002', warn=..., mem=0x556b8e971020,
mem_size=16000) at /opt/kenlm/lm/trie_sort.cc:284
#22 0x0000556b8d399dc2 in lm::ngram::trie::SortedFiles::SortedFiles (this=0x7ffcc3223740, config=...,
f=..., counts=std::vector of length 5, capacity 8 = {...}, buffer=16000,
file_prefix="lrw_language_model/lm.binary", vocab=...) at /opt/kenlm/lm/trie_sort.cc:229
#23 0x0000556b8d385195 in lm::ngram::trie::TrieSearch<lm::ngram::SeparatelyQuantize, lm::ngram::trie::ArrayBhiksha>::InitializeFromARPA (this=0x7ffcc3224490,
file=0x7ffcc3224d46 "lrw_language_model/lm_filtered.arpa", f=...,
counts=std::vector of length 5, capacity 8 = {...}, config=..., vocab=..., backing=...)
at /opt/kenlm/lm/search_trie.cc:585
#24 0x0000556b8d363656 in lm::ngram::detail::GenericModel<lm::ngram::trie::TrieSearch<lm::ngram::SeparatelyQuantize, lm::ngram::trie::ArrayBhiksha>, lm::ngram::SortedVocabulary>::InitializeFromARPA (
this=0x7ffcc3224310, fd=3, file=0x7ffcc3224d46 "lrw_language_model/lm_filtered.arpa", config=...)
at /opt/kenlm/lm/model.cc:118
#25 0x0000556b8d362204 in lm::ngram::detail::GenericModel<lm::ngram::trie::TrieSearch<lm::ngram::SeparatelyQuantize, lm::ngram::trie::ArrayBhiksha>, lm::ngram::SortedVocabulary>::GenericModel (
this=0x7ffcc3224310, file=0x7ffcc3224d46 "lrw_language_model/lm_filtered.arpa", init_config=...)
at /opt/kenlm/lm/model.cc:76
#26 0x0000556b8d3554f3 in lm::ngram::QuantArrayTrieModel::QuantArrayTrieModel (this=0x7ffcc3224310,
file=0x7ffcc3224d46 "lrw_language_model/lm_filtered.arpa", config=...)
at /opt/kenlm/lm/model.hh:141
#27 0x0000556b8d354799 in main (argc=9, argv=0x7ffcc3224a78) at /opt/kenlm/lm/build_binary_main.cc:215
This is just a hack that got it working for my use case. If you change the Allocate method at https://github.com/kpu/kenlm/blob/f01e12d83c7fd03ebe6656e0ad6d73a3e022bd50/util/pool.hh#L92 to
void *Allocate() {
  return backing_.Allocate(element_size_);
}
and recompile, it works.
Does 5cea457 fix this?
@Domhnall-Liopa Thanks for the debugging tip! That will cause a memory leak though so I'm hesitant to use that.
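To illustrate the concern about that hack, here is a toy model of a free-list pool. It is a sketch for intuition only and is not KenLM's `util::FreePool`: the class `ToyFreePool`, its `reuse_freed` switch, and `blocks_after_churn` are all hypothetical names. With reuse disabled, which mimics always allocating from the backing pool, freed blocks are never handed out again, so the backing storage grows with every allocation.

```python
from typing import List


class ToyFreePool:
    """Toy fixed-size block pool that counts how many backing blocks it used."""

    def __init__(self, reuse_freed: bool = True) -> None:
        self.reuse_freed = reuse_freed
        self.free_list: List[int] = []  # ids of blocks returned via free()
        self.backing_blocks = 0  # blocks ever taken from backing storage

    def allocate(self) -> int:
        # Normal behaviour: hand back a previously freed block if one exists.
        if self.reuse_freed and self.free_list:
            return self.free_list.pop()
        # Otherwise (or always, in the hacked variant) take a fresh block.
        block_id = self.backing_blocks
        self.backing_blocks += 1
        return block_id

    def free(self, block_id: int) -> None:
        self.free_list.append(block_id)


def blocks_after_churn(pool: ToyFreePool, rounds: int) -> int:
    """Allocate and immediately free one block, `rounds` times."""
    for _ in range(rounds):
        pool.free(pool.allocate())
    return pool.backing_blocks
```

In this model, a pool that reuses freed blocks needs a single backing block for any number of allocate/free rounds, while the variant that bypasses the free list needs one new block per round.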
@kpu Yes, 5cea457 fixes it for me! Thanks!
@kpu Yes, that fixes it. Thanks very much.