pgibbs
what(): std::bad_alloc
The process terminated when training with 1,000,000 tokens, with the following error message:
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
Thanks for the report. Could you give me the exact command that you ran? If you can share the data with me as well that would be ideal.
This is the command I executed:

./pgibbs-ws -iters 10000 -threads 100 -blocksize 16 -sampmeth block -skipiters 0 -samphyp true All1.txt output-ws

Main arguments:
 0: All1.txt
 1: output-ws
Optional arguments:
 -verbose 0
 -iters 10000
 -threads 100
 -blocksize 16
 -shuffle true
 -skipiters 0
 -printmod false
 -sampmeth block
 -randseed 0
 -sampparam true
 -n 2
 -maxlen 8
 -avglen 2.0
 -samphyp true
 -usepy false
 -str 0.1
 -disc 0.0
 -stra 1.0
 -disca 1.0
 -strb 1.0
 -discb 1.0
maxLen = 8
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
It is due to an allocation failure related to reading the file. I think that if we use mmap we can solve the problem. Solving this problem is important if we want to train the system on large amounts of raw text. This is the relevant part:
public:
    WordCorpus() { }
    // load the corpus, and pad on either side with words if necessary
    void load(istream & in, bool padSent = false, int padId = -1) {
        string line,str;
        while(getline(in,line)) {
            istringstream iss(line);
            vector<int> vals;
            if(padSent) vals.push_back(padId);
            while(iss >> str)
                vals.push_back(ids_.getId(str,true));
            if(padSent) vals.push_back(padId);
            push_back(vals);
        }
    }
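For illustration, here is a minimal, self-contained sketch of reading a corpus through mmap rather than an istream. This is not pgibbs code, only an example of the suggestion above; it maps the file and counts lines and whitespace-separated tokens without buffering the raw text on the heap.

// A minimal sketch of memory-mapping a corpus file instead of streaming it
// (NOT pgibbs code, just an illustration of the mmap suggestion above).
// The kernel pages the file in on demand, so the raw text never has to be
// copied into one large heap allocation.
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <iostream>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: " << argv[0] << " corpus.txt" << std::endl; return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return 1; }
    if (st.st_size == 0) { std::cout << "empty file" << std::endl; close(fd); return 0; }

    // Map the whole file read-only; the data is backed by the page cache,
    // not by a heap buffer the size of the file.
    void* map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); close(fd); return 1; }
    const char* data = static_cast<const char*>(map);
    const char* end = data + st.st_size;

    // Walk the mapping line by line, counting whitespace-separated tokens.
    size_t lines = 0, tokens = 0;
    const char* p = data;
    while (p < end) {
        const char* nl = static_cast<const char*>(memchr(p, '\n', end - p));
        const char* lineEnd = nl ? nl : end;
        bool inTok = false;
        for (const char* c = p; c < lineEnd; ++c) {
            if (*c == ' ' || *c == '\t' || *c == '\r') { inTok = false; }
            else if (!inTok) { inTok = true; ++tokens; }
        }
        ++lines;
        p = nl ? nl + 1 : end;
    }
    std::cout << lines << " lines, " << tokens << " tokens" << std::endl;

    munmap(map, st.st_size);
    close(fd);
    return 0;
}

Whether this actually avoids the bad_alloc depends on where the allocation fails: mmap only removes the need to buffer the raw file contents, while the integer-encoded corpus and the vocabulary would still live on the heap.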
You can download the corpus (Malayalam) from Sketch Engine: https://the.sketchengine.co.uk/bonito/run.cgi/wordlist?corpname=preloaded/malayalamwac;usesubcorp=;wlattr=word;wlminfreq=5;wlmaxfreq=0;wlpat=.%2A;wlicase=0;wlmaxitems=100;wlsort=f;ref_corpname=;ref_usesubcorp=;wlcache=;simple_n=1;wltype=simple;wlnums=frq;include_nonwords=0;blcache=;wlpage=2;usengrams=0;ngrams_n=2;complement_subc=0
Thanks Malkitti. I wasn't able to download the file, but I've trained on data sets ten times that size without problems, so it's hard to believe that data size is the issue. Maybe you can run pgibbs under gdb or valgrind and see where the error is occurring.
Separately from the actual error, I did notice that you are using more threads than the size of the block. This is not a useful setting, so I added some more documentation about it, and made the program die with an error when such a setting is used.
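For illustration, a check along these lines (a hypothetical sketch, not the actual pgibbs code) would reject such a configuration at startup:

// Hypothetical sketch of the kind of sanity check described above; the real
// pgibbs check and its error message may differ.
#include <cstdlib>
#include <iostream>

void checkThreadConfig(int threads, int blocksize) {
    if (threads > blocksize) {
        std::cerr << "Error: -threads (" << threads
                  << ") should not be larger than -blocksize ("
                  << blocksize << ")" << std::endl;
        std::exit(EXIT_FAILURE);
    }
}

int main() {
    checkThreadConfig(100, 16);  // the setting from the report above would abort here
    return 0;
}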
I have no way to share the original corpus. I want to train the system with 10 million tokens, which is large, and it is encoded in UTF-8. I can provide you with a small set of 500K tokens (Unicode). The block-size part is also a bit difficult for me to understand when it comes to sampling. I think the number of iterations also needs to be set according to the data size. When it comes to 1 billion tokens, it would be better to use mmap.
Anyway, I got another error message when I tried to set the flag -printmod true:
terminate called after throwing an instance of 'std::runtime_error'
  what():  WSModel::print not implemented yet