pgibbs
what(): std::bad_alloc
The process terminated when training with 1,000,000 tokens, with the following error message:
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
Thanks for the report. Could you give me the exact command that you ran? If you can share the data with me as well that would be ideal.
This is the command I executed:

./pgibbs-ws -iters 10000 -threads 100 -blocksize 16 -sampmeth block -skipiters 0 -samphyp true All1.txt output-ws

Main arguments:
 0: All1.txt
 1: output-ws
Optional arguments:
 -verbose 0
 -iters 10000
 -threads 100
 -blocksize 16
 -shuffle true
 -skipiters 0
 -printmod false
 -sampmeth block
 -randseed 0
 -sampparam true
 -n 2
 -maxlen 8
 -avglen 2.0
 -samphyp true
 -usepy false
 -str 0.1
 -disc 0.0
 -stra 1.0
 -disca 1.0
 -strb 1.0
 -discb 1.0
maxLen = 8
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
It is due to an allocation failure related to reading the file. I think that if we use mmap we can solve the problem. Solving this problem is important if we want to train the system on large amounts of raw text. This is the relevant part:
public:
    WordCorpus() { }
    // load the corpus, and pad on either side with words if necessary
    void load(istream & in, bool padSent = false, int padId = -1) {
        string line,str;
        while(getline(in,line)) {
            istringstream iss(line);
            vector<int> vals;
            if(padSent) vals.push_back(padId);
            while(iss >> str)
                vals.push_back(ids_.getId(str,true));
            if(padSent) vals.push_back(padId);
            push_back(vals);
        }
    }
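For illustration, here is a minimal, self-contained sketch of reading a corpus through mmap rather than an istream. This is not pgibbs code, only an example of the suggestion above; it maps the file and counts lines and whitespace-separated tokens without buffering the raw text on the heap.

// A minimal sketch of memory-mapping a corpus file instead of streaming it
// (NOT pgibbs code, just an illustration of the mmap suggestion above).
// The kernel pages the file in on demand, so the raw text never has to be
// copied into one large heap allocation.
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <iostream>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: " << argv[0] << " corpus.txt" << std::endl; return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return 1; }
    if (st.st_size == 0) { std::cout << "empty file" << std::endl; close(fd); return 0; }

    // Map the whole file read-only; the data is backed by the page cache,
    // not by a heap buffer the size of the file.
    void* map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); close(fd); return 1; }
    const char* data = static_cast<const char*>(map);
    const char* end = data + st.st_size;

    // Walk the mapping line by line, counting whitespace-separated tokens.
    size_t lines = 0, tokens = 0;
    const char* p = data;
    while (p < end) {
        const char* nl = static_cast<const char*>(memchr(p, '\n', end - p));
        const char* lineEnd = nl ? nl : end;
        bool inTok = false;
        for (const char* c = p; c < lineEnd; ++c) {
            if (*c == ' ' || *c == '\t' || *c == '\r') { inTok = false; }
            else if (!inTok) { inTok = true; ++tokens; }
        }
        ++lines;
        p = nl ? nl + 1 : end;
    }
    std::cout << lines << " lines, " << tokens << " tokens" << std::endl;

    munmap(map, st.st_size);
    close(fd);
    return 0;
}

Whether this actually avoids the bad_alloc depends on where the allocation fails: mmap only removes the need to buffer the raw file contents, while the integer-encoded corpus and the vocabulary would still live on the heap.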
You can download the corpus (Malayalam) from Sketch Engine: https://the.sketchengine.co.uk/bonito/run.cgi/wordlist?corpname=preloaded/malayalamwac;usesubcorp=;wlattr=word;wlminfreq=5;wlmaxfreq=0;wlpat=.%2A;wlicase=0;wlmaxitems=100;wlsort=f;ref_corpname=;ref_usesubcorp=;wlcache=;simple_n=1;wltype=simple;wlnums=frq;include_nonwords=0;blcache=;wlpage=2;usengrams=0;ngrams_n=2;complement_subc=0
Thanks Malkitti. I wasn't able to download the file, but I've trained on data sets ten times that size without problems, so it's hard to believe that data size is the issue. Maybe you can run pgibbs under gdb or valgrind and see where the error is occurring.
Separately from the actual error, I did notice that you are using more threads than the size of the block. This is not a useful setting, so I added some more documentation about it, and made the program die with an error when such a setting is used.
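For illustration, a check along these lines (a hypothetical sketch, not the actual pgibbs code) would reject such a configuration at startup:

// Hypothetical sketch of the kind of sanity check described above; the real
// pgibbs check and its error message may differ.
#include <cstdlib>
#include <iostream>

void checkThreadConfig(int threads, int blocksize) {
    if (threads > blocksize) {
        std::cerr << "Error: -threads (" << threads
                  << ") should not be larger than -blocksize ("
                  << blocksize << ")" << std::endl;
        std::exit(EXIT_FAILURE);
    }
}

int main() {
    checkThreadConfig(100, 16);  // the setting from the report above would abort here
    return 0;
}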
I have no way to share the original corpus. I want to train the system with 10 million tokens, which is large, and it is encoded in UTF-8. I can provide you with a small set of 500K tokens (Unicode). The block-size part is also a bit difficult for me to understand when it comes to sampling. I think the number of iterations also needs to be set according to the data size. When it comes to 1 billion tokens, it would be better to use mmap.
Anyway, I got another error message when I tried to set the flag -printmod true:
terminate called after throwing an instance of 'std::runtime_error'
  what():  WSModel::print not implemented yet