
Problems when training

Open coleea opened this issue 7 years ago • 2 comments

Hello. I am trying to use 'mecab-cost-train' on a corpus file that contains over 10,000,000 lines. When run, it uses over 300 GB of memory, so it is impossible to finish 'mecab-cost-train'. Why does it require so much memory? My guess is that generating too many 'EncoderLearnerTagger' objects is the cause of this problem (line 115 of https://github.com/taku910/mecab/blob/32041d9504d11683ef80a6556173ff43f79d1268/mecab/src/learner.cpp#L142).

Is there any solution for training a corpus that contains over 10,000,000 lines? Thank you.

coleea avatar May 24 '17 06:05 coleea

I used to get this all the time too. There's some option I remember seeing in the official doc which I tried, but it did not work. In the end I contented myself with dividing the corpus and retraining successively. Not an ideal solution, but a workaround. I'd like to hear from others too. Yo
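A rough sketch of that divide-and-retrain workaround is below. The chunking with `split` is standard; the `mecab-cost-train` invocations (including the `-M` flag for continuing from a previous model, the `seed` dictionary path, and the `-c` regularization constant) are assumptions based on the MeCab learning documentation, so verify them against your installation. The commands are echoed as a dry run rather than executed.

```shell
#!/bin/sh
set -e

# Stand-in corpus for illustration; in practice this would be the
# real 10,000,000-line annotated training file.
printf 'line1\nline2\nline3\nline4\n' > corpus.txt

SEED=seed   # seed dictionary directory (hypothetical path)
CHUNK=2     # lines per chunk; use e.g. 1000000 for a real corpus

# Split the corpus into fixed-size chunks: part_aa, part_ab, ...
split -l "$CHUNK" corpus.txt part_

# Train on the first chunk from scratch, then feed each later chunk
# to mecab-cost-train with -M so it continues from the previous model.
prev=""
i=0
for part in part_*; do
  model="model.$i"
  if [ -z "$prev" ]; then
    echo mecab-cost-train -d "$SEED" -c 1.0 "$part" "$model"
  else
    echo mecab-cost-train -M "$prev" -d "$SEED" -c 1.0 "$part" "$model"
  fi
  prev="$model"
  i=$((i + 1))
done
```

Each pass only needs one chunk in memory at a time, which is the point of the workaround; the trade-off is that successive retraining is not guaranteed to reach the same optimum as a single run over the whole corpus.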



yosato avatar May 24 '17 18:05 yosato

You saved me. Thank you.

coleea avatar Jul 03 '17 09:07 coleea