
How big is your truecase model?

erksch opened this issue Jan 20 '20 • 8 comments

Hey there!

I've trained a truecase model for German on a dataset of 1 million sentences. The resulting model is quite big (80 MB), and I'm running into memory issues when I include it in my annotation pipeline. When I use the English truecase model instead, there are no issues.

How big is your truecase model (edu/stanford/nlp/models/truecase/truecasing.fast.caseless.qn.ser.gz)? And how does the annotator impact the memory consumption of the pipeline?
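
For context, this is roughly how I wire the truecaser into the pipeline; the property names are my best guess from the docs, with truecase.model pointing at my locally trained file:

annotators = tokenize,ssplit,pos,lemma,truecase
truecase.model = /path/to/truecasing.fast.caseless.qn.ser.gz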

erksch · Jan 20 '20 13:01

OK, I unpacked your jar and found out: truecasing.fast.caseless.qn.ser.gz is 15.8 MB.
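
In case anyone wants to reproduce the check, something along these lines should work (jar name adjusted to your CoreNLP version):

unzip -l stanford-corenlp-models.jar | grep truecasing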

But how is this possible? You are training on 4.5 million sentences and I am training on only 1 million. I have the exact same configuration except for:

useQN=false (I have true)
l1reg=1.0 (I don't have this line)

because I read somewhere that I can only use the QN minimizer, and training throws an error when I use that configuration.

Is it maybe because German simply has more distinct word forms in general (I don't know if that's true)?

erksch · Jan 20 '20 14:01

This is my whole training configuration:
serializeTo=truecasing.fast.caseless.qn.ser.gz
trainFileList=data.train
testFile=data.test

map=word=0,answer=1

wordFunction = edu.stanford.nlp.process.LowercaseFunction

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useLongSequences=true
useSequences=true
usePrevSequences=true
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
useOccurrencePatterns=true
useLastRealWord=true
useNextRealWord=true
useDisjunctive=true
disjunctionWidth=5
wordShape=chris2useLC
usePosition=true
useBeginSent=true
useTitle=true

useObservedSequencesOnly=true
saveFeatureIndexToDisk=true
normalize=true

useQN=true
QNSize=25

maxLeft=1

readerAndWriter=edu.stanford.nlp.sequences.TrueCasingForNISTDocumentReaderAndWriter
featureFactory=edu.stanford.nlp.ie.NERFeatureFactory

featureDiffThresh=0.02
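
For completeness, I launch training roughly like this; the heap size, jar path, and properties file name are placeholders for my actual setup:

java -Xmx16g -cp stanford-corenlp.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop truecaser.properties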

erksch · Jan 20 '20 14:01

What happens if you add back the l1reg? That should force weights to 0, which should reduce the size of the final model.

Also, I recently retrained the model on 1.5M sentences and the resulting model is significantly bigger, at 48 MB.
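
Concretely, going by your comparison above, that would mean setting these two lines back to what our config had (the 1.0 is just the value from that config; it's a hyperparameter you could tune):

useQN=false
l1reg=1.0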

AngledLuffa · Jan 20 '20 17:01

When I add l1reg I get the following error:

Exception in thread "main" edu.stanford.nlp.util.ReflectionLoading$ReflectionLoadingException: Error creating edu.stanford.nlp.optimization.OWLQNMinimizer
	at edu.stanford.nlp.util.ReflectionLoading.loadByReflection(ReflectionLoading.java:38)
	at edu.stanford.nlp.ie.crf.CRFClassifier.getMinimizer(CRFClassifier.java:2003)
	at edu.stanford.nlp.ie.crf.CRFClassifier.trainWeights(CRFClassifier.java:1902)
	at edu.stanford.nlp.ie.crf.CRFClassifier.train(CRFClassifier.java:1742)
	at edu.stanford.nlp.ie.AbstractSequenceClassifier.train(AbstractSequenceClassifier.java:785)
	at edu.stanford.nlp.ie.AbstractSequenceClassifier.train(AbstractSequenceClassifier.java:756)
	at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:3011)
Caused by: edu.stanford.nlp.util.MetaClass$ClassCreationException: java.lang.ClassNotFoundException: edu.stanford.nlp.optimization.OWLQNMinimizer
	at edu.stanford.nlp.util.MetaClass.createFactory(MetaClass.java:364)
	at edu.stanford.nlp.util.MetaClass.createInstance(MetaClass.java:381)
	at edu.stanford.nlp.util.ReflectionLoading.loadByReflection(ReflectionLoading.java:36)
	... 6 more
Caused by: java.lang.ClassNotFoundException: edu.stanford.nlp.optimization.OWLQNMinimizer
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:582)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:315)
	at edu.stanford.nlp.util.MetaClass$ClassFactory.construct(MetaClass.java:135)
	at edu.stanford.nlp.util.MetaClass$ClassFactory.<init>(MetaClass.java:202)
	at edu.stanford.nlp.util.MetaClass$ClassFactory.<init>(MetaClass.java:69)
	at edu.stanford.nl

From what I've read, this tries to use OWLQNMinimizer, which is not publicly available in CoreNLP, so the class is not found.

Right, we are not licensed to release that optimizer. [...] add the flag useQN=true

It turns out you also need to turn off l1reg (remove the l1reg=... flag) to use the QN implementation. For all I know, turning off the regularization may make the classifier much worse, unfortunately.

From this thread

erksch · Jan 20 '20 17:01

Ah, that's a good point. I'll check with our PI to see if things have changed in terms of what we can publicly release.

AngledLuffa · Jan 20 '20 17:01

Nice, thank you very much! Did you train with the normal QNMinimizer for your recent retraining?

erksch · Jan 20 '20 20:01

Sorry (again) for the late reply. What should work is the following parameters:

useQN=true
useOWLQN=true
priorLambda=(some hyperparameter)
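
Plugged into the training properties posted earlier in the thread, the optimizer block would then look something like this; the priorLambda value below is only a placeholder for whatever you tune it to:

useQN=true
QNSize=25
useOWLQN=true
priorLambda=1.0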

AngledLuffa · Apr 02 '20 21:04

Thank you! I think I'll retrain our truecaser in the next few days and give these parameters a try!

erksch · Apr 03 '20 09:04