pre-train model improvement

Open dickhung opened this issue 6 years ago • 3 comments

I used the pre-trained model Demo-Europarl-EN.pcl for punctuation prediction, and the results are as follows:


PUNCTUATION        PRECISION  RECALL  F-SCORE
,COMMA                  71.9    75.5     73.7
.PERIOD                 74.2    32.9     45.6
?QUESTIONMARK           58.3    11.3     18.9
!EXCLAMATIONMARK         nan     0.0      nan
:COLON                  55.2    26.7     36.0
;SEMICOLON              33.3     3.8      6.9
-DASH                   40.6     9.7     15.7
Overall                 72.0    55.3     62.5

Err: 5.86%

SER: 60.7%
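A note for reading the table, assuming the scores follow the standard definitions (the thread does not spell them out): the F-score is the harmonic mean of precision and recall, so low recall pulls it down even when precision is reasonable. Worked through for the .PERIOD row:

```latex
F = \frac{2PR}{P + R}, \qquad
F_{\text{.PERIOD}} = \frac{2 \times 74.2 \times 32.9}{74.2 + 32.9}
                   = \frac{4882.36}{107.1} \approx 45.6
```

This agrees with the reported value, which suggests the weak .PERIOD result comes from recall (32.9) rather than precision (74.2).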

This was run with the following configuration on Ubuntu 16.04:

Theano Version: 1.0.4+10.g9feed7868

Python 3.6.8 :: Anaconda, Inc.

Can you advise how I can improve the performance to match the baseline model?

Thanks a lot

dickhung • May 06 '19 19:05

Did you run the evaluation on the Europarl test set or some other kind of text? The performance described in the readme was achieved on the test set of Europarl. It is quite normal that on text that is much different in style compared to Europarl transcripts, the performance declines significantly. Also, some text is more difficult (e.g., interviews with many disfluencies, interruptions and other problems).

ottokart • May 08 '19 13:05

Thanks for the kind reply. I actually ran the test on the Europarl dataset as well. The dataset files were generated exactly as in run.sh in the example folder, after unzipping the http://hltshare.fbk.eu/IWSLT2012/training-monolingual-europarl.tgz data. I really don't know why the F-scores for the period and the question mark dropped by about half when using Demo-Europarl-EN.pcl.

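The dataset generation script is as follows:

# Everything except the last 400,000 lines becomes the training set
head -n -400000 step2.txt > ./out/ep.train.txt
# Hold out the last 400,000 lines
tail -n 400000 step2.txt > step3.txt
# Split the held-out lines in half: first 200,000 for dev, last 200,000 for test
head -n -200000 step3.txt > ./out/ep.dev.txt
tail -n 200000 step3.txt > ./out/ep.test.txt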
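One way to rule out the split itself as a cause is to check that the three files are disjoint and together reproduce the input. The sketch below applies the same head/tail pattern to a toy 10-line file. The temporary directory, the toy data, and the scaled-down counts (4 and 2 instead of 400000 and 200000) are illustrative assumptions, not part of run.sh. It relies on GNU head, since a negative count for `-n` is a GNU coreutils extension.

```shell
# Toy stand-in for step2.txt: 10 numbered lines (hypothetical data).
tmp=$(mktemp -d)
seq 1 10 > "$tmp/step2.txt"

# Same pattern as the real script, scaled down:
# hold out the last 4 lines, then split them 2/2 into dev and test.
head -n -4 "$tmp/step2.txt" > "$tmp/train.txt"
tail -n 4  "$tmp/step2.txt" > "$tmp/step3.txt"
head -n -2 "$tmp/step3.txt" > "$tmp/dev.txt"
tail -n 2  "$tmp/step3.txt" > "$tmp/test.txt"

# Line counts of each split.
wc -l < "$tmp/train.txt"
wc -l < "$tmp/dev.txt"
wc -l < "$tmp/test.txt"

# The three splits, concatenated in order, should equal the input exactly.
if cat "$tmp/train.txt" "$tmp/dev.txt" "$tmp/test.txt" | cmp -s - "$tmp/step2.txt"; then
    echo "splits are disjoint and cover the input"
else
    echo "splits do not reproduce the input"
fi
```

Running the same `cmp` check against the real ep.train.txt, ep.dev.txt, and ep.test.txt would confirm the test set is the final 200,000 lines of step2.txt; it says nothing about whether step2.txt itself matches the data the published model was evaluated on.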

Thanks a lot.

dickhung • May 09 '19 02:05

@dickhung Where did you get the step2.txt and step3.txt files? Those aren't present in training-monolingual-europarl.tgz.

chrisspen • Aug 07 '20 02:08