punctuator2

How to distribute the data to ep.train.txt, ep.dev.txt, and ep.test.txt? What's the purpose of these files?

Open wltz opened this issue 7 years ago • 2 comments

head -n -400000 step2.txt > ./out/ep.train.txt
tail -n 400000 step2.txt > step3.txt
head -n -200000 step3.txt > ./out/ep.dev.txt
tail -n 200000 step3.txt > ./out/ep.test.txt

Hi ottokart, could you elaborate on how to distribute the data from the corpus to these three files? And what's the purpose of these files? I have a small corpus file, 65k lines and about 3M words, so I need to know how I should distribute the data to these files. Thanks!

wltz, Sep 27 '18 15:09

Hi!

That's quite a small dataset. I think I would split it into 80% training, 10% dev, and 10% test data. The training file is used for training the parameters of the model. The dev set is used for finding good hyperparameters (hidden layer size, learning rate, etc.), and the training script also uses the score on the dev set to decide when to stop training, which prevents overfitting. The test set is used for final evaluation and should not be touched during the training and development of the model.
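
A minimal sketch of such an 80/10/10 split, in the same head/tail style as the commands in the question. The function name `split_corpus` and its two arguments are made up for illustration and are not part of the repository. It assumes the input has one example per line and that the lines are already in the order you want to split them in.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the repository): split a corpus into
# 80% train, 10% dev, 10% test by line count, preserving line order.
split_corpus() {
    local input="$1" outdir="$2"
    local total n_train n_dev n_test

    total=$(wc -l < "$input")
    n_dev=$(( total / 10 ))                  # 10% of the lines
    n_test=$(( total / 10 ))                 # 10% of the lines
    n_train=$(( total - n_dev - n_test ))    # the rest, about 80%

    mkdir -p "$outdir"
    # First n_train lines go to the training file.
    head -n "$n_train" "$input" > "$outdir/ep.train.txt"
    # The next n_dev lines go to the dev file.
    tail -n +"$(( n_train + 1 ))" "$input" | head -n "$n_dev" > "$outdir/ep.dev.txt"
    # The last n_test lines go to the test file.
    tail -n "$n_test" "$input" > "$outdir/ep.test.txt"
}
```

Called as `split_corpus step2.txt ./out`, this would give roughly 52k training lines and about 6.5k lines each for dev and test on a corpus of about 65k lines. The commands in the question instead cut a fixed 200000 lines each for dev and test from the end of the file, which is more than a corpus of this size contains.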

ottokart, May 08 '19 13:05

Where can I find the dataset? Also, the code sys.arg[0] causes an error in all files.

AbdallahQoutbAli, Mar 24 '20 18:03