How to distribute the data to ep.train.txt, ep.dev.txt, and ep.test.txt? What's the purpose of these files?
head -n -400000 step2.txt > ./out/ep.train.txt
tail -n 400000 step2.txt > step3.txt
head -n -200000 step3.txt > ./out/ep.dev.txt
tail -n 200000 step3.txt > ./out/ep.test.txt
Hi ottokart, Could you elaborate on how to distribute the data from the corpus across these three files, and on what each file is for? I have a small corpus file: 65k lines and about 3M words. So I need to know how I should distribute the data among these files. Thanks!
Hi!
That's quite a small dataset. I think I would split it into 80% training, 10% dev and 10% test data. The training file is used for training the parameters of the model. The dev set is used for finding good hyperparameters (hidden layer size, learning rate, etc.), and the training script also uses the score on the dev set to decide when to stop training, to prevent overfitting. The test set is used for final evaluation and should not be touched during the training and development of the model.
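In case it helps, here is a rough sketch of an 80/10/10 split that works from the line count instead of the fixed 400000/200000 numbers in the commands above. The function name split_corpus and its two arguments are made up for illustration and are not part of the project. It assumes the input ends with a newline, and it takes contiguous blocks without shuffling, the same way the head/tail commands do.

```shell
# Hypothetical helper (not part of the original scripts):
# split a corpus into 80% train, 10% dev, 10% test by line count.
split_corpus() {
    input="$1"     # preprocessed corpus, e.g. step2.txt
    outdir="$2"    # output directory, e.g. ./out

    total=$(wc -l < "$input")
    n_dev=$((total / 10))                  # 10% for the dev set
    n_test=$((total / 10))                 # 10% for the test set
    n_train=$((total - n_dev - n_test))    # the rest (about 80%) for training

    mkdir -p "$outdir"
    # First n_train lines go to the training file.
    head -n "$n_train" "$input" > "$outdir/ep.train.txt"
    # The next n_dev lines go to the dev file.
    tail -n +"$((n_train + 1))" "$input" | head -n "$n_dev" > "$outdir/ep.dev.txt"
    # The last n_test lines go to the test file.
    tail -n "$n_test" "$input" > "$outdir/ep.test.txt"
}
```

With a 65k-line corpus, calling split_corpus step2.txt ./out would give roughly 52k training lines and 6.5k lines each for dev and test.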
Where can I find the dataset? Also, the sys.arg[0] code raises an error in all files.