Question about the training loss and validation loss.

fluency03 opened this issue 8 years ago • 2 comments

As you have said in the following:

If your training loss is much lower than validation loss then this means the network might be overfitting. Solutions to this are to decrease your network size, or to increase dropout. For example you could try dropout of 0.5 and so on.

If your training/validation loss are about equal then your model is underfitting. Increase the size of your model (either number of layers or the raw number of neurons per layer)
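
For concreteness, those two adjustments correspond to existing train.lua flags; the commands below are only a rough sketch, with illustrative sizes and a placeholder -data_dir rather than recommended settings:

# overfitting (training loss << validation loss): shrink the network or raise dropout
th train.lua -data_dir data/your_dataset -num_layers 2 -rnn_size 128 -dropout 0.5
# underfitting (training loss roughly equal to validation loss): grow the network
th train.lua -data_dir data/your_dataset -num_layers 3 -rnn_size 512 -dropout 0.5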

The first part is quite clear. Regarding the second part, my question is:

If training loss << validation loss, it is overfitting; if training loss is roughly equal to validation loss, it is underfitting. Then, what is the balanced situation? Is it training loss > validation loss, or training loss lower than, but not much lower than, validation loss?

I do not think training loss > validation loss will happen, right?

fluency03 avatar Mar 29 '16 11:03 fluency03

I have the same question.

I have, however, gotten my Training Loss > Validation Loss by increasing the Dropout to > 0.8, though that did make the Validation/Training Loss take about twice as long (2x the epochs) to reach the lowest Validation Loss I could get.

I also have a follow-up: what is a good Validation Loss to reach for decent generation of data? (I know this could differ for a given data set.) No matter what variables I change, I can't get my lowest Validation Loss below 0.5. Most of the time the Validation Loss will get close to 0.5, then start going back up. This would suggest that I'm overfitting, if I'm not mistaken.
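
One practical way to deal with that U-shaped validation curve, sketched here only with the flags already used in this thread: evaluate (and, in char-rnn, checkpoint) more often, then sample from whichever .t7 checkpoint has the lowest validation loss in its filename. The -eval_val_every value below is just illustrative:

th train.lua -data_dir data/mtg/ -num_layers 3 -rnn_size 512 -seq_length 300 -dropout 0.5 -eval_val_every 100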

About my data: I have a 1 MB text file (all Magic cards stripped down to useful information in JSON format; you can view it here). I'd like to have a bigger data set, but this is already all the cards ever produced. With any given variables I get about 3 million to 5 million parameters. It also takes only about 15 - 20 epochs for the Validation Loss to get to around 0.5 before it won't go any lower, or starts going back up. Each "card" is between 100 - 400 characters long. The cards have been pre-shuffled (mainly so that like-colored cards are not next to each other).

The "best" run I've done is the following (lowest Validation Loss): th train.lua -data_dir data/mtg/ -num_layers 3 -rnn_size 512 -seq_length 300 -train_frac 0.95 -val_frac 0.05 -max_epochs 20 -seed $RANDOM -batch_size 25 -eval_val_every 200 -dropout 0.5 This produced a final Validation Loss of 0.4969 after the full 20 epochs (with the previous 7 epochs all being around 0.5).

All of my tests have been on the base data, meaning I have not been running it with the -init_from option on previous runs. The few times I have tried this, the Training Loss either goes out of whack right away, or it doesn't produce any better minimum Validation Loss. Would running from previous save points help, or be any different from running the code for longer? I have the time and power to run this over thousands of epochs, but so far that hasn't seemed to help.
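
For reference, a resumed run would look roughly like the sketch below, where the checkpoint path is just the .t7 file mentioned later in this comment and the remaining flags mirror the run above; the extended -max_epochs value is only illustrative:

th train.lua -data_dir data/mtg/ -init_from cv/lm_lstm_epoch20.00_0.4969.t7 -seq_length 300 -batch_size 25 -dropout 0.5 -max_epochs 40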

It's hard to really tell if any of my .t7 files are any better than the rest, as so far they're fairly comparable. And it's not as though the "cards" it produces are really that "bad". But there are some patterns that I would like the code to pick up on. Like when cards reference themselves: the generated code never produces a card that references its own name (it'll put some other random name instead). Or cards with bullet points have "Choose one or both" before them, but none of the generated cards have this. I know this also has to do with the Temperature when sampling. I've found that anything with a Temperature below 0.5 only creates very rudimentary cards, and anything above 0.9 creates mostly gibberish. I've been generating all of my cards at 0.7:

th sample.lua -length 5000 -temperature 0.7 -primetext "{\"Name\":\"Storm Crow\"," cv/lm_lstm_epoch20.00_0.4969.t7

drohack avatar May 05 '16 03:05 drohack

I think this stackoverflow answer covers the confusion.

calicratis19 avatar May 02 '18 06:05 calicratis19