
Validating the model

Open Magpi007 opened this issue 6 years ago • 7 comments

Hi,

I would like to know whether the model over-fits, and also the optimum number of epochs, by plotting accuracy and loss as shown here. Would it be possible to do this with this repo without making too many changes (maybe using the evaluation results as validation)?

Thanks.

Magpi007 avatar Oct 03 '19 09:10 Magpi007

You can get the training loss without any changes. You can use TensorBoardX to get a graph of the training loss; the loss information is written to the 'runs' directory.

If you want to evaluate on the dev set during training, you can set evaluate_during_training to True in the args dict.

If you want to add additional information to that, you can use additional tb_writer.add_scalar() calls inside the train function.
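As a concrete illustration, a minimal sketch of the relevant args entries (key names taken from the thread; the value for logging_steps is only an example):

```python
# Sketch of the args dict settings mentioned above.
# 'evaluate_during_training' and 'logging_steps' are the key names
# referenced in this thread; 50 is an illustrative value.
args = {
    'evaluate_during_training': True,  # run dev-set evaluation while training
    'logging_steps': 50,               # how often loss/lr scalars are written
}

print(args['evaluate_during_training'], args['logging_steps'])
```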

ThilinaRajapakse avatar Oct 03 '19 09:10 ThilinaRajapakse

The concept is simple but I am still not able to plot anything.

In the training function we have these lines:

tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args['logging_steps'], global_step)

If I understand it right, the first call logs every evaluation result we have generated, and the second and third log the learning rate and the loss at each logging step.
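Note that the third call writes a windowed average, not the raw running total: (tr_loss - logging_loss) / logging_steps is the mean loss over the last logging_steps steps. A minimal sketch of that bookkeeping, using made-up per-batch losses in place of real training output:

```python
# Sketch of the windowed-loss bookkeeping from the train loop.
# step_losses stands in for per-batch losses produced during training.
logging_steps = 4
tr_loss, logging_loss = 0.0, 0.0

step_losses = [0.9, 0.8, 0.7, 0.6, 0.5, 0.5, 0.4, 0.4]
logged = []
for step, loss in enumerate(step_losses, start=1):
    tr_loss += loss
    if step % logging_steps == 0:
        # Average loss over the last `logging_steps` steps -- this is the
        # value passed to tb_writer.add_scalar('loss', ...).
        logged.append((tr_loss - logging_loss) / logging_steps)
        logging_loss = tr_loss

print(logged)  # one averaged value per window of 4 steps
```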

In this case I am going to plot this info with only one epoch, but it should still show something. As per the documentation, I understand that we only need to launch tensorboard --logdir runs (since we store the scalars in the runs directory), am I right?

I get no error message at any point of the implementation (having activated the evaluate_during_training option), but when I try to plot it I get this error:

[screenshot of the error]

There is a folder called runs in the experiment folder.

Magpi007 avatar Oct 09 '19 03:10 Magpi007

There should be a subdirectory inside runs for every training run. So your command would look like tensorboard --logdir=runs/subdirectory.

To visualize the most recent run, you can use the line below. tensorboard --logdir=$(ls -td runs/*/ | head -1)
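For reference, ls -td runs/*/ lists the run subdirectories newest first, so piping to head -1 selects the latest one. A quick sketch (the directory name here is made up for demonstration):

```shell
# Create a dummy run directory so the glob has something to match.
mkdir -p runs/demo_run

# -t sorts by modification time (newest first); -d keeps directory names
# instead of listing their contents.
latest=$(ls -td runs/*/ | head -1)
echo "$latest"
```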

ThilinaRajapakse avatar Oct 09 '19 03:10 ThilinaRajapakse

There is a directory called Oct09_03-14-56_31dd366812b4, but when I run this line:

!tensorboard --logdir="runs/Oct09_03-14-56_31dd366812b4" --host localhost --port 8088

I get a message saying that the site http://localhost:8088/ can't be reached (localhost refused to connect).

I have tried different ports with no luck. I have been researching on the internet and some people say it's possible to achieve this using a tunnel, ngrok, here. Before trying it I would like to ask you if that makes sense, or if it should work straight out of Google Colab.

Magpi007 avatar Oct 09 '19 04:10 Magpi007

Supposedly it wouldn't be needed...

https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks

Magpi007 avatar Oct 09 '19 05:10 Magpi007

OK, I think I got it.

First I loaded the TensorBoard notebook extension: %load_ext tensorboard

And ran TensorBoard with: %tensorboard --logdir=runs/Oct09_03-14-56_31dd366812b4

So I got the dashboard.

[screenshot of the TensorBoard dashboard]

I need to play with it a bit more to confirm it's working, but it looks like it is.

Magpi007 avatar Oct 09 '19 05:10 Magpi007

Great to see you got it to work. I didn't realize you were on Colab!

ThilinaRajapakse avatar Oct 09 '19 07:10 ThilinaRajapakse