torchmd-net

Log epoch real time in LNNP

Open · RaulPPelaez opened this pull request 1 year ago · 7 comments

This PR changes the LNNP module so that the real time elapsed since the start of training is logged each epoch. This allows tracking training time with the CSVLogger.
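A minimal sketch of the kind of change involved (illustrative only; `TimedModule` and `_start_time` are placeholder names, not the actual LNNP code), assuming PyTorch Lightning and its CSVLogger:

```python
import time

from pytorch_lightning import LightningModule


class TimedModule(LightningModule):
    """Illustrative module that logs wall-clock time since the start of training."""

    def on_fit_start(self):
        # Remember when training started.
        self._start_time = time.time()

    def on_validation_epoch_end(self):
        # Log the elapsed seconds; the CSVLogger writes this as a "time"
        # column in metrics.csv, one value per epoch.
        self.log("time", time.time() - self._start_time)
```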

RaulPPelaez avatar Oct 10 '23 09:10 RaulPPelaez

@AntonioMirarchi please review

RaulPPelaez avatar Oct 10 '23 10:10 RaulPPelaez

It works fine. This is the time column from metrics.csv:

0      4.779309
1      6.532793
2      8.925880
3     10.662247
4     13.062045
5     14.766587
6     17.175591
7     18.913141
8     21.361235
9     23.063034
10          NaN
Name: time, dtype: float64

The NaN in the last row is not clear to me.

AntonioMirarchi avatar Oct 10 '23 11:10 AntonioMirarchi

I think the NaN is due to the trainer reaching the maximum number of epochs: it does not start a new epoch, but it still creates a row in the metrics file.

AntonioMirarchi avatar Oct 10 '23 12:10 AntonioMirarchi
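For reference, a sketch of how the column above (and the trailing NaN row) can be inspected from the CSVLogger output; the file path is illustrative and depends on the logger configuration:

```python
import pandas as pd

# Path depends on how the CSVLogger is configured.
metrics = pd.read_csv("lightning_logs/version_0/metrics.csv")

print(metrics["time"])                   # the column shown above
print(metrics[metrics["time"].isna()])   # the extra row written without a time value
```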

I am only logging the time in on_validation_epoch_end and on_test_epoch_end. Perhaps there is some other place where data is logged?

RaulPPelaez avatar Oct 10 '23 16:10 RaulPPelaez

Antonio, can you give me a YAML config to reproduce the NaN you see? I cannot trigger it.

RaulPPelaez avatar Oct 11 '23 09:10 RaulPPelaez

I think the NaN Antonio reports comes from the test pass at the end of training. It is treated as a new "training" run, and for some reason its first time entry is NaN.

RaulPPelaez avatar Jan 19 '24 10:01 RaulPPelaez

I tried to log the epoch as an integer, but this produces a warning:

 You called `self.log('epoch', ...)` in your `on_validation_epoch_end` but the value needs to be floating to be reduced. Converting it to torch.float32. You can silence this warning by converting the value to floating point yourself. If you don't intend to reduce the value (for instance when logging the global step or epoch) then you can use `self.logger.log_metrics({'epoch': ...})` instead.

There is some (really unconvincing, imo) discussion of why one cannot log an integer: https://github.com/Lightning-AI/pytorch-lightning/issues/18739. The alternative suggested in the warning itself does not work as one would expect: you get a new line in metrics.csv in which everything is empty except epoch. cc @peastman
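For completeness, the conversion the warning asks for would look roughly like this (a sketch only; `EpochLoggingModule` is an illustrative name, not the actual LNNP class):

```python
from pytorch_lightning import LightningModule


class EpochLoggingModule(LightningModule):
    def on_validation_epoch_end(self):
        # current_epoch is an int; casting to float keeps self.log() from warning.
        self.log("epoch", float(self.current_epoch))
```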

RaulPPelaez avatar Jan 19 '24 11:01 RaulPPelaez