epochs_trained attribute not working (perhaps a bug)

Allena101 opened this issue on Feb 22 '24

I think there could be a bug with epochs_trained (or I am using it wrong).

I always get 0 when I access epochs_trained.

from darts.models import NBEATSModel

model = NBEATSModel(
    input_chunk_length=48,
    output_chunk_length=24,
    n_epochs=15,
    activation='LeakyReLU'
)
model.fit(ts['KWhT4'])

# returns 0 when it should be 15
model.epochs_trained

I seem to have the same issue with other models in darts as well.

Also, as I understand it, I should be able to run model.fit() several times to increase training time, so having access to epochs_trained would be helpful.

Allena101 · Feb 22 '24

Hi @Allena101,

Running your code snippet returns the expected result: the epochs_trained attribute is 15. However, if you call predict() or fit() again, a new PyTorch Lightning trainer is created and the number of epochs is reset to 0 (see #1922), hence the confusion.

If you want to retrain a model, it's recommended to load the weights from a checkpoint using load_weights_from_checkpoint() instead of calling fit() on the model repeatedly (see the user guide).
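
For example, a minimal sketch of that workflow; "nbeats_run" and series are placeholders for your own model name and training series:

from darts.models import NBEATSModel

# Day 1: train with checkpointing enabled so the weights are written to disk.
model = NBEATSModel(
    input_chunk_length=48,
    output_chunk_length=24,
    n_epochs=10,
    model_name="nbeats_run",    # placeholder name
    save_checkpoints=True,
)
model.fit(series)

# Day 2: create a model with the same architecture, load the latest weights,
# then call fit() again to continue from where the previous session stopped.
model2 = NBEATSModel(
    input_chunk_length=48,
    output_chunk_length=24,
    n_epochs=10,
    model_name="nbeats_run",
    save_checkpoints=True,
)
model2.load_weights_from_checkpoint(model_name="nbeats_run", best=False)
model2.fit(series)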

madtoinou · Feb 23 '24

Hello, thanks for taking a look at my issue, madtoinou!

I did try it several times, and I even updated darts using --upgrade git+https://github.com/unit8co/darts, but it did not work.

Then today when I tried it again it worked as intended, so I have no idea what I did wrong before. I apologize for wasting your time on this particular issue.

So with load_weights_from_checkpoint, does that mean that if I want to train my model for 10 epochs the first day and then a further 10 epochs the next day, I have to save the model (persist it) and, the next day, load it from the latest checkpoint?

Allena101 · Feb 28 '24

In the notebook you linked, both examples manually create a new model and then load the weights from the previously saved model using load_weights_from_checkpoint or load_weights. So there is no way to just load the saved model and continue to train it? Not fine-tuning or retraining, just continued training (i.e. continuing from where the model last left off).

I realize that creating an identical model and then loading the latest weights (best=False) would accomplish the same thing, but that entails knowing the structure of the previously saved model.

Allena101 · Feb 29 '24

It's possible, but it comes with no guarantees about the "correctness" of the model attributes: a new Trainer will be created, and some of the model's attributes will reference it (epochs_trained being one of them), so you will have to keep track of these yourself.
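
For instance, persisting the whole model object avoids re-declaring the architecture (a minimal sketch; the filename is a placeholder, and model and series are the fitted model and training series from earlier):

from darts.models import NBEATSModel

# Persist the full model object (architecture, weights, training setup).
model.save("nbeats_model.pt")    # placeholder filename

# Later: restore it without re-declaring the architecture and keep training.
loaded_model = NBEATSModel.load("nbeats_model.pt")
loaded_model.fit(series)         # note: a fresh Trainer is created here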

An example of something else that could "go wrong" after calling fit() consecutively: if you try to plot your loss/learning rate over the epochs, some values on the x axis will be duplicated, and you will be responsible for shifting them appropriately to obtain a plot over the entire training process.
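
A hypothetical sketch of that bookkeeping, with made-up loss values (collecting the per-epoch losses themselves, e.g. through a PyTorch Lightning logger, is left out):

import matplotlib.pyplot as plt

# Made-up per-epoch losses from two consecutive fit() calls; each new
# trainer restarts its epoch counter at 0.
losses_run1 = [1.00, 0.80, 0.65, 0.55, 0.50]
losses_run2 = [0.48, 0.45, 0.43, 0.41, 0.40]

# Shift the second run's epoch numbers by the length of the first run so
# the x axis covers the whole training process without duplicated values.
epochs_run1 = list(range(len(losses_run1)))
epochs_run2 = list(range(len(losses_run1), len(losses_run1) + len(losses_run2)))

plt.plot(epochs_run1, losses_run1, label="first fit()")
plt.plot(epochs_run2, losses_run2, label="second fit()")
plt.xlabel("epoch (shifted)")
plt.ylabel("training loss")
plt.legend()
plt.show()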

madtoinou · Feb 29 '24

I understand, and your answer makes sense. Is there some way to 'know' that you have successfully trained the new model from the old model's checkpoint, besides monitoring that the loss is lower or that you get better backtesting scores? I just want some sanity check that I am not somehow retraining the model from scratch instead of continuing its training.

Allena101 · Mar 04 '24

I would say that monitoring the loss seems like the best and simplest way to make sure that the training is not starting over and that "fine-tuning/retraining" is indeed happening.
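
As a concrete illustration of that check (a hypothetical sketch with made-up loss values): if training truly continued, the first-epoch loss of the new session should sit near the final loss of the previous session rather than jumping back to the untrained level.

# Hypothetical per-epoch losses recorded from the two sessions
# (e.g. through a PyTorch Lightning logger); the values are made up.
last_loss_previous_session = 0.40   # final epoch of the first fit()
first_loss_new_session = 0.41       # first epoch after loading the checkpoint

# A model restarted from scratch would instead start near its initial
# untrained loss (here around 1.0).
if abs(first_loss_new_session - last_loss_previous_session) < 0.05:
    print("Training appears to have continued from the checkpoint.")
else:
    print("Loss jumped back up; the model may have restarted from scratch.")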

madtoinou · Mar 04 '24