
NODE path error

Open · duncanmcelfresh opened this issue on Sep 16, 2022 · 5 comments

Something goes wrong when saving model files or checkpoints. Traceback:

Traceback (most recent call last):
  File "/home/shared/tabzilla/TabSurvey/tabzilla_experiment.py", line 136, in __call__
    result = cross_validation(model, self.dataset, self.time_limit)
  File "/home/shared/tabzilla/TabSurvey/tabzilla_utils.py", line 237, in cross_validation
    loss_history, val_loss_history = curr_model.fit(
  File "/home/shared/tabzilla/TabSurvey/models/node.py", line 174, in fit
    self.trainer.load_checkpoint(tag="best")
  File "/home/shared/tabzilla/TabSurvey/models/node_lib/trainer.py", line 73, in load_checkpoint
    checkpoint = torch.load(path)
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/torch/serialization.py", line 231, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/torch/serialization.py", line 212, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'logs/openml__arrhythmia__5_2022.09.16_20:21:07/checkpoint_best.pth'

duncanmcelfresh · Sep 16, 2022

The issue seems to occur when logging_period is greater than the number of epochs. Checkpoints are only saved periodically (every logging_period epochs), so in that case no checkpoint is ever written and loading the "best" checkpoint fails with the FileNotFoundError above. I have changed the behavior so that a checkpoint is saved at the last iteration as well as at every multiple of logging_period before it. This guarantees that at least one checkpoint is written.
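For concreteness, a minimal sketch of that schedule follows. The function and argument names (fit_with_periodic_checkpoints, train_step, evaluate, ckpt_path) are illustrative placeholders, not the actual tabzilla/NODE trainer API; only torch.save is a real PyTorch call.

import torch

def fit_with_periodic_checkpoints(model, train_step, evaluate,
                                  num_epochs, logging_period, ckpt_path):
    # Checkpoint at every multiple of logging_period AND at the final
    # epoch, so at least one checkpoint exists even when
    # logging_period > num_epochs.
    best_val_loss = float("inf")
    for epoch in range(1, num_epochs + 1):
        train_step(model)  # one training pass (placeholder)
        if epoch % logging_period == 0 or epoch == num_epochs:
            val_loss = evaluate(model)  # validation pass (placeholder)
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                torch.save(model.state_dict(), ckpt_path)
    return best_val_loss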

However, we need to be careful when selecting hyperparameters for this method. If logging_period is too high, the validation loss is tracked infrequently and we run a greater risk of keeping an overfit model. If it is too low, too many checkpoints are saved, which can cause storage issues. (We could address the latter separately by having the trainer delete old checkpoints.)
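One possible way to bound storage is sketched below. The checkpoint_*.pth pattern and the checkpoint_best.pth name are assumptions taken from the path in the traceback, not necessarily the trainer's exact naming scheme.

import glob
import os

def prune_checkpoints(log_dir, keep=3):
    # Delete all but the `keep` most recent periodic checkpoints,
    # never touching the best-model checkpoint.
    ckpts = [
        p for p in glob.glob(os.path.join(log_dir, "checkpoint_*.pth"))
        if not p.endswith("checkpoint_best.pth")
    ]
    for path in sorted(ckpts, key=os.path.getmtime)[:-keep]:
        os.remove(path)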

While fixing this issue, I also ran into problems related to #27 and implemented a fix for that in the latest commit as well. With both fixes in place, I am able to run a full trial for NODE on openml__arrhythmia__5 without a problem.

jonathan-valverde-l · Sep 30, 2022

Have you checked whether creating the logs directory manually solves the issue?

suj97 · Oct 3, 2022

Oh, never mind, looks like it's fixed.

suj97 · Oct 3, 2022

@jonathan-valverde-l resolved this with a previous commit.

duncanmcelfresh · Oct 12, 2022

NODE still needs to be tested.

duncanmcelfresh · Oct 17, 2022