NODE path error
NODE fails while loading the best checkpoint at the end of training. The checkpoint file does not exist:
Traceback (most recent call last):
File "/home/shared/tabzilla/TabSurvey/tabzilla_experiment.py", line 136, in __call__
result = cross_validation(model, self.dataset, self.time_limit)
File "/home/shared/tabzilla/TabSurvey/tabzilla_utils.py", line 237, in cross_validation
loss_history, val_loss_history = curr_model.fit(
File "/home/shared/tabzilla/TabSurvey/models/node.py", line 174, in fit
self.trainer.load_checkpoint(tag="best")
File "/home/shared/tabzilla/TabSurvey/models/node_lib/trainer.py", line 73, in load_checkpoint
checkpoint = torch.load(path)
File "/opt/conda/envs/torch/lib/python3.10/site-packages/torch/serialization.py", line 699, in load
with _open_file_like(f, 'rb') as opened_file:
File "/opt/conda/envs/torch/lib/python3.10/site-packages/torch/serialization.py", line 231, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/opt/conda/envs/torch/lib/python3.10/site-packages/torch/serialization.py", line 212, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'logs/openml__arrhythmia__5_2022.09.16_20:21:07/checkpoint_best.pth'
The issue seems to occur when logging_period is greater than the number of epochs. Checkpointing was only done periodically (every logging_period), so in that case no checkpoint was ever written and checkpoint_best.pth did not exist when fit tried to load it. I have changed the behavior so that checkpointing occurs at every multiple of logging_period and also at the last iteration. This ensures that at least one checkpoint is written.
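For reference, the checkpointing condition described above can be sketched as follows. This is a minimal illustration, not the code from the commit: the names `should_checkpoint`, `checkpoint_steps`, `step`, and `total_steps` are hypothetical, and it assumes iterations are counted from 1.

```python
def should_checkpoint(step: int, total_steps: int, logging_period: int) -> bool:
    """Return True at every multiple of logging_period and at the final step."""
    if logging_period <= 0:
        raise ValueError("logging_period must be positive")
    return step % logging_period == 0 or step == total_steps


def checkpoint_steps(total_steps: int, logging_period: int) -> list[int]:
    """List the 1-indexed steps at which a checkpoint would be written."""
    return [
        step
        for step in range(1, total_steps + 1)
        if should_checkpoint(step, total_steps, logging_period)
    ]
```

With `total_steps=30` and `logging_period=100`, the old rule (multiples only) would never fire, while this rule still checkpoints once, at step 30.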
However, we need to be careful when selecting hyperparameters for this method. If logging_period is too high, the validation loss is not tracked often, and we run a greater risk of overfitting. If it is too low, too many checkpoints are saved, which could cause storage issues. (We can fix this separately by having the model delete old checkpoints.)
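One possible way to delete old checkpoints is sketched below. This is only an assumption about how it could look, not something in the repo: `prune_checkpoints`, `keep_last`, and `protected` are hypothetical names, and the `checkpoint_*.pth` pattern is guessed from the `checkpoint_best.pth` file name in the traceback.

```python
from pathlib import Path


def prune_checkpoints(log_dir, keep_last=3, protected=("checkpoint_best.pth",)):
    """Delete all but the `keep_last` newest checkpoints; never delete protected files."""
    # Assumed naming scheme: every checkpoint matches checkpoint_*.pth.
    candidates = [
        path
        for path in Path(log_dir).glob("checkpoint_*.pth")
        if path.name not in protected
    ]
    # Sort newest first by modification time.
    candidates.sort(key=lambda path: path.stat().st_mtime, reverse=True)

    removed = []
    for stale in candidates[keep_last:]:
        stale.unlink()
        removed.append(stale.name)
    return removed
```

Calling this after each save would keep disk usage bounded even with a small logging_period, while the best checkpoint is always kept for the final `load_checkpoint(tag="best")`.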
While fixing this issue, I also ran into issues related to #27, and I have implemented a fix for that in the latest commit as well. After fixing these two issues, I am able to run a full trial for NODE on openml__arrhythmia__5 without a problem.
Have you checked whether creating the logs directory manually solves the issue?
oh nvm, looks like it's fixed.
@jonathan-valverde-l resolved this with a previous commit
NODE needs to be tested