Model versioning
Currently, deepnog ships one model per eggNOG level and network architecture. If we ever retrain certain models, users must devise their own strategies to tell models apart or to pin a specific model (e.g., for reproducibility), such as manually moving or renaming files. Retraining can nevertheless make sense: for example, we might want to use different data splits, or increase the share of training sequences relative to test sequences to squeeze a little more performance out of the model.
We should at least introduce versioning, model identifiers, etc., that are stored with the model. This could be a simple string inside the model_dict, and could even be "backported" to existing models.
Ideally, automatic model download should also be version-aware. Currently, a user who has already downloaded a model will never receive an updated model.
To summarize some key points of the recent discussion:
Models will receive a metadata field that holds the following information,
- UUID as model identifier
- Date & timestamp of training
- training params (incl. learning rate, scheduler, number of epochs, etc.)
- Orthology DB name
- Taxonomic level in DB
- metadata format version (v1 for now, v2 if this ever needs to be extended)

Technically, this can be implemented as a dict that is serialized into the .pth model file. This can be backported to old models.
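A minimal sketch of such a metadata dict, assuming hypothetical field names and training parameters (the final schema is up for discussion; in deepnog the dict would be stored alongside the weights via torch.save):

```python
import uuid
from datetime import datetime, timezone

def build_model_metadata(database, level, training_params):
    """Assemble the v1 metadata dict stored with a model.

    Hypothetical helper; field names are a suggestion, not a final schema.
    """
    return {
        "metadata_version": "v1",       # bump to v2 if the schema is extended
        "model_id": str(uuid.uuid4()),  # unique identifier per trained model
        "trained_at": datetime.now(timezone.utc).isoformat(),
        "database": database,           # orthology DB name
        "level": level,                 # taxonomic level within the DB
        "training_params": training_params,
    }

metadata = build_model_metadata(
    database="eggNOG 5",
    level="2",
    training_params={"learning_rate": 1e-2, "scheduler": "step", "epochs": 15},
)
# In deepnog, this dict would be merged into the dict passed to torch.save,
# e.g. torch.save({"model_state_dict": ..., **metadata}, path) -- which is
# also how it can be backported into existing .pth files.
```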
Model filenames obtain a version hint, e.g., the date or v1, v2, etc., plus a "latest" pointer to the most up-to-date version.
The client subcommand deepnog infer will use a use_latest boolean flag to select the latest model (otherwise, the one currently installed). A warning/info message could be issued to users when newer models are available.
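The selection logic above could look roughly like the following sketch. The helper name and the integer version comparison (v1 → 1, v2 → 2) are assumptions, not deepnog's actual API:

```python
def select_model(installed_version, remote_latest_version, use_latest):
    """Decide which model version deepnog infer should load.

    Hypothetical helper illustrating the proposed use_latest behavior.
    Returns a (version_to_use, warning_or_None) tuple.
    """
    if installed_version is None:
        # Nothing cached yet: fetch the latest model unconditionally.
        return remote_latest_version, None
    if remote_latest_version > installed_version:
        if use_latest:
            # User opted in to upgrades: download and use the new model.
            return remote_latest_version, None
        # Stick with the installed model, but inform the user.
        return installed_version, (
            f"A newer model (v{remote_latest_version}) is available; "
            f"currently using v{installed_version}."
        )
    return installed_version, None

version, warning = select_model(installed_version=1,
                                remote_latest_version=2,
                                use_latest=False)
# → version == 1, with a warning pointing at v2
```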