Overall loop for training a deep net for molecules here
We have two desiderata:
- We want to be able to learn a network which regresses to measurements given a structure as input.
- We may want to pretrain parts of that network (i.e. the molecular representation part) with existing molecular data in order to get some knowledge into the model about what molecular structures exist.
We can decompose those two tasks as follows:
We want to have a model of a representation, P(h|x), which predicts hidden features h from a molecular graph x; ideally a joint model P(h, x) that is a joint density with arbitrary conditioning.
We furthermore want a model of the measurements m we care about given a molecular representation h, expressed as P(m|h). In simple regression this could be a probabilistic linear model on top of the outputs of the representation model P(h|x).
We may want to train them both jointly, separately, or in phases.
If we pre-train P(h|x) on unlabeled data, that is a form of semi-supervised learning.
In a training loop to solve the task of regressing m from a training set D_t = {X_t, M_t}, we may want to account for having access to a background dataset D_b = {X_b}, which has molecular graphs but no measurements.
The desired training loop now allows us to potentially pre-train or jointly train a model which can learn from both sources of data.
Our targeted output is a model P(m|x) = ∫_h P(m|h) P(h|x) dh that is applied to a test set and works much better after having ingested all the data available to us.
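To make this concrete, here is a minimal sketch (in PyTorch, with made-up module names, and with a point estimate of h standing in for the integral) of how the two pieces could be composed:

```python
import torch
import torch.nn as nn


class Representation(nn.Module):
    """P(h|x): maps a featurized molecular graph x to hidden features h."""

    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        # stand-in for a graph net; any encoder producing a fixed-size h works
        self.net = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())

    def forward(self, x):
        return self.net(x)


class MeasurementModel(nn.Module):
    """P(m|h): a probabilistic regression head on top of h."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.loc = nn.Linear(hidden_dim, 1)
        self.log_scale = nn.Linear(hidden_dim, 1)

    def forward(self, h):
        return torch.distributions.Normal(self.loc(h), self.log_scale(h).exp())


def predict(representation, measurement_model, x):
    """P(m|x): with a deterministic h = f(x), the integral collapses to P(m|h=f(x))."""
    return measurement_model(representation(x))
```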
In this issue/thread, I suggest we link to the code and discuss how to create this loop, based on a concrete application example with molecules.
Missing pieces:
- [x] details of the API for training
- [x] details of the API for testing
- [ ] metrics for testing
- [ ] ....
@yuanqing-wang Can you please comment on how this matches your thoughts and if not what we should change in the overall desiderata?
Then we can talk about how you intend to or already have structured this.
@karalets
I like your idea of the overall structure. Separating x -> h and h -> m sounds like a reasonable thing to do since we can play with each part afterwards.
Meanwhile for training and testing I guess we only need something as simple as functions that take models, weights, and return metrics. I'll put stuff here: https://github.com/choderalab/pinot/blob/master/pinot/app/utils.py to incorporate both x -> h and h -> m.
@yuanqing-wang what do you mean by weights? I would abstract away from weights and think about model parameters or model posteriors quite independently; to a first approximation, it should be up to the model class to decide what it wants to serialize.
And in utils I don't see the concrete link to dataset creation, i.e. a 20% split or whatnot. I suggest also accounting for: a validation set, data loaders, an interface for passing an arbitrary model class with a defined API into the trainer/tester, ...
I would prefer that function to become a component with a specified API for this problem: we pass in models that conform to the API, push a button, and get back a few metrics, e.g. test log likelihood.
Could you build a loop and an experiment file which, for the simplest off-the-shelf model, does that and runs the entire thing?
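For instance, such a push-button tester could be as simple as the following sketch (I'm assuming the model exposes a condition(g) method returning a predictive distribution; not necessarily pinot's current API):

```python
import torch


def test(model, data_loader):
    """Evaluate a trained model on held-out (graph, measurement) pairs and return metrics."""
    model.eval()
    total_log_prob, total_squared_error, n = 0.0, 0.0, 0
    with torch.no_grad():
        for g, y in data_loader:
            distribution = model.condition(g)  # assumed predictive distribution P(m|x)
            total_log_prob += distribution.log_prob(y).sum().item()
            total_squared_error += (distribution.mean - y).pow(2).sum().item()
            n += y.shape[0]
    return {
        "test_log_likelihood": total_log_prob / n,
        "rmse": (total_squared_error / n) ** 0.5,
    }
```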
I.e. it would be good if we get to an abstraction that allows us to define an experiment as follows, or similar; my main point is to modularize heavily.
```python
def experiment_1(args):
    model = Model1
    dataset_train = ...
    dataset_background = ...
    hyperparameters = args....
    out_path = args.experiment_path

    # if this is semi-supervised, do this;
    # if it were not semi-supervised, there could also be a run_experiment(...)
    # that only does the other stuff, or so
    results = run_ss_experiment(...)
    plot_figures(results, out_path)
```
And in order to test that, there should be, from the beginning, a concrete instance of such an experiment that one can run.
The data utils are supplied separately here: https://github.com/choderalab/pinot/blob/master/pinot/data/utils.py
Cool, can we have an experiment file that brings everything together and executes a full example of it all, similarly to what I described above?
Working on it
@karalets
I like your idea of the overall structure. Separating x -> h and h -> m sounds like a reasonable thing to do since we can play with each part afterwards.
Meanwhile for training and testing I guess we only need something as simple as functions that take models, weights, and return metrics. I'll put stuff here: https://github.com/choderalab/pinot/blob/master/pinot/app/utils.py to incorporate both x -> h and h -> m.
Just to clarify: I do suggest separating them not necessarily in the model, but rather accounting for the existence of both, so that maybe they are trained separately, maybe jointly, but in any case they need to have a consistent API for the data each part needs to see. In fact, I believe building both into a joint model will work best, but we still need to have datasets in there that can supervise each aspect.
Consider this an instance of data oriented programming, rather than a deep learning model with different phases.
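As a sketch of what I mean by a consistent data API (names hypothetical; assume graphs_train, measurements_train, and graphs_background are already loaded): both sources yield (graph, measurement) pairs, with the measurement set to None for background molecules, so the same trainer can consume either.

```python
from torch.utils.data import Dataset


class MoleculeDataset(Dataset):
    """Graphs with optional measurements under one interface."""

    def __init__(self, graphs, measurements=None):
        self.graphs = graphs
        self.measurements = measurements  # None for background (unlabeled) data

    def __len__(self):
        return len(self.graphs)

    def __getitem__(self, idx):
        y = None if self.measurements is None else self.measurements[idx]
        return self.graphs[idx], y


# labeled training data and background data share the same API
labeled = MoleculeDataset(graphs_train, measurements_train)
background = MoleculeDataset(graphs_background)
```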
@karalets
I incorporated your idea here: https://github.com/choderalab/pinot/blob/master/pinot/net.py
and the training pipeline now looks like https://github.com/choderalab/pinot/blob/master/pinot/app/train.py
Let me know your thoughts on this.
> @karalets
> I incorporated your idea here: https://github.com/choderalab/pinot/blob/master/pinot/net.py
> and the training pipeline now looks like https://github.com/choderalab/pinot/blob/master/pinot/app/train.py
> Let me know your thoughts on this.
Great start!
In the Net class I would make the representation and parametrization objects concrete. I.e. you can play the inheritance game and create explicit classes that inherit from Net and have a concrete form. Else you do not win much here. I would also suggest not calling the top layer parametrization, but rather something like regressor_layer or measurement_model, as opposed to the other component, for which the more lucid representation_layer or representation that you currently use works well; parametrization is pretty misleading as a name.
Regarding the loop: I would still recommend factoring out an experiment class which has some more modularity.
I.e. in your current loop you do a lot of things in one larger script: defining the model layers, building the model, training, etc... In a better universe training and experiment setup are factorized out.
Currently, also, unlike the suggestion above, you do not have the potential for semi-supervised learning in there even if you wanted to use it.
Think about wanting to define an experiment which can differ in the following ways:
- use 20% more training data, but the same settings otherwise
- use or do not use semi-supervised data, same otherwise
- use a particular semi-supervised background dataset or another one, but the same main training set
- try the same data settings but different models
- play with hyperparameter selection for each experiment
- get new metrics for all of the versions of the above when you have pre-trained models lying around
- have new test data that may arrive
- think about a joint model over representation and regression vs a disjoint model, how can you still do all you want?
- ...
Your experiment runner, trainer, etc. should make such changes easy and clear; I suggest you think backwards from the results you anticipate wanting to be able to get, to the structure here.
As I said, I recommend factoring things out a bit more than you have, but this is surely a good direction.
One can also factor the loop out into:
- experiment files contain all the settings (model setting, data settings, model hyper-parameters, storage paths and names for relevant output files) and receive inputs from args
- trainer files receive experiment files as args and produce trained objects according to settings
- tester files run pre-trained objects on test data and run eval methods
- eval methods receive metrics and predictions according to some API and do stuff that generates numbers
- plotting methods visualize eval
We can improve on this I am sure, but I would imagine making this modular will very quickly yield benefits.
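A rough sketch of that factorization (all names hypothetical): the experiment definition is a plain, auditable record of settings, and the trainer/tester/eval/plotting stages are separate pieces that only communicate through it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExperimentConfig:
    """Auditable record of all settings; vary one attribute per experiment."""
    model_name: str
    train_fraction: float              # e.g. 0.8 vs 1.0 of the labeled data
    background_dataset: Optional[str]  # None disables the semi-supervised part
    hyperparameters: dict
    out_path: str


def train(config):                    # -> trained model, stored under config.out_path
    ...

def test(trained_model, config):      # -> predictions on held-out data
    ...

def evaluate(predictions, config):    # -> dict of metrics (log likelihood, RMSE, ...)
    ...

def plot(metrics, config):            # -> figures written to config.out_path
    ...

def run_experiment(config):
    model = train(config)
    metrics = evaluate(test(model, config), config)
    plot(metrics, config)
    return metrics
```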
One cool example of factorization is the dataloaders etc. in pytorch:
https://pytorch.org/docs/stable/data.html
You can define in separate classes things like:
- the dataset
- the normalization/preprocessing strategy
- the properties of a dataloader which receives dataset class and preprocessing class as inputs
Then objects like this dataloader are passed to trainer classes which tie this to models and deliver batches for training. The dataloader class can be kept invariant to compare all kinds of models while having an auditable 'version' of the training data and pre-processing. In our case, I would like the experimental setup and choices to be auditable by being stored in some experiment definition which can be changed in its attributes for comparing different experiments.
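For reference, here is a minimal version of that pattern with plain torch.utils.data (generic tensors rather than molecular graphs, just to show the separation of concerns):

```python
import torch
from torch.utils.data import Dataset, DataLoader


class Standardize:
    """Preprocessing strategy kept separate from the data itself."""

    def __init__(self, mean, std):
        self.mean, self.std = mean, std

    def __call__(self, x):
        return (x - self.mean) / self.std


class TensorPairDataset(Dataset):
    """Holds the raw data; the transform is injected from outside."""

    def __init__(self, xs, ys, transform=None):
        self.xs, self.ys, self.transform = xs, ys, transform

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, idx):
        x = self.xs[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.ys[idx]


# the dataloader ties dataset + preprocessing together and is what trainers see
xs, ys = torch.randn(100, 16), torch.randn(100, 1)
dataset = TensorPairDataset(xs, ys, transform=Standardize(xs.mean(0), xs.std(0)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)
```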
If you prefer not to use as much bespoke pytorch, that is fine; I am just suggesting looking at examples of how modern ML software handles separation of concerns.
> In the Net class I would make the representation and parametrization objects concrete. I.e. you can play the inheritance game and create explicit classes that inherit from Net and have a concrete form. Else you do not win much here.
Not sure if I followed. The objects are taken as parameters here.
I'll further factorize the experiment
> In the Net class I would make the representation and parametrization objects concrete. I.e. you can play the inheritance game and create explicit classes that inherit from Net and have a concrete form. Else you do not win much here.
> Not sure if I followed. The objects are taken as parameters here.
Yes, the objects are parameters and that is very nice and would already do if the experiment file factors things sufficiently. An option would be to just create, for each combination of objects, a particular subclass, as other things may also change.
But that is unnecessary for now as we can do all of that later, I am ok with it.
@karalets
Would something like this be a bit better? https://github.com/choderalab/pinot/blob/master/pinot/app/train.py
I am still unsure if you can do the cases described below.
> Think about wanting to define an experiment which can differ in the following ways:
> - use 20% more training data, but the same settings otherwise
> - use or do not use semi-supervised data, same otherwise
> - use a particular semi-supervised background dataset or another one, but the same main training set
> - try the same data settings but different models
> - play with hyperparameter selection for each experiment
> - get new metrics for all of the versions of the above when you have pre-trained models lying around
> - have new test data that may arrive
> - think about a joint model over representation and regression vs a disjoint model, how can you still do all you want?
> - ...
These could be done by simply changing some args in the script:
- use 20% more training data, but the same settings otherwise
- try the same data settings but different models
- play with hyperparameter selection for each experiment
The rest can be done by using the APIs, but with small twists in the scripts.
Ok, could you run a test-playthrough with an off-the-shelf semi-supervised model, i.e. the one from the paper?
Semi-supervised learning has not been implemented yet. Should that be our next step?
I believe it serves to make the pipeline more complete, and step 1 should be to have a robust skeleton of the pipeline and examples of the types of workflows we may need.
I think you will understand my asks for more modularization a bit better when you build semi-supervised in there.
Thus: yes, let's proceed to having an example of SS training.
Ideally you could make two examples: one with and one without SS aspects, both using the same training data and as much of the same infrastructure as possible. I.e. ideally the differences only live in the arguments passed to the experiment code.
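Concretely, reusing the hypothetical ExperimentConfig/run_experiment sketch from above, the two examples would then differ only in the arguments:

```python
# supervised-only run: no background data
run_experiment(ExperimentConfig(
    model_name="gcn_gaussian",
    train_fraction=0.8,
    background_dataset=None,
    hyperparameters={"lr": 1e-3, "n_epochs": 100},
    out_path="results/supervised",
))

# semi-supervised run: identical except for the background dataset
run_experiment(ExperimentConfig(
    model_name="gcn_gaussian",
    train_fraction=0.8,
    background_dataset="background_molecules",
    hyperparameters={"lr": 1e-3, "n_epochs": 100},
    out_path="results/semi_supervised",
))
```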
Hey @yuanqing-wang , do we have at this point a little toy/sandbox example that one could test and run on a laptop in a closed loop? I'd like to play with some of the problems with NN training in a toy example that is easy to re-run.
Not quite yet, I think. We have the beginnings of this, but I think we're hoping @dnguyen1196 can dive in and get this part going!
I am tagging @dnguyen1196 here to read through the beginning as this issue explains a lot of what is going on here.
@karalets @yuanqing-wang
To recap and please correct me, it seems that the goals when this issue was created were:
- Implement functionalities that can do testing with detailed specifications (https://github.com/choderalab/pinot/issues/3#issuecomment-605695377) and the current implementation covers some basic requirements.
- Per this issue, one major change we might want to make to the current implementation is the ability to more deeply separate parameterization and representation (because it seems at the moment they are jointly trained). We want functionalities such as pre-training representations, combining fixed representations with a trainable parameterization, pretrained representations with a trainable parameterization, etc.
So within this issue, perhaps two subtasks remain:
- Add more fine-grain testing capabilities to the current experiment infrastructure
- More cleanly separate between parameterization and representation.
Hey,
So you understand the issue here quite well. There are some subtleties with respect to how to specify remaining subtasks.
> So within this issue, perhaps two subtasks remain:
> - Add more fine-grain testing capabilities to the current experiment infrastructure (https://github.com/choderalab/pinot/blob/master/pinot/app/experiment.py)

Absolutely correct. We need to be able, as I have described above, to add "background" data to inform the representation and train the whole thing nicely together. In addition, I would argue, as mentioned in issue #26, that we should also first individually test components that would do unsupervised or self-supervised learning to learn representations, so we can target a reasonable set of things to plug in here. However, in the literature this is often also treated as a joint training process over a graphical model which has more or less evidence at some of the involved variables; see for instance https://arxiv.org/abs/1406.5298 and newer literature along those lines, https://arxiv.org/abs/1706.00400 .
> - More cleanly separate between parameterization and representation.
I would not go that route quite yet, I would prefer to be agnostic if the model makes these things communicate uncertainty or not. There may be model classes that have their own way of incorporating one or more variables.
Imagine you have a net class which has a method net.train(X, Y) and, when you set Y=None, it just updates the parts it needs.
Another model may really be to hackily pre-train two separate objectives, one just for the representation and one for the measurement term, which are then plugged together correctly according to the degree of supervision in the observed tuple.
The shared API in the infrastructure should make both types of workflows usable, so I would focus on that API and infrastructure first, with a concrete example with real data.
I envision that first pre-training some representation based on background data and then finetuning it on labeled data is ok as a start, but keep in mind we may want to train jointly later with a more rigorous treatment of semi-supervised learning.
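A minimal sketch of that two-phase workflow (assuming the model exposes a hypothetical unsupervised_loss(g) for graphs without measurements next to the existing loss(g, y), and that loaders for background and labeled data are available):

```python
import torch


def pretrain_then_finetune(model, background_loader, labeled_loader,
                           n_pretrain_epochs=50, n_finetune_epochs=100, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # phase 1: shape the representation using background graphs only
    for _ in range(n_pretrain_epochs):
        for g in background_loader:
            optimizer.zero_grad()
            model.unsupervised_loss(g).mean().backward()  # hypothetical method
            optimizer.step()

    # phase 2: fine-tune representation + measurement model on labeled pairs
    for _ in range(n_finetune_epochs):
        for g, y in labeled_loader:
            optimizer.zero_grad()
            model.loss(g, y).mean().backward()
            optimizer.step()

    return model
```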
We should discuss and iterate on a concrete version of this more, but we also need a separate process to just evaluate the different unsupervised models as mentioned in #26 .
> Absolutely correct. We need to be able, as I have described above, to add "background" data to inform the representation and train the whole thing nicely together.

What do you mean by this, @karalets? Is the following interpretation correct? For example, say we have 1000 compounds with their associated properties. We actually first use this as "background" data where we train, for example, an unsupervised representation so that we get a "reasonable" representation first (and do not touch the parameterization). And then, after we have obtained this reasonable representation, we train both the representation and parameterization jointly on the prediction task (supervised).
> Absolutely correct. We need to be able, as I have described above, to add "background" data to inform the representation and train the whole thing nicely together.
> What do you mean by this, @karalets? Is the following interpretation correct? For example, say we have 1000 compounds with their associated properties. We actually first use this as "background" data where we train, for example, an unsupervised representation so that we get a "reasonable" representation first (and do not touch the parameterization). And then, after we have obtained this reasonable representation, we train both the representation and parameterization jointly on the prediction task (supervised).
Sorry, to be precise: By "background data" I mean data for which we only have graphs, not the measurements/properties, i.e. background molecules that are not the data we are collecting measurements for, but we know exist as molecules.
Intuitively: we need graphs to train "representations", and matched 'measurements' to train likelihoods/observation terms ("parametrizations", although I prefer to fade this term out).
In my world we could consider all of this to be training data, but sometimes we only observe X, and sometimes we observe the tuple X, Y, to train our models, and we want to make the best of both.
@karalets @yuanqing-wang
> Intuitively: we need graphs to train "representations", and matched 'measurements' to train likelihoods/observation terms ("parametrizations", although I prefer to fade this term out). In my world we could consider all of this to be training data, but sometimes we only observe X, and sometimes we observe the tuple X, Y, to train our models, and we want to make the best of both.
Ok I see your point now. In that regard, I think we might need to modify two interfaces, let me know what you think and if I should start a new issue/discussion on this.
- `Net`: right now, `net.loss(g, y)` takes in two arguments:

```python
def loss(self, g, y):
    distribution = self.condition(g)
    return -distribution.log_prob(y)
```
So we can modify this function so that, when `y=None`, we only compute a "loss" for the representation layer.
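That change could look roughly like this; the representation-level term is a placeholder for whatever unsupervised objective we settle on (e.g. one of the candidates from #26):

```python
def loss(self, g, y=None):
    if y is None:
        # background data: only a representation-level (unsupervised) objective
        return self.representation_loss(g)  # hypothetical, e.g. a reconstruction or VAE term
    distribution = self.condition(g)
    return -distribution.log_prob(y)
```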
- For the `experiment.py` interface, I think we have two options:
2a. Add TrainUnsupervised, TestUnsupervised, etc. (basically, for every current supervised training/testing class, we need a corresponding class for unsupervised training). This will probably repeat a lot of code, but supervised and unsupervised training will involve different optimizers and potentially very different choices of hyperparameters. If we have separate unsupervised and supervised classes, we can then have another class that combines the supervised and unsupervised components together.
2b. Modify the current Train and Test classes so that they accommodate both unsupervised and supervised training. This will involve modifying the current constructor to take in more arguments (an optimizer for unsupervised vs. supervised training, hyperparameters for unsupervised training). And within the class implementation, more care is needed to make sure the training/testing steps are in the right order.
I think 2a is better; although we repeat more code, the modularity allows us to do more fine-grained training/testing.
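For 2a, the combining class could stay fairly thin, something like this (class names hypothetical, mirroring the existing Train/Test naming):

```python
class SemiSupervisedTrain:
    """Run an unsupervised phase on background data, then supervised training."""

    def __init__(self, net, unsupervised_train, supervised_train):
        self.net = net
        self.unsupervised_train = unsupervised_train  # e.g. a TrainUnsupervised instance
        self.supervised_train = supervised_train      # the existing supervised Train instance

    def train(self):
        self.unsupervised_train.train()  # its own optimizer and hyperparameters
        self.supervised_train.train()    # likewise, possibly very different ones
        return self.net
```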