
Avoiding duplicated computations by having a single observable model

Open APJansen opened this issue 2 years ago • 34 comments

Goal

The goal of this PR is to speed up the code by a factor of 2 through a refactoring that avoids redoing the same computations. Currently there are separate training and validation models. At every training step the validation model is run from scratch on the x inputs, while its only difference from the training model is the final masking just before the loss is computed.

This will hopefully also improve readability. From an ML point of view the current naming is very confusing. Instead of having a training model and a validation model, we can have a single observable model, and on top of that a training and validation loss. (Just talking about names here, they may still be MetaModels)

The same holds of course for the experimental model, except that there is no significant performance cost there. But for consistency and readability let's try to treat that on the same footing.

This PR branches off of trvl-mask-layers because that PR changes the masking. That one should be merged before this one.

Current implementation

Models creation

The models are constructed in ModelTrainer._model_generation, specifically in the function _pdf_injection, which is given the pdfs, a list of observables and a corresponding list of masks. For the different "models", not only the values of the masks but also the list of observables changes, as not all models use all observables, in particular the positivity and integrability ones. This function just calls the observables on the pdfs with the mask as argument. And each observable's call method, defined here, does two steps: 1. compute the observable, 2. apply the mask and compute the loss.
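Schematically, the current pattern amounts to something like this (a minimal sketch with made-up names and shapes, not the actual ObservableWrapper code):

import tensorflow as tf

class MaskedObservable(tf.keras.layers.Layer):
    # illustrative stand-in for the current pattern: one layer that both computes
    # the observable and produces the masked loss
    def __init__(self, fktable, invcovmat, **kwargs):
        super().__init__(**kwargs)
        self.fktable = tf.constant(fktable, dtype=tf.float32)      # stand-in for the FK table
        self.invcovmat = tf.constant(invcovmat, dtype=tf.float32)  # assumed already reduced to the masked points

    def call(self, pdf, mask=None):
        # step 1: compute the observable (a plain matrix-vector product stands in
        # for the real convolution; single pdf vector, no batch dimension)
        observable = tf.linalg.matvec(self.fktable, pdf)
        # step 2: apply the mask and compute the loss in the same call
        if mask is not None:
            observable = tf.boolean_mask(observable, mask)
        return tf.einsum("i,ij,j->", observable, self.invcovmat, observable)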

Models usage

Once they are created, the training model is, obviously, used for training here. The validation model is used to initialize the Stopping object; the only thing that happens there is that its compute_losses method is called. Similarly for the experimental model, which is called directly in the ModelTrainer (here).

Changes proposed

  1. Decouple the masking and loss computation from the ObservableWrapper class. Just remove those parts from ObservableWrapper, and perhaps create an ObservableLoss layer that does this.
  2. Apply this pure observable class to the pdfs, for all observables, to create an observables_model.
  3. Create 3 loss models that take as input all observables, and do the selection of observables, the masking, and the computation of losses.
  4. For the training one, put it on top of the observables_model, to create a model identical to the current training model.
  5. Add the output of the observables_model to the output list of this training model, so these can be reused.
  6. The validation and experimental models can be discarded; instead we have validation and experimental losses that are applied to the output of the observables_model. So e.g. we can replace self.experimental["model"].compute_losses() with experimental_loss(observables). A rough sketch of this structure follows below.
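Schematically (a toy sketch: all names, shapes and the "chi2" stand-ins are made up for illustration and are not the actual n3fit objects):

import tensorflow as tf

x = tf.keras.Input(shape=(3,), name="x")
pdf = tf.keras.layers.Dense(8, name="pdf")(x)               # stand-in for the PDF network

# points 1 and 2: pure observable layers, no masking and no loss inside
obs = tf.keras.layers.Dense(4, name="observable")(pdf)      # stand-in for one observable
observables_model = tf.keras.Model(x, obs, name="observables_model")

# point 3: loss layers on top of the unmasked observables (mask + chi2-like loss)
def masked_loss(mask):
    return tf.keras.layers.Lambda(
        lambda o: tf.reduce_sum((o * mask) ** 2, axis=-1, keepdims=True)
    )

tr_mask = tf.constant([1.0, 1.0, 0.0, 0.0])
vl_mask = tf.constant([0.0, 0.0, 1.0, 1.0])

# points 4 and 5: the training model is the training loss on top of the observables,
# with the observables kept as extra outputs so they can be reused
training_model = tf.keras.Model(x, [masked_loss(tr_mask)(obs), obs], name="training_model")

# point 6: no separate validation/experimental models, just losses applied to the reused
# observables, e.g. validation_chi2 = validation_loss(observables_model(x_grid))
validation_loss = masked_loss(vl_mask)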

APJansen avatar Nov 20 '23 12:11 APJansen

I'm looking into the first point, decoupling the computation of the observables from their masking and loss. Some questions for @goord:

Currently in _generate_experimental_layer this happens:

  1. observables computed
  2. masks applied one by one
  3. masked observables concatenated
  4. some rotation
  5. loss computed

What does this rotation do?

And is it possible to change this to the following (at the cost of concatenating masks inside observable_generator)? Here:

  1. observables computed
  2. UNmasked observables concatenated
  3. (rotation?)

subsequent loss layer:

  1. masks applied to concatenated observables
  2. loss computed

Also, I've probably seen this before, but I'm still confused about why a mask is both applied directly to the observables in _generate_experimental_layer and also applied inside the LossInvcovmat itself?

APJansen avatar Nov 27 '23 15:11 APJansen

Yes, these rotations are triggered by the 'data_transformation_tr', which is used if you represent the experimental data in a covariance-diagonal basis, I guess. I'm not sure when this is actually used, and I'm not sure whether this code path is properly tested in the trvl-mask-layers branch...

goord avatar Nov 27 '23 15:11 goord

The mask in LossInvCovmat is not used for masking training/validation I think.

goord avatar Nov 27 '23 15:11 goord

@goord Why does the experimental output have rotation=None while the others have rotation=obsrot, around here? Is that intentional? It interferes a bit with how I thought the observable computation would decouple from the masked loss.

APJansen avatar Jan 12 '24 08:01 APJansen

@goord Why does the experimental output have rotation=None while the others have rotation=obsrot, around here? Is that intentional? It interferes a bit with how I thought the observable computation would decouple from the masked loss.

This is a rewrite of line 289 in master. I don't know why the diagonal basis is not used for the experimental output layer; perhaps @scarlehoff or @RoyStegeman can explain it to us.

If you look at n3fit_data.py you can see that in the diagonal basis, the training and validation covmats are being masked and then inverted, but the full covmat inverse (inv_true) is computed in the old basis.

goord avatar Jan 12 '24 08:01 goord

Because when they were separated it didn't really matter, and it is decoupled from training/validation (the idea of diagonalising is to be able to do the split while removing the correlations within a dataset between training and validation).

scarlehoff avatar Jan 12 '24 09:01 scarlehoff

Hm I don't fully understand, but is it ok to uniformize this? I now calculate all observables without any mask once, so using the same settings, and then mask the folds and the tr/val split afterwards. It's passing all the tests and also giving identical results for the main runcard.

APJansen avatar Jan 12 '24 09:01 APJansen

Hm I don't fully understand, but is it ok to uniformize this? I now calculate all observables without any mask once, so using the same settings, and then mask the folds and the tr/val split afterwards. It's passing all the tests and also giving identical results for the main runcard.

You can try the diag-DIS runcard to check the observable rotation: DIS_diagonal_l2reg_example.yml.txt

goord avatar Jan 12 '24 09:01 goord

Seems to work fine, and gives the same results as trvl-mask-layers.

APJansen avatar Jan 12 '24 10:01 APJansen

I don't fully understand

The chi2 does not (should not) depend on the diagonalization. Since the total covmat is only used to report the total chi2, nobody cared about diagonalising that because it was not needed.
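In formulas: writing the residuals as r = D − T and the rotation that diagonalises the covariance as C = R Λ Rᵀ,

  χ² = rᵀ C⁻¹ r = (Rᵀ r)ᵀ Λ⁻¹ (Rᵀ r),

which is the same number in either basis; the rotation only matters for how the individual points within a dataset are assigned to training or validation.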

but is it ok to uniformize this?

Yes because see above.

scarlehoff avatar Jan 12 '24 10:01 scarlehoff

Ok perfect, thanks :)

APJansen avatar Jan 12 '24 10:01 APJansen

@scarlehoff @Radonirinaunimi Positivity is included in the validation model; I remember we discussed this before, and if I remember correctly there was some disagreement on whether this was necessary or not, is that right? If I remove it, I get an error from this line, which can be fixed by changing fitstate.validation to fitstate._training, after which it runs normally (though I haven't done any comparisons).

Right now I'm thinking that, to remove the repeated calculation of observables, the easiest approach is to combine the training and validation models into one model that computes both of their losses, adding a "_tr" or "_val" postfix and filtering as appropriate when summing to get the final train/val losses. The experimental model can stay separate as the performance cost there is negligible.

Does that sound ok?

Of course it would be nicer to instead just have one model and 3 different losses, but that will take longer to implement.

APJansen avatar Jan 12 '24 12:01 APJansen

I don't understand what you mean. The easiest way looks much more complex to me since you need to filter out things and any bug there will "break" the validation.

scarlehoff avatar Jan 12 '24 12:01 scarlehoff

Also, I'm not completely sure you can achieve your goal here?

You need to compute everything twice for every epoch just the same.

scarlehoff avatar Jan 12 '24 12:01 scarlehoff

What I mean is we would have one model of the form (say we only have the DEUTERON observable) x -> pdf -> DEUTERON -> (DEUTERON_tr, DEUTERON_val), where the tuple contains two separate layers with the respective losses, and DEUTERON by itself is the full observable, without any mask (the computation up to and including that layer is the one we don't want to repeat).

This, I think, is what requires the fewest changes. I haven't worked it all out yet, but in the end the Stopping class won't need a separate validation model, and we won't need this compute_losses in MetaModel; it should just receive all the losses, both train and val, inside the history dict. And the default_loss defined above MetaModel would need to check whether the key ends in "_tr" and give 0 otherwise.
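A rough sketch of that filtering (hypothetical code, not the actual default_loss in MetaModel):

import tensorflow as tf

def make_output_loss(output_name):
    # one loss function per named output of the combined model: "_tr" outputs are
    # optimised, everything else contributes zero to the total loss
    def loss(y_true, y_pred):
        summed = tf.reduce_sum(y_pred, axis=-1)  # the layer already computed the loss values
        if output_name.endswith("_tr"):
            return summed
        return tf.zeros_like(summed)
    return loss

# usage sketch, assuming a combined model with outputs named DEUTERON_tr, DEUTERON_val, ...:
# combined_model.compile(
#     optimizer="adam",
#     loss={name: make_output_loss(name) for name in combined_model.output_names},
# )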

APJansen avatar Jan 12 '24 12:01 APJansen

In this PR I've already decoupled the computation of the observable from the masking+loss, that was quite simple and gives identical results. The tricky part is how to use that to avoid this repeated computation of the observable (and PDF).

APJansen avatar Jan 12 '24 12:01 APJansen

where the tuple are two separate layers containing the respective losses

Yes, but your goal is to reduce the number of calls. However, you will need to call the model once for the optimization. Then the weights are updated. Then you call it a second time to compute the validation for that step. Then you check whether the positivity passes and whatnot.

So there is no way to avoid the repeated computation of the observable.

scarlehoff avatar Jan 12 '24 12:01 scarlehoff

Ah, I hadn't thought about that; you're right that conventionally the validation at step t is computed after training for t steps. My proposal would shift this by one step (epoch) with respect to that convention, in effect computing the validation losses of step t-1 at step t. But I don't think that's a big deal, right? Changes should be tiny from one step to the next. This should give a 50% speedup (not 100%, because the validation only has a forward pass while the training also has a backward pass), and I think that's very much worth it.

APJansen avatar Jan 12 '24 13:01 APJansen

It is a big deal because that tiny change can move you from a physical to an unphysical situation by means of positivity, integrability and probably even normalisation.

But also, in general, it's simply not correct since the epoch at which you wanted to stop was the previous one.

scarlehoff avatar Jan 12 '24 13:01 scarlehoff

True, but that should be easy to solve. Just save the weights at every step and when the stopping condition hits, instead of just stopping, revert to the previous epoch. The weights are tiny so I think that shouldn't be a problem.
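Something along these lines should do (a sketch using plain Keras get_weights/set_weights; names are illustrative and this is not the actual Stopping code):

import tensorflow as tf

class RevertOneStep(tf.keras.callbacks.Callback):
    # keep one extra copy of the weights so we can step back one epoch when stopping
    def __init__(self):
        super().__init__()
        self.previous_weights = None

    def on_epoch_begin(self, epoch, logs=None):
        # snapshot the weights as they are before this epoch's update
        self.previous_weights = self.model.get_weights()

    def revert(self):
        # called when the stopping condition fires: go back to the previous epoch
        if self.previous_weights is not None:
            self.model.set_weights(self.previous_weights)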

APJansen avatar Jan 12 '24 13:01 APJansen

Check the speedup you would get in a full fit and whether the code doesn't become much more complicated.

The forward pass is not the heaviest part of the calculation, and a 50% speedup there won't translate to a speedup of the entire fit. So unless it makes a big difference I'm strongly against adding this. It adds a lot of bug cross-section and complicates the use of custom optimizers quite a bit.

scarlehoff avatar Jan 12 '24 13:01 scarlehoff

This old tensorboard profile illustrates the speedup. The gaps will be mostly removed by epochs-to-batches. Of the rest, the validation step is more than 50% of the training step. [image: TensorBoard profile]

While I didn't think of the issue you mentioned, I still think it should be possible with minimal addition of complexity (it removes some and adds some). I'll have a go and then we can see if it's worth it.

APJansen avatar Jan 12 '24 13:01 APJansen

If the final code is not very complex I'd be happy with it. From what you explain in the comments it looks complicated, especially the idea of the internal filtering of losses.

The tensorboard profile is not enough to know what the effect on the fit would be (especially if it is old; many things have changed in the meanwhile). Note that you will still need to wait and check positivity, check the chi2, etc.

Btw, going back to the experimental model: note that the comparison data in the experimental model is different from training/validation, so it doesn't really matter how you do it, you need to recompute the losses for that one.

scarlehoff avatar Jan 12 '24 13:01 scarlehoff

Yes, I agree the experimental one should stay separate; it also doesn't play any role in performance.

And I agree the approach I proposed is definitely not the cleanest. What I had in mind before, and maybe I can still try that, is to have the model stop at the unmasked observables, and put the remaining layers in an actual loss as seen by Keras, which can be any callable. Now that I think about it, perhaps then the validation losses can be implemented as a metric (which can also be any callable). I'll look into it.

APJansen avatar Jan 12 '24 13:01 APJansen

Update

I made some progress, and have good hopes that a much nicer implementation than I suggested above is possible. (Just hopes for now, because I haven't looked at reverting to the previous epoch's parameters yet.)

As I suggested before, the structure would be:

  • ModelTrainer.observables_model: computes all observables, without any mask (including the experimental ones)
  • training/validation/experimental are reduced to losses; more precisely, each of the 3 is a dictionary from experiment names to MetaModels that take as input the corresponding observable and output the per-replica losses for that experiment, with the appropriate tr/vl mask applied.

They are wrapped by a simple function that 1. adds a dummy argument that tensorflow needs (for y_true), and 2. sums over replicas, as is currently done. The training losses are passed via the loss argument and the validation losses via the metrics argument of the model's compile method. This is already working, and gives logs of the form, for instance:

{'loss': 109752.359375, 'CMS_loss': 25598.076171875, 'DEUTERON_loss': 23445.390625, 'INTEGXT3_loss': 2.059400117104815e-08, 'POSF2U_loss': 54864.96875, 'POSFLL_loss': 150.9700927734375, 'SLAC_loss': 5692.95068359375, 'CMS_val_loss': 6572.11181640625, 'DEUTERON_val_loss': 4083.84716796875, 'POSF2U_val_loss': 54864.96875, 'POSFLL_val_loss': 150.9700927734375, 'SLAC_val_loss': 5784.57958984375}

Note that metrics naturally have access to y_pred, so I didn't have to do anything complicated.
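The wrapping is roughly of this shape (an illustrative sketch; training_losses and validation_losses stand for the per-experiment loss models described above and are not actual attribute names):

import tensorflow as tf

def wrap_loss(loss_model):
    # 1. add the dummy y_true argument that tensorflow expects, 2. sum over replicas
    def wrapped(y_true, y_pred):
        return tf.reduce_sum(loss_model(y_pred), axis=-1)
    return wrapped

# training losses go in as `loss`, validation losses as `metrics`, keyed by output name:
# observables_model.compile(
#     optimizer="adam",
#     loss={name: wrap_loss(m) for name, m in training_losses.items()},
#     metrics={name: [wrap_loss(m)] for name, m in validation_losses.items()},
# )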

What I still have to do is modify Stopping, which I think should be split into a metric (or metrics) that computes the losses, and a callback that uses them and just does the stopping logic based on those, perhaps just as a subclass of Keras's EarlyStopping callback.

Especially this latter part may be more work than I hoped, but I'm pretty convinced it will end up more readable than it started. I'll first look into the resetting of weights though.

APJansen avatar Jan 15 '24 13:01 APJansen

If possible it'd be better if the stopping remains outside of Keras, since it is not a given that the conditions for stopping will always be machine-learning-compatible. Moreover, keeping the stopping conditions outside of Keras (so Keras just gets either a "yes, I stop" or a "no, I don't", as happens right now) allows for the usage of other techniques, for instance the project with the quantum eigensolver.

scarlehoff avatar Jan 15 '24 14:01 scarlehoff

Ok, that's no problem. But if the stopping logic stays in the existing Stopping class but all the chi2 computations etc are moved into metrics, that should be fine right?

And just out of curiosity, do you have any references for the quantum eigensolver project?

APJansen avatar Jan 15 '24 14:01 APJansen

It was done with a much older version of the code (https://inspirehep.net/literature/1834151) but the point is that doing that kind of thing becomes much more difficult as more things become entangled with tensorflow.

class but all the chi2 computations etc are moved into metrics, that should be fine right?

I was mostly thinking about the decision on whether to stop or not (so, looking at positivity, deciding what is a plateau and what is not etc). The computation of the loss itself is already coming from the model.

scarlehoff avatar Jan 15 '24 14:01 scarlehoff

Just for my future self as it will probably be a week or more before I continue this:

  • Make sure all the callbacks are working, need to test >100 epochs for that
  • ModelTrainer.evaluate should be replaced with a call to self.observables_model followed by calls to the 3 losses (which should probably be stored as attributes); see the sketch after this list
  • compute_losses still used in hyperopt, fix there as well
  • remove training/validation/experimental models once no longer needed
  • Implement loading of the previous epoch's weights. Look at the ModelCheckpoint or BackupAndRestore built-in callbacks
  • Nice but not strictly necessary for this PR: strip out everything not directly related to stopping logic from Stopping; most of it is about evaluation and can be taken up by metrics, or just inside the ModelTrainer. Also figure out whether the normalization done in parse_ndata in stopping can be incorporated into the metrics as well.
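For the second point, the replacement could look roughly like this (a hypothetical sketch, assuming the three loss dictionaries end up stored in a self.losses attribute keyed by "training"/"validation"/"experimental", with per-experiment loss models as values):

def evaluate(self, x):
    # compute every observable once, then feed the same outputs to all three sets of losses
    observables = self.observables_model(x)
    return {
        split: {exp: loss(observables) for exp, loss in loss_models.items()}
        for split, loss_models in self.losses.items()
    }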

APJansen avatar Jan 15 '24 17:01 APJansen

The bulk of the work is done, and I've tested that the speedup is indeed about 33% (tested with 1 replica on CPU with NNPDF40_nnlo_as_01180_1000.yml, but it should be the same for e.g. 100 replicas on GPU, as it just skips about 33% of the work).

Outcomes on a small runcard I've tested are identical training chi2s, identical validation chi2s but shifted by 1 step, and identical final chi2s (train/val/exp) after reverting the weights one step, which was trivial to do (only at the cost of storing 2 copies of the weights, but they are tiny).

The structure now is that we have one ModelTrainer.observables_model that computes all observables, and a dictionary ModelTrainer.losses with keys "training", "validation", "experimental", and values for each are again dictionaries of per experiment losses, appropriately masked and filtered for each split.

Note that all losses can be given all observables as input; Keras will just select the ones it needs. Also, we can include the experimental observables in the same model without performance cost, as tensorflow will realise during training that those outputs aren't used in the loss and avoid computing them. I verified this using the script below, where the small and smallest models have identical training times, while the big one takes 20 times longer.

It needs some minor tweaks and more testing, but before going into that I would like to know if you agree now with this approach broadly speaking @scarlehoff. I've also cleaned up the stopping module a bit more, where it was needed for the main changes here, though certainly more can be done. (FitState and FitHistory could at this point be easily removed entirely, but this PR is already getting too big)

timing script
import time
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
import numpy as np

# a tiny branch and a very expensive branch
i_small = Input(shape=(1,))
i_big = Input(shape=(1000,))

d_small = Dense(1)(i_small)
d_large = Dense(300_000)(i_big)  # defined but not connected to any model output
d2_large = Dense(300_000)(i_big)
d3_large = Dense(1)(d2_large)

# "model" uses both branches; "model_small" takes both inputs but only outputs the small branch;
# "model_smallest" doesn't take the big input at all
model = Model(inputs=[i_small, i_big], outputs=[d_small, d3_large])
model_small = Model(inputs=[i_small, i_big], outputs=[d_small])
model_smallest = Model(inputs=[i_small], outputs=[d_small])

model.compile(optimizer="adam", loss="mse")
model_small.compile(optimizer="adam", loss="mse")
model_smallest.compile(optimizer="adam", loss="mse")

x_small = np.random.rand(1000, 1)
x_big = np.random.rand(1000, 1000)
y_small = np.random.rand(1000, 1)
y_big = np.random.rand(1000, 1)

def timefit(model, xs, ys):
    start = time.time()
    model.fit(xs, ys)
    end = time.time()
    print(f"Time for fit: {end-start:.5} s")

# model_small and model_smallest train in the same time: the unused large branch is never computed
timefit(model_smallest, x_small, y_small)
timefit(model_small, [x_small, x_big], [y_small])  # single target: model_small has one output
timefit(model, [x_small, x_big], [y_small, y_big])

APJansen avatar Jan 19 '24 13:01 APJansen