
Handling outputs from the same model with different likelihoods / score functions

ben18785 opened this issue 3 years ago · 29 comments

PKPD models will likely need outputs from the same model with different likelihoods. E.g. one output follows a normal model; another follows a log-normal. At the moment, this sort of thing is not possible in PINTS without lots of manual hacking. Similarly, it is not possible to natively handle an inference problem with multiple outputs where those outputs are measured at different times.

I wanted to open the floor on this. Here's a potential proposal.

We create a MasterProblemCollection with the following methods:

  • initialised with a forward model and values, as usual. It is also (crucially) initialised with some way of indexing which outputs correspond to which SubProblem (see below). At a simple level, this could be a list [1,1,1,2,2,2,2,3,3,3,3,3], which would indicate that the first three forward model outputs correspond to subproblem1, the next four to subproblem2 and the last five to subproblem3. It is also initialised with a list of times lists: one list of times for each SubProblem.
  • evaluate takes an argument index (in the above example, this would be either 1, 2 or 3) which specifies which output set to return. The first time evaluate is called, the result is cached so that the model is only solved once -- not once for each output set.
  • n_outputs also takes index and returns the relevant number of outputs for the SubProblem
  • times takes index and returns the appropriate time list

We also create a SubOutputProblem:

  • initialised with a MasterProblemCollection and an index
  • evaluate calls MasterProblemCollection.evaluate(index)
  • and so on for n_outputs etc
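
To make the caching idea concrete, here is a minimal sketch of how the two classes could fit together. This is hypothetical code, not existing PINTS API: it assumes a pints.ForwardModel whose simulate(parameters, times) returns an array of shape (n_times, n_outputs), and it uses the 1-based indexing from the example list above.

import numpy as np

class MasterProblemCollection(object):
    # Shares a single forward simulation between several sub-problems
    def __init__(self, model, timeses, valueses, index_list):
        self._model = model
        self._timeses = [np.asarray(t) for t in timeses]
        self._valueses = [np.asarray(v) for v in valueses]
        self._index_list = np.asarray(index_list)
        # Union of all sub-problem times: the grid for the single simulation
        self._all_times = np.unique(np.concatenate(self._timeses))
        self._cached_parameters = None
        self._cached_output = None

    def evaluate(self, index, parameters):
        # Run the shared simulation once, then re-use the cached output
        parameters = np.asarray(parameters)
        if (self._cached_parameters is None
                or not np.array_equal(parameters, self._cached_parameters)):
            self._cached_output = np.asarray(
                self._model.simulate(parameters, self._all_times))
            self._cached_parameters = parameters
        # Select this sub-problem's times (rows) and outputs (columns)
        rows = np.isin(self._all_times, self._timeses[index - 1])
        cols = self._index_list == index
        return self._cached_output[np.ix_(rows, cols)]

    def times(self, index):
        return self._timeses[index - 1]

    def values(self, index):
        return self._valueses[index - 1]

    def n_outputs(self, index):
        return int(np.sum(self._index_list == index))

class SubOutputProblem(object):
    # Thin wrapper on which a log-likelihood could be built
    def __init__(self, collection, index):
        self._collection = collection
        self._index = index

    def evaluate(self, parameters):
        return self._collection.evaluate(self._index, parameters)

    def times(self):
        return self._collection.times(self._index)

    def values(self):
        return self._collection.values(self._index)

    def n_outputs(self):
        return self._collection.n_outputs(self._index)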

Example use case

The user has a forward model with three outputs and would like to model the first two using a GaussianLogLikelihood and the third with a LogNormalLogLikelihood.

They would do:

times_1 = [1, 2, 4, 6]
times_2 = [3, 5]
times = [times_1, times_2]
index_list = [1, 1, 2]
master = MasterProblemCollection(forward_model, times, values, index_list)

subproblem_1 = SubProblem(master, 1)
subproblem_2 = SubProblem(master, 2)

loglikelihood_1 = GaussianLogLikelihood(subproblem_1)
loglikelihood_2 = LogNormalLogLikelihood(subproblem_2)
logprior_1 = ...
logprior_2 = ...

logposterior_1 = LogPosterior(loglikelihood_1, logprior_1)
logposterior_2 = LogPosterior(loglikelihood_2, logprior_2)

The user could then create a wrapper to create a LogPosterior class that just returns the sum of logposterior_1 and logposterior_2 which they could use for inference.
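
As a rough illustration, such a wrapper could be a small pints.LogPDF subclass. This is a hypothetical sketch that assumes both log-posteriors are defined over the same parameter vector; with different noise models each will typically carry extra, non-shared noise parameters, which would need a parameter mapping in practice.

import pints

class SumOfLogPosteriors(pints.LogPDF):
    # Hypothetical wrapper: evaluates both posteriors on the same parameters
    def __init__(self, log_posterior_1, log_posterior_2):
        self._posteriors = (log_posterior_1, log_posterior_2)
        self._n_parameters = log_posterior_1.n_parameters()

    def n_parameters(self):
        return self._n_parameters

    def __call__(self, parameters):
        return sum(f(parameters) for f in self._posteriors)

logposterior = SumOfLogPosteriors(logposterior_1, logposterior_2)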

@martinjrobins @MichaelClerx @chonlei (and anyone else!) interested to hear thoughts on this

ben18785 avatar Oct 11 '21 14:10 ben18785

PKPD models will likely need outputs from the same model with different likelihoods. E.g. one output follows a normal model; another follows a log-normal.

I don't understand what you mean here. The PKPD model would be an ODE model, right? So where do the distributions come in?

Similarly, it is not possible to natively handle an inference problem with multiple outputs where those outputs are measured at different times.

Just for my intuition, can you give me examples of why you'd want or need to do this?

MichaelClerx avatar Oct 12 '21 09:10 MichaelClerx

Just a note to say we will definitely need this for the pkpdapp inference. Different outputs of the ODE model will have different noise models, hence the need for different distributions on different outputs. The different outputs might also be measured at different times: e.g. if you are measuring the concentration of the drug as well as the size of a tumour, these wouldn't necessarily be measured at the same times.

martinjrobins avatar Oct 12 '21 09:10 martinjrobins

Thanks @MichaelClerx

Yes, PKPD models are just ODE models. What I mean is that, for these types of model, when you want to do inference, it often seems to be the case that some of the outputs would be modelled using Gaussian distributions; others with log-normal distributions.

Re: different outputs having different measurement times, I can give you a good example from epidemiology, but there also seem to be lots of cases in PKPD modelling due to irregular measurement of different quantities (e.g. tumour size and inflammation markers). For SEIR-type modelling, it is often the case that some outputs (e.g. cases and deaths) are measured regularly (e.g. daily) whereas others are measured less regularly (e.g. the proportion of the population who are seropositive).

ben18785 avatar Oct 12 '21 09:10 ben18785

Thanks both! Worth keeping https://github.com/pints-team/pints/issues/1372 in mind too I guess

MichaelClerx avatar Oct 12 '21 09:10 MichaelClerx

Caching sounds useful whether or not we have a split analysis of the model output, so... separate ticket? I'm assuming you want to evaluate all times in a single simulation? Then I'd say there's two obvious ways:

  1. Like you suggest, construct some kind of SuperProblem from SubProblems (let's shy away from "master" for now?). Then this problem would combine the requirements of individual subproblems, run a single sim (or retrieve a cached output), and then split the output up again. E.g. it would have to combine your times_1 = [1, 2, 4, 6] and times_2 = [3, 5] into times=[1,2,3,4,5,6] and then be clever enough to split the resulting data sets again (see the small sketch after this list).

  2. Instead, we could also consider filtering. E.g. have a class ProblemFilter(problem, filter) where filter is something like a boolean array of the same size as the expected output, or a lambda that the user provides. (Or to make it friendlier you could have explicit filters for times and for outputs, but that wouldn't be quite as flexible?) Then you'd (1) create a MultiSeriesProblem as usual (but set caching=True), (2) create a bunch of filters, and (3) create a composed likelihood (as in your example).
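
To illustrate option 1's merge-and-split, a small hypothetical sketch in plain numpy (not existing PINTS code):

import numpy as np

# Merge the two requested time grids into one simulation grid
times_1 = np.array([1, 2, 4, 6])
times_2 = np.array([3, 5])
times = np.union1d(times_1, times_2)   # array([1, 2, 3, 4, 5, 6])

# After a single simulation on 'times', recover each sub-problem's rows
rows_1 = np.isin(times, times_1)       # [ True  True False  True False  True]
rows_2 = np.isin(times, times_2)       # [False False  True False  True False]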

MichaelClerx avatar Oct 12 '21 10:10 MichaelClerx

Yep, spot on with what I was thinking re: 1 and, yes, happy to shy away from "master"!

Re: filters, I'm not sure I get ya? Also, currently an issue with MultiSeriesProblem is that this doesn't allow different time arrays for each output, so I guess that'd need to be re-engineered?

ben18785 avatar Oct 13 '21 14:10 ben18785

@MichaelClerx I'm going to give my suggestion a go here as need to make progress for PKPD app. But, look forward to your thoughts on the PR!

ben18785 avatar Oct 18 '21 09:10 ben18785

Yep, spot on with what I was thinking re: 1 and, yes, happy to shy away from "master"!

Re: filters, I'm not sure I get ya? Also, currently an issue with MultiSeriesProblem is that this doesn't allow different time arrays for each output, so I guess that'd need to be re-engineered?

Hold on! Can I try explaining the filter thing a bit?

MichaelClerx avatar Oct 18 '21 09:10 MichaelClerx

Sure!

ben18785 avatar Oct 18 '21 09:10 ben18785

Basically, in your proposal you start by creating a superproblem that then needs to combine some sub problems, run a sim, and split its output again. I'm wondering why you don't cut out the first step, and just create methods to split a problem's output into sub outputs?

So you'd have an ordinary MultiSeriesProblem, but then you'd make sub problems based on its output, and finally combine those sub problems into a single likelihood (although you could hide output as well, using this method)

MichaelClerx avatar Oct 18 '21 09:10 MichaelClerx

All the splitting code would need to do is something like:

def evaluate(self, parameters):
    # Evaluate the parent problem, then select entries with a boolean numpy filter
    all_values = self._parent_problem.evaluate(parameters)
    selected_values = all_values[self._numpy_filter]
    return selected_values

or

def evaluate(self, parameters):
    # Evaluate the parent problem, then apply a user-supplied filter function
    all_values = self._parent_problem.evaluate(parameters)
    selected_values = self._very_flexible_filter(all_values)
    return selected_values
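
For the numpy-filter variant, the filters themselves could just be boolean arrays over the problem's (n_times, n_outputs) output. A hypothetical sketch, using the times from the earlier example:

import numpy as np

# Shared simulation grid for a two-output problem (6 times x 2 outputs)
times_all = np.array([1, 2, 3, 4, 5, 6])
n_outputs = 2

# Output 0 is observed at times [1, 2, 4, 6]
filter_1 = np.zeros((len(times_all), n_outputs), dtype=bool)
filter_1[np.isin(times_all, [1, 2, 4, 6]), 0] = True

# Output 1 is observed at times [3, 5]
filter_2 = np.zeros((len(times_all), n_outputs), dtype=bool)
filter_2[np.isin(times_all, [3, 5]), 1] = True

# Note: all_values[filter_1] returns a flat 1-D array of the selected entries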

MichaelClerx avatar Oct 18 '21 09:10 MichaelClerx

The thinking behind this is:

  • we've already got ways to combine problems into a single function
  • having it work with any old multi (or single) series problem is more flexible than requiring a special type
  • this system would potentially allow for more complicated filters than in your original proposal (the user just passes in either a numpy filter or a lambda function)
  • the efficiency comes from using caching in the problem class separately, so the code we write to do that will be useful to others as well

MichaelClerx avatar Oct 18 '21 10:10 MichaelClerx

Ok, thanks. So would the MultiSeriesProblem take all the times where all outputs were evaluated? E.g.:

times_1 = [1, 2, 4, 6] # output 1
times_2 = [3, 5] # output 2
times_all = [1,2,3,4,5,6]
problem = MultiOutputProblem(model, times_all, values_all)

# construct sub problems which are then used to construct individual log-likelihoods
subproblem_1 = SubProblem(problem, filter_1)
...

How would you handle values_all? In the current MultiOutputProblem this needs to be a numpy array of size n_times by n_outputs. But in the inference problems I'm thinking about we won't have observations for all times for all outputs.

ben18785 avatar Oct 18 '21 10:10 ben18785

Oooh that's a good point, I was thinking about the simulation much more than the data 😅

Ideally I guess you'd have a None in the data set to denote no value, but may need to use a nan so that it fits into a numpy doubles array...

But if we start allowing that, then all the likelihoods etc. would need to do some clever filtering? Or... do we document that nan may appear in a multi-output series problem's values, but expect the user to construct a clever score/likelihood so that this isn't a problem?
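
(A toy sketch of what that "clever filtering" might look like for a Gaussian likelihood; hypothetical, not existing PINTS behaviour:)

import numpy as np

# nan marks 'no observation' for this output at these times
observed = np.array([2.4, np.nan, 3.4, np.nan, 5.5, 5.5])
simulated = np.array([2.5, 1.0, 3.3, 2.0, 5.4, 5.6])
sigma = 0.1

# Only the observed entries contribute to the log-likelihood
mask = ~np.isnan(observed)
residuals = simulated[mask] - observed[mask]
log_likelihood = -0.5 * np.sum(
    np.log(2 * np.pi * sigma ** 2) + (residuals / sigma) ** 2)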

MichaelClerx avatar Oct 18 '21 10:10 MichaelClerx

So would the MultiSeriesProblem take all the times where all outputs were evaluated?

Yup!

MichaelClerx avatar Oct 18 '21 10:10 MichaelClerx

How were you planning to handle data in the super/sub problem set-up?

MichaelClerx avatar Oct 18 '21 10:10 MichaelClerx

So the way I'd handle data would be the same way as for times:

## times and values for first output
times_1 = [1, 2, 4, 6]
values_1 = [2.4, 3.4, 5.5, 5.5]

## times and values for second output
times_2 = [3, 5]
values_2 = [-0.3, 0.5]

## combine all
times_all = [times_1, times_2]
values_all = [values_1, values_2]
index_list = [1, 1, 2]
master = MasterProblemCollection(forward_model, times_all, values_all, index_list)

So, we could put nans in the numpy arrays, although this feels a little clunky. I do wonder though whether this filtering means users are more likely to make mistakes -- specifically if their filters don't grab the right values?

ben18785 avatar Oct 18 '21 11:10 ben18785

Is index_list correct in that example?

MichaelClerx avatar Oct 18 '21 11:10 MichaelClerx

No, good spot. For this case, it would just be index_list=[1,2].

If you wanted to create subproblems which corresponded to multioutputs, I guess you could do:

## times and values for two-dimensional outputs
times_1 = [1, 2, 4, 6]
values_1 = [[2.4, 3.4, 5.5, 5.5],[4.5,......]]

## times and values for second output
times_2 = [3, 5]
values_2 = [-0.3, 0.5]

## combine all
times_all = [times_1, times_2]
values_all = [values_1, values_2]
index_list = [1, 1, 2]
master = MasterProblemCollection(forward_model, times_all, values_all, index_list)

then the index list I gave would make more sense.

ben18785 avatar Oct 18 '21 11:10 ben18785

The efficiency bit does make this tricky! I'm really tempted to say you should solve it on the ForwardModel level, not the Problem level, so that you end up with

subproblem1 = Problem(submodel1, times1, values1)
subproblem2 = Problem(submodel2, times2, values2)

but I can't think of a generic way to split a forwardmodel into 2 submodels that recognise they need to use a fixed time set and then filter

MichaelClerx avatar Oct 18 '21 11:10 MichaelClerx

model = MyNiceModel()
model_with_fixed_times = FixedTimesModel(model, times)
# FilteredModel could check that its times are a subset of the parent model's times
submodel1 = FilteredModel(model_with_fixed_times, times, outputs)
submodel2 = FilteredModel(...)

yuck!

MichaelClerx avatar Oct 18 '21 11:10 MichaelClerx

Interesting -- to me -- because the necessity to do this all stems from the inference problem (i.e. having different likelihoods for different outputs) I would be more tempted to solve it at that level than at the forward problem level...?

ben18785 avatar Oct 18 '21 11:10 ben18785

I think the root problem isn't so much different outputs, but different time series. We've gone from [(t1, z1), (t2, z2), ..., (tN, zN)] (where z is a scalar or a vector) to [[(t11, z11), (t12, z12), ..., (t1N, z1N)], [(t21, z21), (t22, z22), ..., (t2M, z2M)], ...]

To me that means we now have a list of Problems, for which we can define a list of likelihoods which we can then combine into a single likelihood using existing code. But we've added the question of how to write a forward model that will have its output split over multiple Problems

MichaelClerx avatar Oct 18 '21 11:10 MichaelClerx

I suppose you were saying the same when you called it a ProblemCollection :D

MichaelClerx avatar Oct 18 '21 11:10 MichaelClerx

OK I'm coming round to your original proposal :D

Maybe we can drop index_list though, by assuming the outputs don't overlap, and that the user has control over the forwardmodel, and so can ensure that the outputs are ordered as e.g. [0, 0, 1] instead of [0, 1, 0]?

Then it could be:

# Time series for measurable A (2d)
times_1 = [1, 2, 4, 6]
values_1 = [[2.4, 3.4, 5.5, 5.5],[4.5,......]]

# Time series for measurable B (1d)
times_2 = [3, 5]
values_2 = [-0.3, 0.5]

# Combine all
collection = SharedModelProblem(forward_model, times_1, values_1, times_2, values_2)

which you'd implement as e.g.

def __init__(self, forward_model, *args):
    # (assumes numpy imported as np)
    if len(args) < 2:
        raise ValueError('Must supply at least one time series')
    if len(args) % 2 != 0:
        raise ValueError('Must supply times and values for each time series')
    timeses = []
    valueses = []
    index_list = []
    for i in range(len(args) // 2):
        times_i = np.asarray(args[2 * i])
        values_i = np.asarray(args[2 * i + 1])
        timeses.append(times_i)
        valueses.append(values_i)
        # One index entry per output (values assumed to be n_times by n_outputs)
        n_outputs_i = values_i.shape[1] if values_i.ndim > 1 else 1
        index_list.extend([i] * n_outputs_i)
    # Union of all requested times, sorted, for the single simulation
    times = np.unique(np.concatenate(timeses))

that kind of thing?

MichaelClerx avatar Oct 18 '21 12:10 MichaelClerx

That could work. But how would you point a particular SubProblem to a particular subset of outputs? Would you do:

collection = SharedModelProblem(forward_model, times_1, values_1, times_2, values_2)

## pertaining to times_1, values_1
problem_1 = SubProblem(collection, 1)

## pertaining to times_2, values_2
problem_2 = SubProblem(collection, 2)

I'd have thought keeping an explicit list of indices would mean it was safer than the implicitness above?

ben18785 avatar Oct 18 '21 12:10 ben18785

I'd also prefer to call it a SharedProblemCollection or (as I had above) ProblemCollection since it's not actually going to be of type pints.Problem.

ben18785 avatar Oct 18 '21 13:10 ben18785

I'd have thought keeping an explicit list of indices would mean it was safer than the implicitness above?

I just thought it'd be easier to have the problems all be disjoint.

You'd get a subproblem from the collection, I think:

problem_1 = collection.subproblem(0)
problem_2 = collection.subproblem(1)
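
(On the collection, that accessor could be as simple as, hypothetically:)

def subproblem(self, index):
    # Return a view of the collection restricted to one sub-problem's
    # outputs and times, e.g. the SubOutputProblem sketched earlier
    return SubOutputProblem(self, index)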

MichaelClerx avatar Oct 18 '21 14:10 MichaelClerx

I like this!

problem_1 = collection.subproblem(0)
problem_2 = collection.subproblem(1)

ben18785 avatar Oct 18 '21 14:10 ben18785