Handling outputs from the same model with different likelihoods / score functions
PKPD models will likely need outputs from the same model with different likelihoods. E.g. one output follows a normal model; another follows a log-normal. At the moment, this sort of thing is not possible in PINTS without lots of manual hacking. Similarly, it is not possible to natively handle an inference problem with multiple outputs where those outputs are measured at different times.
I wanted to open the forum on this. Here's a potential proposal.
We create a `MasterProblemCollection` with the following behaviour:

- It is initialised with a forward model and values as usual. It is also (crucially) initialised with some way of indexing which outputs correspond to which `SubProblem` (see below). At a simple level, this could be a list `[1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3]`, which would indicate that the first three forward model outputs correspond to subproblem 1, the next four to subproblem 2 and the last five to subproblem 3. It is also initialised with a list of `times` lists: one list of times for each `SubProblem`.
- `evaluate` takes an argument `index` (in the above example, this would be either 1, 2 or 3) which specifies which output set to return. The first time `evaluate` is called, the result is cached so that the model is only solved once -- not once for each output set.
- `n_outputs` also takes `index` and returns the relevant number of outputs for the `SubProblem`.
- `times` takes `index` and returns the appropriate time list.
We also create a `SubOutputProblem`:

- It is initialised with a `MasterProblemCollection` and an `index`.
- `evaluate` calls `MasterProblemCollection.evaluate(index)`.
- And so on for `n_outputs` etc.
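A rough sketch of what these two classes might look like. The caching scheme (remembering only the last evaluated parameter set), the exact method signatures, and the idea of simulating once over the union of all time points are my assumptions here, not settled API:

```python
import numpy as np


class MasterProblemCollection(object):
    """Wraps one forward model; splits its outputs/times across sub-problems."""

    def __init__(self, model, times_list, values_list, index_list):
        self._model = model
        self._times_list = [np.asarray(t, dtype=float) for t in times_list]
        self._values_list = values_list
        self._index_list = list(index_list)
        # Assumption: the model is solved once over the union of all times
        self._all_times = np.unique(np.concatenate(self._times_list))
        self._cached_parameters = None
        self._cached_output = None

    def evaluate(self, index, parameters):
        parameters = np.asarray(parameters)
        if (self._cached_parameters is None
                or not np.array_equal(parameters, self._cached_parameters)):
            # Assumes a multi-output model returning an (n_times, n_outputs) array
            output = np.asarray(self._model.simulate(parameters, self._all_times))
            self._cached_parameters = np.copy(parameters)
            self._cached_output = output
        # Keep only the rows (times) and columns (outputs) for this sub-problem
        rows = np.isin(self._all_times, self._times_list[index - 1])
        cols = [i for i, j in enumerate(self._index_list) if j == index]
        return self._cached_output[np.ix_(rows, cols)]

    def times(self, index):
        return self._times_list[index - 1]

    def values(self, index):
        return self._values_list[index - 1]

    def n_outputs(self, index):
        return self._index_list.count(index)


class SubOutputProblem(object):
    """Looks like a pints problem, but delegates everything to the collection."""

    def __init__(self, collection, index):
        self._collection = collection
        self._index = index

    def evaluate(self, parameters):
        return self._collection.evaluate(self._index, parameters)

    def times(self):
        return self._collection.times(self._index)

    def values(self):
        return self._collection.values(self._index)

    def n_outputs(self):
        return self._collection.n_outputs(self._index)
```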
Example use case
The user has a forward model with three outputs and would like to model the first two using a `GaussianLogLikelihood` and the third with a `LogNormalLogLikelihood`.
They would do:
```python
times_1 = [1, 2, 4, 6]
times_2 = [3, 5]
times = [times_1, times_2]
index_list = [1, 1, 2]

master = MasterProblemCollection(forward_model, times, values, index_list)

subproblem_1 = SubProblem(master, 1)
subproblem_2 = SubProblem(master, 2)

loglikelihood_1 = GaussianLogLikelihood(subproblem_1)
loglikelihood_2 = LogNormalLogLikelihood(subproblem_2)

logprior_1 = ...
logprior_2 = ...

logposterior_1 = LogPosterior(loglikelihood_1, logprior_1)
logposterior_2 = LogPosterior(loglikelihood_2, logprior_2)
```
The user could then create a wrapper `LogPosterior` class that just returns the sum of `logposterior_1` and `logposterior_2`, which they could use for inference.
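A minimal sketch of such a wrapper, assuming (for simplicity) that both posteriors are defined over the same parameter vector; in practice the two likelihoods would have different noise parameters, so some extra parameter bookkeeping would be needed. The class name `SumOfLogPosteriors` is made up for illustration:

```python
import pints


class SumOfLogPosteriors(pints.LogPDF):
    """Sums several log-posteriors defined on the same parameter vector."""

    def __init__(self, *log_posteriors):
        self._log_posteriors = log_posteriors
        self._n_parameters = log_posteriors[0].n_parameters()

    def __call__(self, parameters):
        return sum(f(parameters) for f in self._log_posteriors)

    def n_parameters(self):
        return self._n_parameters


# Used as a single target density for inference:
# log_posterior = SumOfLogPosteriors(logposterior_1, logposterior_2)
```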
@martinjrobins @MichaelClerx @chonlei (and anyone else!) interested to hear thoughts on this
> PKPD models will likely need outputs from the same model with different likelihoods. E.g. one output follows a normal model; another follows a log-normal.
I don't understand what you mean here. The PKPD model would be an ODE model, right? So where do the distributions come in?
> Similarly, it is not possible to natively handle an inference problem with multiple outputs where those outputs are measured at different times.
Just for my intuition, can you give me examples of why you'd want or need to do this?
Just a note to say we will definitely need this for the pkpdapp inference. Different outputs of the ODE model will have different noise models, hence the need for different distributions on different outputs. The different outputs might also be measured at different times, e.g. if you are measuring the concentration of the drug as well as the size of a tumour, these wouldn't necessarily be at the same time.
Thanks @MichaelClerx
Yes, PKPD models are just ODE models. What I mean is that, for these types of model, when you want to do inference, it seems to often be the case that some of the outputs would be modelled using Gaussian distributions and others with log-normal distributions.
Re: different outputs having different measurement times, I can give you a good example from epidemiology, but there seem to be lots of cases in PKPD modelling too, due to non-regular measurement of different quantities (e.g. tumour size and inflammation markers). For SEIR-type modelling, it is often the case that some outputs (e.g. cases and deaths) are measured regularly (e.g. daily) whereas others are measured less regularly (e.g. the proportion of the population who are seropositive).
Thanks both! Worth keeping https://github.com/pints-team/pints/issues/1372 in mind too I guess
Caching sounds useful whether or not we have a split analysis of the model output, so... separate ticket? I'm assuming you want to evaluate all times in a single simulation? Then I'd say there's two obvious ways:

- Like you suggest, construct some kind of SuperProblem from SubProblems (let's shy away from "master" for now?). Then this problem would combine the requirements of individual subproblems, run a single sim (or retrieve a cached output), and then split the output up again. E.g. it would have to combine your `times_1 = [1, 2, 4, 6]` and `times_2 = [3, 5]` into `times = [1, 2, 3, 4, 5, 6]` and then be clever enough to split the resulting data sets again (see the sketch after this list).
- Instead, we could also consider filtering. E.g. have a class `ProblemFilter(problem, filter)` where `filter` is something like a boolean array of the same size as the expected output, or a lambda that the user provides. (Or to make it friendlier you could have explicit filters for times and for outputs, but that wouldn't be quite as flexible?) Then you'd (1) create a `MultiSeriesProblem` as usual (but set `caching=True`), (2) create a bunch of filters, (3) create a composed likelihood (as in your example).
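For the first option, the time-grid merging and output-splitting could be just a couple of numpy calls; a self-contained illustration (the "simulated" output here is fake):

```python
import numpy as np

times_1 = np.array([1.0, 2.0, 4.0, 6.0])
times_2 = np.array([3.0, 5.0])

# Combine the sub-problem time grids into a single, sorted, unique grid
times_all = np.union1d(times_1, times_2)          # [1, 2, 3, 4, 5, 6]

# Pretend this is the output of a single simulation over times_all
output = np.column_stack([times_all ** 2, np.exp(-times_all)])

# Split the simulated output back into the two sub-problems' rows
output_1 = output[np.isin(times_all, times_1)]    # rows at times 1, 2, 4, 6
output_2 = output[np.isin(times_all, times_2)]    # rows at times 3, 5
```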
Yep, spot on with what I was thinking re: 1 and, yes, happy to shy away from "master"!
Re: filters, I'm not sure I get ya? Also, currently an issue with MultiSeriesProblem is that this doesn't allow different time arrays for each output, so I guess that'd need to be re-engineered?
@MichaelClerx I'm going to give my suggestion a go here as need to make progress for PKPD app. But, look forward to your thoughts on the PR!
> Yep, spot on with what I was thinking re: 1 and, yes, happy to shy away from "master"!
>
> Re: filters, I'm not sure I get ya? Also, currently an issue with MultiSeriesProblem is that this doesn't allow different time arrays for each output, so I guess that'd need to be re-engineered?
Hold on! Can I try explaining the filter thing a bit?
Sure!
Basically, in your proposal you start by creating a superproblem that then needs to combine some sub problems, run a sim, and split its output again. I'm wondering why you don't cut out the first step, and just create methods to split a problem's output into sub outputs?
So you'd have an ordinary MultiSeriesProblem, but then you'd make sub problems based on its output, and finally combine those sub problems into a single likelihood (although you could hide output as well, using this method)
All the splitting code would need to do would be e.g.
```python
def evaluate(self, parameters):
    all_values = self._parent_problem.evaluate(parameters)
    selected_values = all_values[self._numpy_filter]
    return selected_values
```
or
```python
def evaluate(self, parameters):
    all_values = self._parent_problem.evaluate(parameters)
    selected_values = self._very_flexible_filter(all_values)
    return selected_values
```
The thinking behind this is:
- we've already got ways to combine problems into a single function
- having it work with any old multi (or single) series problem is more flexible than requiring a special type
- this system would potentially allow for more complicated filters than in your original proposal (the user just passes in either a numpy filter or a lambda function)
- the efficiency comes from using caching in the problem class separately, so the code we write to do that will be useful to others as well
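For concreteness, a filter-based sub-problem along these lines might look like the sketch below. `FilteredProblem` is not an existing pints class, and it assumes the parent problem already does its own caching:

```python
import numpy as np


class FilteredProblem(object):
    """Presents a filtered view of a (cached) multi-output problem's output."""

    def __init__(self, parent_problem, numpy_filter):
        self._parent_problem = parent_problem
        # Boolean mask with the same shape as the parent's output, e.g.
        # (n_times, n_outputs), selecting the entries this view uses
        self._numpy_filter = np.asarray(numpy_filter, dtype=bool)

    def evaluate(self, parameters):
        all_values = self._parent_problem.evaluate(parameters)
        # Note: boolean indexing flattens the selection into a 1d array
        return all_values[self._numpy_filter]

    def times(self):
        # Times at which this view keeps at least one output
        return self._parent_problem.times()[self._numpy_filter.any(axis=1)]
```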
Ok, thanks. So would the `MultiSeriesProblem` take all the times where all outputs were evaluated? E.g.:
```python
times_1 = [1, 2, 4, 6]  # output 1
times_2 = [3, 5]        # output 2
times_all = [1, 2, 3, 4, 5, 6]

problem = MultiOutputProblem(model, times_all, values_all)

# Construct sub-problems which are then used to construct individual log-likelihoods
subproblem_1 = SubProblem(problem, filter_1)
...
```
How would you handle `values_all`? In the current `MultiOutputProblem` this needs to be a numpy array of size `n_times` by `n_outputs`. But in the inference problems I'm thinking about we won't have observations for all times for all outputs.
Oooh that's a good point, I was thinking about the simulation much more than the data 😅
Ideally I guess you'd have a `None` in the data set to denote no value, but may need to use a `nan` so that it fits into a numpy doubles array...

But if we start allowing that, then all the likelihoods etc. would need to do some clever filtering? Or... do we write that `nan` may appear in a multi-output series problem's values, but that we expect the user to construct a clever score/likelihood so that this isn't a problem?
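To make the `nan` idea concrete, the padded values array and the corresponding masking might look like this (just an illustration of the data layout, not an existing pints feature; the numbers reuse the example data from later in this thread):

```python
import numpy as np

times_all = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Output 1 measured at t = 1, 2, 4, 6; output 2 measured at t = 3, 5.
# Missing observations are padded with nan, giving an (n_times, n_outputs) array.
values_all = np.array([
    [2.4,    np.nan],
    [3.4,    np.nan],
    [np.nan, -0.3],
    [5.5,    np.nan],
    [np.nan, 0.5],
    [5.5,    np.nan],
])

# A likelihood (or filter) would then have to mask out the nan entries:
observed = ~np.isnan(values_all)
values_1 = values_all[observed[:, 0], 0]   # [2.4, 3.4, 5.5, 5.5]
values_2 = values_all[observed[:, 1], 1]   # [-0.3, 0.5]
```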
> So would the MultiSeriesProblem take all the times where all outputs were evaluated?
Yup!
How were you planning to handle data in the super/sub problem set-up?
So the way I'd handle data would be the same way as for times:
```python
# Times and values for first output
times_1 = [1, 2, 4, 6]
values_1 = [2.4, 3.4, 5.5, 5.5]

# Times and values for second output
times_2 = [3, 5]
values_2 = [-0.3, 0.5]

# Combine all
times_all = [times_1, times_2]
values_all = [values_1, values_2]
index_list = [1, 1, 2]

master = MasterProblemCollection(forward_model, times_all, values_all, index_list)
```
So, we could put `nan`s in the numpy arrays, although this feels a little clunky. I do wonder though whether this filtering means users are more likely to make mistakes -- specifically if their filters don't grab the right values?
Is `index_list` correct in that example?
No, good spot. For this case, it would just be `index_list = [1, 2]`.
If you wanted to create subproblems which corresponded to multioutputs, I guess you could do:
```python
# Times and values for two-dimensional outputs
times_1 = [1, 2, 4, 6]
values_1 = [[2.4, 3.4, 5.5, 5.5], [4.5, ......]]

# Times and values for second output
times_2 = [3, 5]
values_2 = [-0.3, 0.5]

# Combine all
times_all = [times_1, times_2]
values_all = [values_1, values_2]
index_list = [1, 1, 2]

master = MasterProblemCollection(forward_model, times_all, values_all, index_list)
```
then the index list I gave would make more sense.
The efficiency bit does make this tricky! I'm really tempted to say you should solve it on the ForwardModel level, not the Problem level, so that you end up with
```python
subproblem1 = Problem(submodel1, times1, values1)
subproblem2 = Problem(submodel2, times2, values2)
```
but I can't think of a generic way to split a forwardmodel into 2 submodels that recognise they need to use a fixed time set and then filter
```python
model = MyNiceModel()
model_with_fixed_times = FixedTimesModel(model, times)

# Can check that its times are a subset of the parent model's times
submodel1 = FilteredModel(model_with_fixed_times, times, outputs)
submodel2 = FilteredModel(...)
```
yuck!
Interesting -- to me, because the necessity to do all this stems from the inference problem (i.e. having different likelihoods for different outputs), I would be more tempted to solve it at that level than at the forward problem level...?
I think the root problem isn't so much different outputs, but different time series. We've gone from `[(t1, z1), (t2, z2), ..., (tN, zN)]` (where z is a scalar or a vector) to `[[(t11, z11), (t12, z12), ..., (t1N, z1N)], [(t21, z21), (t22, z22), ..., (t2M, z2M)], ...]`.

To me that means we now have a list of Problems, for which we can define a list of likelihoods which we can then combine into a single likelihood using existing code. But we've added the question of how to write a forward model that will have its output split over multiple Problems.
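In code, the change in data layout is roughly this (the "before" numbers are made up; the "after" ones reuse the example above):

```python
# Before: one time grid shared by all outputs, one (n_times, n_outputs) values array
times = [1, 2, 3, 4, 5, 6]
values = [[2.4, 0.1], [3.4, 0.2], [4.1, 0.3], [5.5, 0.4], [5.0, 0.5], [5.5, 0.6]]

# After: each measurable carries its own time grid and its own values
series = [
    ([1, 2, 4, 6], [2.4, 3.4, 5.5, 5.5]),  # first output
    ([3, 5], [-0.3, 0.5]),                 # second output
]
```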
I suppose you were saying the same when you called it a ProblemCollection :D
OK I'm coming round to your original proposal :D
Maybe we can drop `index_list` though, by assuming the outputs don't overlap, and that the user has control over the forward model, and so can ensure that the outputs are ordered as e.g. `[0, 0, 1]` instead of `[0, 1, 0]`?
Then it could be:
```python
# Time series for measurable A (2d)
times_1 = [1, 2, 4, 6]
values_1 = [[2.4, 3.4, 5.5, 5.5], [4.5, ......]]

# Time series for measurable B (1d)
times_2 = [3, 5]
values_2 = [-0.3, 0.5]

# Combine all
collection = SharedModelProblem(forward_model, times_1, values_1, times_2, values_2)
```
which you'd implement as e.g.
```python
def __init__(self, forward_model, *args):
    if len(args) < 2:
        raise ValueError('Must supply at least one time series')
    if len(args) % 2 != 0:
        raise ValueError('Must supply times and values for each time series')

    timeses = []
    valueses = []
    index_list = []
    for i in range(0, len(args), 2):
        times_i = np.asarray(args[i])
        values_i = np.atleast_2d(args[i + 1])   # (n_outputs, n_times)
        timeses.append(times_i)
        valueses.append(values_i)
        # One index entry per output in this time series
        index_list.extend([i // 2] * values_i.shape[0])

    # Single, sorted time vector covering all series
    times = np.unique(np.concatenate(timeses))
```
that kind of thing?
That could work. But how would you point a particular `SubProblem` to a particular subset of outputs? Would you do:
```python
collection = SharedModelProblem(forward_model, times_1, values_1, times_2, values_2)

# Pertaining to times_1, values_1
problem_1 = SubProblem(collection, 1)

# Pertaining to times_2, values_2
problem_2 = SubProblem(collection, 2)
```
I'd have thought keeping an explicit list of indices would make it safer than the implicitness above?
I'd also prefer to call it a `SharedProblemCollection` or (as I had above) a `ProblemCollection`, since it's not actually going to be of type `pints.Problem`.
> I'd have thought keeping an explicit list of indices would make it safer than the implicitness above?
I just thought it'd be easier to have the problems all be disjoint.
You'd get a subproblem from the collection, I think:
```python
problem_1 = collection.subproblem(0)
problem_2 = collection.subproblem(1)
```
I like this!