Prediction could be better if the shape of observed did not need to be considered
When I have a simple model:
import pymc as pm

with pm.Model() as m:
    x = pm.MutableData("x", input_data)
    y = pm.MutableData("y", output_data)
    beta = pm.Normal("beta")
    y_est = x * beta
    pm.Normal("y", y_est, observed=y)
Now when I want to predict on new data, I have to do:
with m:
    pm.set_data({"x": np.linspace(0, 10, 100), "y": np.zeros(100)})
because x and y need to have the same shape or I get an error. But it seems wrong that I need to set y to some dummy thing just for the shape. So what I think it should look like is this:
with m:
    pm.set_data({"x": np.linspace(0, 10, 100)})
I'm not sure what the right solution is; perhaps there is a way to link the shapes somehow, so that when I change x, y automatically also gets changed? Or that when I call pm.sample_posterior_predictive(), the data in observed just gets removed and the shape should automatically propagate anyway?
CC @ricardoV94
Every solution other than returning a new model will have nasty side effects on the current model.
But this may actually result in an okay API:
with pmodel.set_data({
    "x": ...,
}) as postpred_model:
    idatapp = pm.sample_posterior_predictive()
So model.set_data() would create a new model object?
Has this been solved in the meantime? The following works for me and the shape change propagates from x to y when setting the data. I can also sample the posterior. Modified from the original example (I had to rename the observed to y_obs, otherwise there is a name clash with y):
import numpy as np
import pandas as pd
import pymc as pm

rng = np.random.default_rng(seed=0)
input_data = np.arange(10, dtype=float)
output_data = 0.5 * input_data + 10 + rng.random(size=input_data.shape[0])

with pm.Model() as m:
    x = pm.MutableData("x", input_data)
    y_obs = pm.MutableData("y_obs", output_data)
    beta = pm.Normal("beta")
    y_est = x * beta
    y = pm.Normal("y", y_est, observed=y_obs)
    samples = pm.sample(100, tune=100, cores=1)

print([v.eval().shape for v in [x, y_est, y]])  # All shape (10,)

with m:
    pm.set_data({"x": np.linspace(0, 10, 100)})
    prediction_samples = pm.sample_posterior_predictive(samples)

print([v.eval().shape for v in [x, y_est, y]])  # All shape (100,)
I ran this on the latest commit (ad16bf48a1703b80cee3d3fef9df1dd0ae552d7c).
I've just found out that the example fails when I explicitly set the shape of y like this:
y = pm.Normal("y", y_est, observed=y_obs, shape=(10,))
In this case, I get a shape error when trying to sample the posterior. Thus, the following seems to be the case:
- If the shape of y is inferred from the distribution parameters (normal's mean y_est), then the shape change propagates through when re-setting x through model.set_data.
- If the shape is set explicitly, it seems to be hard-coded and is not re-set by model.set_data.
(I think) you can specify shape=x.shape in the likelihood:
with pm.Model() as m:
    x = pm.MutableData("x", input_data)
    y_obs = pm.MutableData("y_obs", output_data)
    beta = pm.Normal("beta")
    y_est = x * beta
    y = pm.Normal("y", y_est, observed=y_obs, shape=x.shape)
Yes, using shape=x.shape works as well, and the shape change gets propagated to y.
But then the current behavior is already reasonable, isn't it?
Yes, I think it is reasonable, but we perhaps need to advertise it better (e.g., in one of the GLM notebooks)
The problem is that users don't usually think about specifying shape for likelihoods, because PyMC is so handy and infers it from the observed data. But sometimes users know something more about the shape of the likelihood (e.g., that it should be as large as x, which they will resize for posterior predictive), and PyMC can't be expected to know this.
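For reference, a minimal end-to-end sketch of that pattern, combining the working example above with shape=x.shape so that only x needs to be reset before prediction:

import numpy as np
import pymc as pm

rng = np.random.default_rng(seed=0)
input_data = np.arange(10, dtype=float)
output_data = 0.5 * input_data + 10 + rng.random(size=input_data.shape[0])

with pm.Model() as m:
    x = pm.MutableData("x", input_data)
    y_obs = pm.MutableData("y_obs", output_data)
    beta = pm.Normal("beta")
    # Tie the likelihood's shape to x so it follows x when x is resized.
    y = pm.Normal("y", x * beta, observed=y_obs, shape=x.shape)
    idata = pm.sample()

with m:
    pm.set_data({"x": np.linspace(0, 10, 100)})  # no dummy y needed
    idata_pp = pm.sample_posterior_predictive(idata)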
In this particular example, the inference even works and allows for changing x without explicitly specifying the shape of y. At least these were my findings:
- y = pm.Normal("y", y_est, observed=y_obs) works because y seems to get its shape from the distribution parameters instead of the observed.
- y = pm.Normal("y", y_est, observed=y_obs, shape=x.shape) works as well because the shape is explicitly tied to x.
- y = pm.Normal("y", y_est, observed=y_obs, shape=(10,)) fails because the shape is hard-coded to (10,).
Maybe I am missing the point of the issue, but for me it works as I would expect...
The first case does not always work, and not for all distributions; for instance, if you have multiple observations per mean parameter.
The last case should indeed not work (it behaves as expected).
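To make the first point concrete, here is a sketch with made-up data where the likelihood's shape cannot be recovered from its mean parameter:

import numpy as np
import pymc as pm

rng = np.random.default_rng(seed=0)
input_data = np.arange(3, dtype=float)
# 20 replicate observations per mean parameter -> observed shape (20, 3)
output_data = 0.5 * input_data + rng.normal(size=(20, 3))

with pm.Model() as m:
    x = pm.MutableData("x", input_data)
    beta = pm.Normal("beta")
    y_est = x * beta  # shape (3,)
    # y gets its shape (20, 3) from the observed data, not from y_est,
    # so inferring it from the distribution parameters would be wrong, and
    # resizing x via pm.set_data cannot propagate to y here.
    y = pm.Normal("y", y_est, observed=output_data)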
The first example is still the point of the issue. I think adding an example to the GLM notebook where shape=x.shape is used may be a good enough solution to the issue.
Got stuck on this recently too. I really like the shape=x.shape solution -- but I don't know if I would've thought to check the GLM notebook. I think pm.Data, pm.MutableData, and especially pm.set_data would be good places to document this too, since this only happens when you're interacting there.
Sounds good.
Cool, put a PR up here #6087