pymc icon indicating copy to clipboard operation
pymc copied to clipboard

Prediction could be better if shape of observed would not need to be considered

Open twiecki opened this issue 2 years ago • 13 comments

When I have a simple model:

import pymc as pm

with pm.Model() as m:
    x = pm.MutableData("x", input_data)
    y = pm.MutableData("y" , output_data)
    beta = pm.Normal("beta")
    y_est = x * beta
    pm.Normal("y", y_est, observed=y)

Now when I want to predict on new data, I have to do:

with m:
    pm.set_data({"x": np.linspace(0, 10, 100), "y": np.zeros(100)})

because x and y need to have the same shape or I get an error. But it seems wrong that I need to set y to some dummy-thing just for the shape. So what I think it should look like is this:

with m:
    pm.set_data({"x": np.linspace(0, 10, 100)})

I'm not sure what the right solution is, perhaps there is a way to link the shapes somehow, so that when I change x, y automatically also gets changed? Or that when I call pm.sample_posterior_predictive() the data in observed just gets removed and the shape should automatically propagate anyway?

CC @ricardoV94

twiecki avatar Jul 19 '22 08:07 twiecki

Every solution other than returning a new model will have nasty side effects on the current model.

But this may actually result in an okay API:

with pmodel.set_data({
    "x" : ...,
}) as postpred_model:
    idatapp = pm.sample_posterior_predictive()

michaelosthege avatar Jul 19 '22 19:07 michaelosthege

So model.set_data() would create a new model object?

twiecki avatar Jul 19 '22 20:07 twiecki

Has this been solved in the meantime? The following works for me and the shape change propagates from x to y when setting the data. I can also sample the posterior. Modified from the original example (I had to rename the observed to y_obs, otherwise there is a name clash with y):

import numpy as np
import pandas as pd

import pymc as pm

rng = np.random.default_rng(seed=0)
input_data = np.arange(10, dtype=float)
output_data = 0.5*input_data + 10 + rng.random(size=input_data.shape[0])

with pm.Model() as m:
    x = pm.MutableData("x", input_data)
    y_obs = pm.MutableData("y_obs" , output_data)
    beta = pm.Normal("beta")
    y_est = x * beta
    y = pm.Normal("y", y_est, observed=y_obs)
    samples = pm.sample(100, tune=100, cores=1)
print([v.eval().shape for v in [x, y_est, y]]) # All shape (10,)

with m:
    pm.set_data({"x": np.linspace(0, 10, 100)})
    prediction_samples = pm.sample_posterior_predictive(samples)
print([v.eval().shape for v in [x, y_est, y]]) # All shape (100,)

I ran this on the latest commit (ad16bf48a1703b80cee3d3fef9df1dd0ae552d7c).

bherwerth avatar Aug 08 '22 18:08 bherwerth

I've just found out that the example fails when I explicitly set the shape of y like this:

    y = pm.Normal("y", y_est, observed=y_obs, shape=(10,))

In this case, I get a shape error when trying to sample the posterior. Thus, the following seems to be the case:

  • If the shape of y is inferred from the distribution parameters (normal's meany_est), then the shape change propagates through when re-setting x through model.set_data.
  • If the shape is set explicitly, it seems to be hard-coded and is not re-set by model.set_data.

bherwerth avatar Aug 08 '22 19:08 bherwerth

(I think) you can specify shape=x.shape in the likelihood

with pm.Model() as m:
    x = pm.MutableData("x", input_data)
    beta = pm.Normal("beta")
    y_est = x * beta
    y = pm.Normal("y", y_est, observed=y_obs, shape=x.shape)

ricardoV94 avatar Aug 08 '22 19:08 ricardoV94

Yes, using shape=x.shape works as well and the shape change get's propagated to y.

But then the current behavior is already reasonable, isn't it?

bherwerth avatar Aug 09 '22 17:08 bherwerth

Yes, using shape=x.shape works as well and the shape change get's propagated to y.

But then the current behavior is already reasonable, isn't it?

Yes, I think it is reasonable, but we perhaps need to advertise it better (e.g., in one of the GLM notebooks)

The problem is that users don't usually think about specifying shape for likelihoods because PyMC is so handy and infers it from the observed data. But sometimes users know something more about the shape of the likelihood (e.g., it should be as large as x which they will resize for posterior predictive) but PyMC can't be expected to know this.

ricardoV94 avatar Aug 09 '22 21:08 ricardoV94

The problem is that users don't usually think about specifying shape for likelihoods because PyMC is so handy and infers it from the observed data.

In this particular example, the inference even works and allows for changing x without explicitly specifying the shape of y. At least these were my findings:

  • y = pm.Normal("y", y_est, observed=y_obs) works because y seems to get its shape from the distribution parameters instead of the observed.
  • y = pm.Normal("y", y_est, observed=y_obs, shape=x.shape) works as well because the shape is explicitly tied to x
  • y = pm.Normal("y", y_est, observed=y_obs, shape=(10,)) fails because the shape is hard-coded to (10,)

Maybe I am missing the point of the issue, but for me it works as I would expect...

bherwerth avatar Aug 11 '22 20:08 bherwerth

The first case does not always work, and not for all distributions. For instance if you have multiple observations per mean parameter.

The last case should indeed not work (it works as expected)

ricardoV94 avatar Aug 11 '22 20:08 ricardoV94

Maybe I am missing the point of the issue, but for me it works as I would expect...

The first example is still the point of the issue. I think adding an example to the GLM notebook where shape=x.shape is used may be a good enough solution to the issue.

ricardoV94 avatar Aug 11 '22 20:08 ricardoV94

Got stuck on this recently too. I really like the shape=x.shape solution -- but I don't know if I would've thought to check the GLM notebook. I think pm.Data, pm.MutableData, and especially pm.set_data would be good places to document this too, since this only happens when you're interacting there.

bwengals avatar Aug 31 '22 18:08 bwengals

Got stuck on this recently too. I really like the shape=x.shape solution -- but I don't know if I would've thought to check the GLM notebook. I think pm.Data, pm.MutableData, and especially pm.set_data would be good places to document this too, since this only happens when you're interacting there.

Sounds good.

ricardoV94 avatar Aug 31 '22 19:08 ricardoV94

Cool, put a PR up here #6087

bwengals avatar Aug 31 '22 21:08 bwengals