Variables with shared inputs are always resampled from the prior in sample_posterior_predictive

jcstucki opened this issue on Aug 10, 2022 · 14 comments

Description of your problem

Using the model.add_coord method appears to break pm.sample_posterior_predictive.

Here’s a simple example, estimating the loc and scale of three independent normals:

Please provide a minimal, self-contained, and reproducible example.

import numpy as np
import pandas as pd
import pymc as pm

# Generate data
data = np.random.normal(loc=np.array([3, 5, 8]), scale=np.array([1.1, 6.3, 9.1]), size=(1000, 3))

# Model 1: No coords
with pm.Model() as no_coords_model:
    mu = pm.Normal('mu', mu=0, sigma=10, size=3)
    sigma = pm.HalfNormal('sigma', sigma=10, size=3)
    
    ll = pm.Normal('obs', mu=mu, sigma=sigma, observed=data)
    no_coords_trace = pm.sample()
    no_coords_post = pm.sample_posterior_predictive(no_coords_trace)

# Model 2: Context manager
coords = {'name': ['A', 'B', 'C']}
with pm.Model(coords=coords) as context_model:
    mu = pm.Normal('mu', mu=0, sigma=10, dims=['name'])
    sigma = pm.HalfNormal('sigma', sigma=10, dims=['name'])
    ll = pm.Normal('obs', mu=mu, sigma=sigma, observed=data)
    
    context_trace = pm.sample()
    context_post = pm.sample_posterior_predictive(context_trace)

# Model 3: Within model
with pm.Model() as within_model:
    within_model.add_coord('name', ['A', 'B', 'C'], mutable=True)
    mu = pm.Normal('mu', mu=0, sigma=10, dims=['name'])
    sigma = pm.HalfNormal('sigma', sigma=10, dims=['name'])
    
    ll = pm.Normal('obs', mu=mu, sigma=sigma, observed=data)
    within_trace = pm.sample()
    within_post = pm.sample_posterior_predictive(within_trace)



traces = [no_coords_trace, context_trace, within_trace]
mus = [trace.posterior.mu.values[..., i].mean() for trace in traces for i in range(3)]
sigmas = [trace.posterior.sigma.values[..., i].mean() for trace in traces for i in range(3)]
post_df = pd.DataFrame(np.c_[mus, sigmas], columns=['mu', 'sigma'], index=pd.MultiIndex.from_product([['no_coords', 'context', 'within'], ['A', 'B', 'C']]))
print(post_df.unstack(1).to_string())

                 mu                         sigma                    
                  A         B         C         A         B         C
context    2.977460  4.982624  7.826642  1.081710  6.287514  9.165928
no_coords  2.976785  4.984743  7.827109  1.081657  6.289910  9.174939
within     2.976568  4.990646  7.825051  1.081552  6.286198  9.167916


pps = [no_coords_post, context_post, within_post]
mean_value = [post.posterior_predictive.obs.values[..., i].mean() for post in pps for i in range(3)]
post_df = pd.DataFrame(mean_value, columns=['mean_ppc'], index=pd.MultiIndex.from_product([['no_coords', 'context', 'within'], ['A', 'B', 'C']]))

           mean_ppc                    
                  A         B         C
context    2.977167  4.985852  7.825006
no_coords  2.976837  4.982244  7.818495
within    -0.045788 -0.594845 -0.270400


"The dims on within_post are the same as the others, but it seems like totally wrong values are getting sampled. It is telling that the mean of each distribution is not the mean of the means, suggesting it’s not a case of correct values being shuffled."

This is a serious blocker when attempting to do out-of-sample predictions, since the coords need to be set somehow, and (to my knowledge) they can't be re-added to the model context after instantiation.

Versions and main components

  • PyMC/PyMC3 Version: 4.1.3
  • Aesara/Theano Version: 2.7.7
  • Python Version: 3.8
  • Operating system: Mac OS
  • How did you install PyMC/PyMC3: pip

jcstucki avatar Aug 10 '22 15:08 jcstucki

This sounds serious!

If you look at the control flow of Model.__init__(coords=...) you can see that internally it gets routed into Model.add_coords and thereby Model.add_coord. But in the example you posted, there's a mutable=True setting which creates the difference. Via Model(coords=...) the dimension will be immutable by default.
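
To make that difference concrete, here is a minimal sketch of the two declaration paths (assuming the v4 API): coords passed to the constructor end up as fixed dim lengths, while add_coord(..., mutable=True) stores the dim length in a shared variable.

with pm.Model(coords={'name': ['A', 'B', 'C']}) as m1:
    # coords passed to the constructor -> dim is immutable by default
    mu = pm.Normal('mu', mu=0, sigma=10, dims='name')

with pm.Model() as m2:
    # add_coord with mutable=True -> dim length lives in a SharedVariable
    m2.add_coord('name', ['A', 'B', 'C'], mutable=True)
    mu = pm.Normal('mu', mu=0, sigma=10, dims='name')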

So I would hypothesize that this is an issue with mutable dims. And since the likelihood is broadcast from (3,) to the observed shape of (1000, 3), this is probably (yet another) broadcasting bug.

👉 Does it go away if you pass mutable=False in the third example?

👉 What if instead of size=3 in the first example you do three = aesara.shared(3) and size=three?

👉 What if in the third example you manually broadcast the parameters already? (ll = pm.Normal('obs', mu=mu[None, :], sigma=sigma[None, :], observed=data))

michaelosthege avatar Aug 13 '22 15:08 michaelosthege

Reporting back on the three suggestions:

  1. Passing mutable=False in the third example fixes the bug (see the sketch after this list).
  2. Passing a shared variable to size in the first example causes the bug.
  3. Manually broadcasting does not fix the bug, at least not by expanding on the first dimension.
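
For concreteness, a sketch of the working variant from test 1, i.e. the third model from the issue with an immutable dim (reusing data from the original example):

with pm.Model() as within_model_fixed:
    within_model_fixed.add_coord('name', ['A', 'B', 'C'], mutable=False)
    mu = pm.Normal('mu', mu=0, sigma=10, dims=['name'])
    sigma = pm.HalfNormal('sigma', sigma=10, dims=['name'])

    ll = pm.Normal('obs', mu=mu, sigma=sigma, observed=data)
    within_trace_fixed = pm.sample()
    # posterior predictive means now line up with the posterior, as in models 1 and 2
    within_post_fixed = pm.sample_posterior_predictive(within_trace_fixed)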

A note on (2). This code snippet:

three = aesara.shared(3)
pm.Normal('normal', size=three)

Raised a TypeError for me on version 4.1.4 (Windows). I instead had to pass the size as a 1-tuple (size=(three,)). I don't know if this is relevant. It doesn't appear to be in simple tests of pm.Normal(...).eval(), but I still thought it worth mentioning.
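
To illustrate, a small sketch of what I mean (reusing the pm and aesara imports from above):

three = aesara.shared(3)

with pm.Model():
    # pm.Normal('normal', size=three)         # raised a TypeError for me on 4.1.4
    x = pm.Normal('normal', size=(three,))    # works when the size is wrapped in a 1-tuple

print(x.eval().shape)  # (3,)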

jessegrabowski avatar Aug 15 '22 10:08 jessegrabowski

Thanks @jessegrabowski, I'd say this confirms my hypothesis: It's a bug related to broadcasted, symbolic sizes. @pymc-devs/dev-core can someone familiar with aeppl take a look at this? It does sound a little dangerous..

I instead had to pass the size in as a 1-tuple (size=(three, )). I don't know if this is relevant. It doesn't appear to be in simple tests of pm.Normal(...).eval(), but I still thought it worth mentioning.

This sounds like a smaller bug whereby a shared variable is not automatically wrapped in a tuple. This is the test case where we're testing these "lazy" flavors of passing scalars. The test doesn't cover passing a (symbolic) scalar to size yet. https://github.com/pymc-devs/pymc/blob/906fcdc7a944972c5df1021a705503f7bf42b036/pymc/tests/test_shape_handling.py#L439-L444

michaelosthege avatar Aug 15 '22 12:08 michaelosthege

Perhaps the variable mu is being resampled by mistake. https://github.com/pymc-devs/pymc/issues/5973 could help

ricardoV94 avatar Aug 15 '22 13:08 ricardoV94

As an additional test, I changed the mean of the prior distribution I put over the estimated parameters. It seems that, when using model.add_coord, pm.sample_posterior_predictive actually samples from the prior. Here are my tests:

# Model 3: Within model
with pm.Model() as within_model:
    within_model.add_coord('name', ['A', 'B', 'C'], mutable=True)
    mu = pm.Normal('mu', mu=3, sigma=10, dims=['name'])
    sigma = pm.HalfNormal('sigma', sigma=10, dims=['name'])
    
    ll = pm.Normal('obs', mu=mu, sigma=sigma, observed=data)
    within_trace = pm.sample()
    within_post = pm.sample_posterior_predictive(within_trace)
    print(within_post.posterior_predictive.obs.mean())
    >>> 2.83320012

with pm.Model() as within_model:
    within_model.add_coord('name', ['A', 'B', 'C'], mutable=True)
    mu = pm.Normal('mu', mu=-10, sigma=10, dims=['name'])
    sigma = pm.HalfNormal('sigma', sigma=10, dims=['name'])
    
    ll = pm.Normal('obs', mu=mu, sigma=sigma, observed=data)
    within_trace = pm.sample()
    within_post = pm.sample_posterior_predictive(within_trace)
    print(within_post.posterior_predictive.obs.mean())
    >>> -9.98125202

No idea why that would be happening, but it seems to suggest it's not a shape or broadcasting issue. Instead, when a coord is a SharedVariable, the prior is being sampled as if it were the posterior.

jessegrabowski avatar Aug 20 '22 05:08 jessegrabowski

Here are some more thorough outputs to show that pm.sample_posterior_predictive is sampling from the prior for all variables, and that the trigger is a shared variable in the shape:

with pm.Model() as model:
    three = aesara.shared(3)
    mu = pm.Normal('mu', mu=-10, sigma=10, shape=(three, ))
    sigma = pm.HalfNormal('sigma', sigma=11, shape=(three, ))
    
    ll = pm.Normal('obs', mu=mu, sigma=sigma, observed=data)
    trace = pm.sample()
    post = pm.sample_posterior_predictive(trace, var_names=['mu', 'sigma', 'obs'])

Compare posterior and posterior_predictive for mu:

# Posterior
print(trace.posterior.mu.mean(dim=['chain', 'draw']).values)
print(trace.posterior.mu.std(dim=['chain', 'draw']).values)

# Output
# [3.01789945 5.26561783 8.11427295]
# [0.03474686 0.20847781 0.29282972]

# Posterior Predictive
print(post.posterior_predictive.mu.mean(dim=['chain', 'draw']).values)
print(post.posterior_predictive.mu.std(dim=['chain', 'draw']).values)

# Output:
# [-10.33329768  -9.96841615  -9.62960897]
# [ 9.6122322  10.23897464 10.21420589]

And for sigma:

# Posterior
print(trace.posterior.sigma.mean(dim=['chain', 'draw']).values)
print(trace.posterior.sigma.std(dim=['chain', 'draw']).values)
# Output 
# [1.08169416 6.39748516 9.09000896]
# [0.02457604 0.14622595 0.20021742]

# Posterior Predictive
print(post.posterior_predictive.sigma.mean(dim=['chain', 'draw']).values)
print(post.posterior_predictive.sigma.std(dim=['chain', 'draw']).values)

# Output
# [9.01756853 8.45109316 8.63402421]
# [6.54332978 6.36108147 6.72434589]

jessegrabowski avatar Aug 20 '22 05:08 jessegrabowski

Yeah, we hadn't thought about shared-value coordinates when we designed the volatility propagation rule in sample_posterior_predictive. We will have to change it slightly. If the user supplies an InferenceData object, we can retrieve the constant data group and the coord values from the posterior group, and check whether the tensor values have changed by the time we call sample_posterior_predictive. If they have changed, we mark them as volatile; if they haven't, we don't. That should fix this issue. The problem I see is what we should do when the input isn't an InferenceData object. Maybe we should assume that coords are not volatile but other tensors are?
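
A rough pseudocode sketch of that check (is_volatile is a hypothetical helper for illustration, not the actual implementation):

import numpy as np

def is_volatile(shared_var, idata):
    # Treat a shared variable as volatile only if its value changed since inference
    name = shared_var.name
    if hasattr(idata, 'constant_data') and name in idata.constant_data:
        recorded = np.asarray(idata.constant_data[name].values)
    elif name in idata.posterior.coords:
        recorded = np.asarray(idata.posterior.coords[name].values)
    else:
        return True  # nothing recorded -> stay defensive and resample
    current = np.asarray(shared_var.get_value())
    return current.shape != recorded.shape or not np.array_equal(current, recorded)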

lucianopaz avatar Aug 20 '22 05:08 lucianopaz

This is not a shape problem per se. The question is what should the default be when any input to a model variable is a shared variable.

with pm.Model() as m:
  mu = pm.MutableData("mu", 0)
  x = pm.Normal("x", mu)
  y = pm.Normal("y, x, observed=[10, 10, 10])

  idata = pm.sample()
  pp = pm.sample_posterior_predictive(idata)

In that example x will be resampled from the prior again, and so will be y by association, which has nothing to do with posterior predictive.

The reason we did these changes was to resample deterministics in GLM models where you have mu=x@beta and x is mutable. We should check if it's possible to restrict the volatile logic to deterministics above the likelihood or variables in var_names, and not to all variables in the model by default.
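
For context, a sketch of that GLM pattern (X_train, y_train and X_new are placeholders; X_new is assumed to have the same number of rows as X_train, to sidestep resizing of the observed variable):

with pm.Model() as glm:
    x = pm.MutableData('x', X_train)                  # changes at prediction time
    beta = pm.Normal('beta', 0, 1, shape=X_train.shape[1])
    mu = pm.Deterministic('mu', x @ beta)             # depends on shared data -> volatile
    pm.Normal('y', mu, 1, observed=y_train)
    idata = pm.sample()

with glm:
    pm.set_data({'x': X_new})
    # beta keeps its posterior draws; mu and y are recomputed for X_new
    pp = pm.sample_posterior_predictive(idata, var_names=['mu', 'y'])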

ricardoV94 avatar Aug 20 '22 09:08 ricardoV94

I'm having issues getting an out of sample prediction with coords changing for the out of sample data as well. Any progress on this issue?

jbh1128d1 avatar Aug 29 '22 16:08 jbh1128d1

For now the solution is to not use MutableData for inputs of unobserved distributions
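
Concretely, a sketch of the distinction (reusing the names and data from the original example):

# avoid: a mutable coord on the dims of an unobserved variable
with pm.Model() as m_avoid:
    m_avoid.add_coord('name', ['A', 'B', 'C'], mutable=True)
    mu = pm.Normal('mu', 0, 10, dims='name')          # will be resampled from the prior

# fine: immutable coords for parameter dims, mutable data only feeding the likelihood
with pm.Model(coords={'name': ['A', 'B', 'C']}) as m_fine:
    obs_data = pm.MutableData('obs_data', data)
    mu = pm.Normal('mu', 0, 10, dims='name')
    sigma = pm.HalfNormal('sigma', 10, dims='name')
    pm.Normal('obs', mu, sigma, observed=obs_data)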

ricardoV94 avatar Aug 30 '22 09:08 ricardoV94

Does that mean not to use pm.MutableData(....) or not to use pm.Data(...., mutable = True)? Can I use the latter with add_coords method?

jbh1128d1 avatar Aug 30 '22 13:08 jbh1128d1

Does that mean not to use pm.MutableData(....) or not to use pm.Data(...., mutable = True)? Can I use the latter with add_coords method?

pm.MutableData(....) is just a shortcut for pm.Data(...., mutable = True).

Check their implementation to see how they call add_coords under the hood..

michaelosthege avatar Aug 30 '22 17:08 michaelosthege

I apologize as I'm still confused. Am I understanding correctly, if I fit a model with coords, then I can't make out of sample predictions with coords that go with that sample?

jbh1128d1 avatar Sep 02 '22 20:09 jbh1128d1

Has there been any forward movement on this? I'm still not sure how to make predictions with new data without getting an error using the new coords.

jbh1128d1 avatar Sep 19 '22 13:09 jbh1128d1

Any more advances in the sample posterior problem? I'd need to add variable coords and I don't find any workaround for this issue.

sp-leonardo-alcivar avatar Sep 22 '22 21:09 sp-leonardo-alcivar

Hi, Is this the volatile logic in question? L1698 in sampling.py

Looking at the source and being a complete newbie when it comes to development and advanced python code, here are my thoughts:

It depends on whether the nodes from fg = FunctionGraph(outputs=outputs, clone=False) are named or not. If they are, it should be easy enough to filter the nodes by the coord names after the volatility logic, just like is done in the return value of compile_forward_sampling_function with volatile_nodes - set(givens_dict). I am only somewhat familiar with set, but it seems that subtracting set(coord_named_shared_variables) would work as a solution?

The reason we did these changes was to resample deterministics in GLM models where you have mu=x@beta and x is mutable. We should check if it's possible to restrict the volatile logic to deterministics above the likelihood or variables in var_names, and not to all variables in the model by default.

It looks like var_names gets filled with observed_rvs + auto_deterministics; I'm not sure at what point an unobserved_rv becomes an observed_rv. Possibly after sampling? If so, it seems var_names just needs to be filtered by the coord names?

If the goal is to update the coords via a shared variable, does that actually mean we need to resample, since the data is being changed? My intuition says no, but there would be a loss of data(?).

If they're not named, SharedVariables could be tagged with a name attribute, though I have no idea whether that is possible. I just looked, and it seems SharedVariables are an Aesara thing. Would it be possible to tag SharedVariables with a name attribute anyway?

Yeah, we hadn't thought about shared-value coordinates when we designed the volatility propagation rule in sample_posterior_predictive. We will have to change it slightly. If the user supplies an InferenceData object, we can retrieve the constant data group and the coord values from the posterior group, and check whether the tensor values have changed by the time we call sample_posterior_predictive. If they have changed, we mark them as volatile; if they haven't, we don't. That should fix this issue. The problem I see is what we should do when the input isn't an InferenceData object. Maybe we should assume that coords are not volatile but other tensors are?

I like this solution too, I'm just not advanced enough to implement it. I checked out the inference data from pm.sample, but I'm not seeing constant data, is this in the actual InferenceData object? I see that's arviz.

Thanks all.

jcstucki avatar Sep 23 '22 16:09 jcstucki

Hi, Is this the volatile logic in question? L1698 in sampling.py

Looking at the source and being a complete newbie when it comes to development and advanced python code, here are my thoughts:

It depends on whether the nodes from fg = FunctionGraph(outputs=outputs, clone=False) are named or not. If they are, it should be easy enough to filter the nodes by the coord names after the volatility logic, just like is done in the return value of compile_forward_sampling_function with volatile_nodes - set(givens_dict). I am only somewhat familiar with set, but it seems that subtracting set(coord_named_shared_variables) would work as a solution?

Hi @jcstucki, yes, that's the function with the volatile logic. The problem that prevents doing what you suggest is that mutable coordinates might or might not have changed between having done inference and trying to make predictions. If the coordinates did change, then they should be volatile. If they did not change, they shouldn't. At the moment, the code is naive and defensive: it assumes that any shared variable might have changed, so its descendants should also be volatile. What you are proposing is to do the opposite and assume that they never change. To get the correct behaviour, we will need to actually detect a change between inference and prediction, and use that to mark volatile shared variables.

I like this solution too, I'm just not advanced enough to implement it. I checked out the inference data from pm.sample, but I'm not seeing constant data, is this in the actual InferenceData object? I see that's arviz.

Yes, it’s an arviz inference data object. The code that generates it is in the backends folder, under arviz (can’t link because I’m on my phone). Anyway, we intend to get around to fixing this next week.

The workaround at the moment is to use ConstantData objects for inference and retrodiction. Then, when you want to do predictions on new data, you will have to create a new model instance using MutableData, or potentially different RV names, so that the values in the trace are ignored when appropriate. Regrettably, I can't explain how to do this without seeing your model code.
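
A sketch of that workaround for a model like the one in this issue (the prediction shape of 100 new rows is just illustrative):

# inference/retrodiction: everything constant
with pm.Model(coords={'name': ['A', 'B', 'C']}) as inference_model:
    mu = pm.Normal('mu', 0, 10, dims='name')
    sigma = pm.HalfNormal('sigma', 10, dims='name')
    pm.Normal('obs', mu, sigma, observed=data)
    idata = pm.sample()

# prediction: a new model with the same rv names, so the posterior draws in idata are reused
with pm.Model(coords={'name': ['A', 'B', 'C']}) as prediction_model:
    mu = pm.Normal('mu', 0, 10, dims='name')
    sigma = pm.HalfNormal('sigma', 10, dims='name')
    pm.Normal('obs', mu, sigma, shape=(100, 3))       # no observed values -> predict new rows
    pp = pm.sample_posterior_predictive(idata, var_names=['obs'])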

lucianopaz avatar Sep 23 '22 19:09 lucianopaz

It's still unclear what the goal of the first example in the issue is with regard to mutable coords of unobserved variables. If the idea is to sample and then change the coords to, e.g., ["A", "B", "C", "D"] and hope the model gets one extra dimension while the entries corresponding to the old ["A", "B", "C"] stay intact, that's NOT going to happen. All dimensions would be resampled from the prior.

If the goal is to just reuse a model for the whole sampling+predictions without changing coords in between then a solution like what @lucianopaz proposed should be fine.

To extend a model one could instead add two groups of variables, one with fixed coords for sampling and one with mutable ones for predictions, and stack the two together. It's possible that you could give the prediction variables an initial size of zero (not sure about coords but it should be feasible) and then "grow" those for posterior predictive?

ricardoV94 avatar Sep 25 '22 18:09 ricardoV94

Having looked at a couple of cases where this problem occurs, the main culprit is the heavy use of index mappings to "expand" estimated parameters to match the data. While it's true that the number of groups will not change (we cannot add "D"), the mapping between unobserved variables (e.g. intercepts) and rows of the observed data does change between inference and prediction time. One can imagine that the "training data" came from groups A, A, B, B, C, C, but we only want to predict for a single observation from group B. Given a usual setup, like:

group_idx, _ = pd.factorize(data.index)

with pm.Model() as model:
  group_map = pm.Data('group_idx', group_idx, mutable=True)
  # non-centered intercept prior
  mu = intercept[group_map]
  # likelihood, etc.

The user's expectation is that the index map group_idx will need to change to match new data, but this will in turn render the entire prior hierarchy over intercept volatile.

As I pointed out in a thread on the discourse, this can be avoided by working with design matrices. But I would hazard a guess that close to 100% of the example notebooks and video tutorials floating around out there use index mappings rather than design matrices. It means that people will commonly run into this issue if they follow along with e.g. the Radon example and try to extend it to prediction using the set_data API. We've already seen it come up independently several times.

jessegrabowski avatar Sep 25 '22 19:09 jessegrabowski

@jessegrabowski what's the problem with indexing designs? Indexing operations (and their coords) are free to change for prediction in most examples I've seen (e.g. mu = b[idx] + x*m[idx]); the problem is when you want to change the coordinates of unobserved variables (b, m).

ricardoV94 avatar Sep 25 '22 19:09 ricardoV94

group_idx, _ = pd.factorize(data.index)

with pm.Model() as model:
  group_map = pm.Data('group_idx', group_idx, mutable=True)
  # non-centered intercept prior
  mu = intercept[group_map]
  # likelihood, etc.

The user's expectation is that the index map group_idx will need to change to match new data, but this will in turn render the entire prior hierarchy over intercept volatile.

This is the exact use case that we had in mind while thinking about volatility. The intercept random variable won’t be volatile because it doesn’t depend on the mutable data. Instead the deterministic mu variable will be volatile and its value will be recomputed using the group_map.
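
In that intended flow, prediction looks roughly like this (new_group_idx and the accompanying new data are hypothetical; idata is assumed to come from pm.sample on the model sketched above):

with model:
    pm.set_data({'group_idx': new_group_idx})
    # intercept keeps its posterior draws; the deterministic mu (and the likelihood)
    # are recomputed for the new index map
    pp = pm.sample_posterior_predictive(idata)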

The case that we didn’t consider was when the dimensionality of the intercept random variable was determined by a shared variable. Just to make this point clear, the design matrix approach would also suffer there because you would have a design matrix multiplied by a variable sized vector.

As pointed out by @ricardoV94, the hardest thing to automate would be to determine which entries of the intercept should be taken from the prior and which from the posterior. Although difficult, aesara rewrites do enable us to do this, but I feel that it’s safer to begin only with a smarter default volatility for shared variables.

lucianopaz avatar Sep 25 '22 19:09 lucianopaz

To handle the mixed in-sample and out-of-sample indexes, we would mainly have to rewrite the intercept_rv. A super-simplified, pseudocode-ish way of doing this would look something like this:

out_of_sample_rv = intercept.clone()  # with the correct resizing and matching inputs
new_intercept = at.zeros(the_size_and_dtype)
new_intercept = at.set_subtensor(new_intercept[in_sample_idx], intercept)
new_intercept = at.set_subtensor(new_intercept[out_of_sample_idx], out_of_sample_rv)

lucianopaz avatar Sep 25 '22 19:09 lucianopaz

Well, I went and tested it carefully, and @ricardoV94 and @lucianopaz, you're totally right. I will slink off back to my hole of shame. I now agree that this is a problem of communicating best practices, i.e. making clear distinctions between fixed hierarchical maps and flexible data maps, rather than an actual bug. Since the case of "mapping indices to deterministic variables" was clearly covered, I'm inclined to agree with @ricardoV94 that I can't think of a time when it's reasonable to expect the parameter shapes to change. Perhaps this is just a lack of imagination. But let it be known that my comments were based on a fundamental misunderstanding of the situation.

I feel bad for spewing a wall of text at that poor guy on the discourse rather than just telling him he needs to fix a couple variables.

The only other thing I'd say is that maybe pm.sample_posterior_predictive should raise a warning if there are SharedVariables in any distribution sizes, letting the user know this can lead to unexpected outcomes.

jessegrabowski avatar Sep 25 '22 19:09 jessegrabowski

No shame at all @jessegrabowski! I think that the whole point of open source development is to have engaged people, like yourself, speak up and iterate as a community. When I started to contribute to pymc3, I was quite clueless and made some horrendous pieces of work, like #2991. Thanks to the team and the community I learnt a lot and kept trying to do stuff and improve. I really don’t want you to lose confidence because of a slight misunderstanding. About warnings and logs, @ricardoV94 recently merged a pr to log the names of the variables that are being resampled. Do you think that’s good enough?

lucianopaz avatar Sep 25 '22 20:09 lucianopaz

Yeah I'd say so. It's come up enough lately that something should be explicitly stated when you sample the ppc. I think a little communication can go a long way here.

jessegrabowski avatar Sep 25 '22 21:09 jessegrabowski

This is the PR @lucianopaz mentioned about making posterior predictive more transparent: https://github.com/pymc-devs/pymc/pull/6142

About the feelings of shame, I heard the only way to get rid of them is to open a PR that closes an open issue. Maybe two :P

But honestly, misunderstandings tell us something very useful about how people are trying to use and misuse our tools so that we can try to make it better/more clear. Posterior predictive is clearly an area we have to document better (see this recent issue for example https://github.com/pymc-devs/pymc-examples/issues/418)

ricardoV94 avatar Sep 26 '22 01:09 ricardoV94

Want to do a PR?

twiecki avatar Oct 11 '22 08:10 twiecki

What is the consensus on warnings/messages? It seems like there was at least some move to reduce their number from v3 to v4 (I'm thinking about how a warning is no longer given when r_hat are large after sampling, for example).

Would a warning on pm.sample_posterior_predictive that says something like "The shapes of the following random variables have changed, and will be sampled using their prior distributions: ..." be too "spammy" under the prevailing consensus on warnings?

jessegrabowski avatar Oct 11 '22 09:10 jessegrabowski

Would a warning on pm.sample_posterior_predictive that says something like "The shapes of the following random variables have changed, and will be sampled using their prior distributions: ..." be too "spammy" under the prevailing consensus on warnings?

Do you think that the information that is logged after this pr isn’t enough?

lucianopaz avatar Oct 11 '22 09:10 lucianopaz

and will be sampled using their prior distributions

That's not entirely correct. You may be resampling intermediate variables, still conditioned on the values of inferred hyperparameters.

Anyway, I think that logging the variables names as we do now removes any surprises. I don't think we need to tell users in words why those are being resampled.

ricardoV94 avatar Oct 11 '22 09:10 ricardoV94