pymc icon indicating copy to clipboard operation
pymc copied to clipboard

Remove `samples` and `keep_size` from posterior_predictive

Open ricardoV94 opened this issue 2 years ago • 16 comments

A bit more work than what was needed for #5772, mostly because we use samples in several tests.

Quoting @OriolAbril

imo kill everything, size, keep_size and samples (to get less samples one can always provide a thinned posterior as shown in the example)

ricardoV94 avatar May 17 '22 13:05 ricardoV94

Hi, I'd be interested in working on this. I'm new to contributing to pymc so would appreciate any pointers on what to target for this issue.

Seems like we'll want to update code and docstrings in pymc/sampling.py, as well as any tests in pymc/tests/ that call on sample_posterior_predictive with these args. Anything else?

covertg avatar Jun 08 '22 15:06 covertg

That sounds right. I think the examples in the docs (for example but probably not restricted to: https://www.pymc.io/projects/docs/en/stable/learn/core_notebooks/posterior_predictive.html) are already not using either argument, but it'd be best to check too.

OriolAbril avatar Jun 08 '22 17:06 OriolAbril

Thanks! ~~Should we also remove the samples argument from sample_posterior_predictive_w()?~~ Edit: never mind, I see it's a work in progress for v4 (#5800)

covertg avatar Jun 10 '22 14:06 covertg

Another question: if the user passes return_inferencedata=False, we can either reshape the array elements of the return value (which is type Dict[str, ndarray]) to be shape (chain, draw, ...), or just leave them as-is (flattened arrays). Which should we do?

The default behavior currently is to reshape (since keep_size defaults to True) but just wanted to make sure, given that we're nixing keep_size. The tests currently make use of both, but it's much more weighted towards reshaping.

covertg avatar Jun 10 '22 15:06 covertg

I think we can also remove this and always return InferenceData. Unlike with pm.sample where there is some info that is still present only in the old trace that is not the case here. What do you think? cc @ricardoV94 @michaelosthege

OriolAbril avatar Jun 10 '22 19:06 OriolAbril

I would prefer to keep having the trace option

ricardoV94 avatar Jun 10 '22 19:06 ricardoV94

Here the options are dict and InferenceData, trace is only on pm.sample (which I agree can't be removed yet)

OriolAbril avatar Jun 10 '22 19:06 OriolAbril

The default behavior currently is to reshape (since keep_size defaults to True) but just wanted to make sure, given that we're nixing keep_size. The tests currently make use of both, but it's much more weighted towards reshaping.

Reshaping sounds alright.

ricardoV94 avatar Jun 10 '22 20:06 ricardoV94

Here the options are dict and InferenceData, trace is only on pm.sample (which I agree can't be removed yet.

Is a dict also what we get from prior predictive with return_inferencedata = False?

ricardoV94 avatar Jun 10 '22 20:06 ricardoV94

yep, we could get rid of the dict output in prior sampling I think

OriolAbril avatar Jun 10 '22 20:06 OriolAbril

Unless there is a large maintenance burden I rather keep it, I was just asking to understand if there was something special about posterior predictive.

I understand a lot of people prefer InferenceData, but I find it a PITA to be honest.

ricardoV94 avatar Jun 10 '22 20:06 ricardoV94

I see the benefits of a trace object with a simpler API, but the current implementation of BaseTrace/MultiTrace has some terrible incoherences.

We'll always have a non-InferenceData data structure to manage draws at MCMC-time. If that thing has a coherent API, I think we can offer to return it to the user.

michaelosthege avatar Jun 11 '22 01:06 michaelosthege

@michaelosthege this is about the dict alternative from prior/posterior_predictive. I also thought it was a Multitrace object, but it's even simpler than that.

I don't see the problem woth a simple dict :) Also, I can come up with cases where the idata would fail, such as dynamic sized variables, although those are overall not supported in PyMC

ricardoV94 avatar Jun 11 '22 06:06 ricardoV94

Happy to incorporate the decision either way in the PR. I can get back to it early this coming week.

A bit of feedback as a user; the fact that the return_inferencedata argument behaves differently in different methods (ie False to the mcmc sampler returns multitrace, False to the predictive samplers returns dict) was a bit confusing at first. It seems like really they're different things, or at least the behavior of "don't return inferencedata" if kept could be made more clear at the argument level.

covertg avatar Jun 12 '22 11:06 covertg

Also, not sure how important this is, but it seems that nearly all of the tests that do posterior predictive sampling currently use return_inferencedata=False. Is this fine left as-is or should any be updated to use idata?

covertg avatar Jun 12 '22 11:06 covertg

Since return_inferencedata=True is now the default, I agree that the naming could be improved. Something like:

  • pm.sample(return_mtrace=False)
  • pm.sample_p*_predictive(return_dict=False)

w.r.t. to the original description of the issue, I'm also in favor of removing samples, keep_size arguments from sample_posterior_predictive. Then thinned sampling will only be supported with xarray inputs, as shown in the docstring: https://github.com/pymc-devs/pymc/blob/3f2afb24df9351e0881eeae7994603f2ec16e31d/pymc/sampling.py#L1812-L1819

Most tests use return_inferencedata=False simply because they were not migrated. But skipping the InferenceData conversion is also slightly faster, so I think we should stick to return_inferencedata=False/return_dict=True in the test suite.

Likewise many tests use the pm.sample(return_inferencedata=False) option because working with the MultiTrace representation is less verbose and skips the conversion step.

michaelosthege avatar Jun 12 '22 12:06 michaelosthege