pymc
pymc copied to clipboard
Remove `samples` and `keep_size` from posterior_predictive
A bit more work than what was needed for #5772, mostly because we use samples
in several tests.
Quoting @OriolAbril
imo kill everything,
size
,keep_size
andsamples
(to get less samples one can always provide a thinned posterior as shown in the example)
Hi, I'd be interested in working on this. I'm new to contributing to pymc
so would appreciate any pointers on what to target for this issue.
Seems like we'll want to update code and docstrings in pymc/sampling.py
, as well as any tests in pymc/tests/
that call on sample_posterior_predictive
with these args. Anything else?
That sounds right. I think the examples in the docs (for example but probably not restricted to: https://www.pymc.io/projects/docs/en/stable/learn/core_notebooks/posterior_predictive.html) are already not using either argument, but it'd be best to check too.
Thanks!
~~Should we also remove the samples
argument from sample_posterior_predictive_w()
?~~ Edit: never mind, I see it's a work in progress for v4 (#5800)
Another question: if the user passes return_inferencedata=False
, we can either reshape the array elements of the return value (which is type Dict[str, ndarray]) to be shape (chain, draw, ...), or just leave them as-is (flattened arrays). Which should we do?
The default behavior currently is to reshape (since keep_size
defaults to True) but just wanted to make sure, given that we're nixing keep_size
. The tests currently make use of both, but it's much more weighted towards reshaping.
I think we can also remove this and always return InferenceData. Unlike with pm.sample
where there is some info that is still present only in the old trace that is not the case here. What do you think? cc @ricardoV94 @michaelosthege
I would prefer to keep having the trace option
Here the options are dict and InferenceData, trace is only on pm.sample (which I agree can't be removed yet)
The default behavior currently is to reshape (since
keep_size
defaults to True) but just wanted to make sure, given that we're nixingkeep_size
. The tests currently make use of both, but it's much more weighted towards reshaping.
Reshaping sounds alright.
Here the options are dict and InferenceData, trace is only on pm.sample (which I agree can't be removed yet.
Is a dict also what we get from prior predictive with return_inferencedata = False
?
yep, we could get rid of the dict output in prior sampling I think
Unless there is a large maintenance burden I rather keep it, I was just asking to understand if there was something special about posterior predictive.
I understand a lot of people prefer InferenceData, but I find it a PITA to be honest.
I see the benefits of a trace object with a simpler API, but the current implementation of BaseTrace/MultiTrace has some terrible incoherences.
We'll always have a non-InferenceData
data structure to manage draws at MCMC-time.
If that thing has a coherent API, I think we can offer to return it to the user.
@michaelosthege this is about the dict alternative from prior/posterior_predictive. I also thought it was a Multitrace object, but it's even simpler than that.
I don't see the problem woth a simple dict :) Also, I can come up with cases where the idata would fail, such as dynamic sized variables, although those are overall not supported in PyMC
Happy to incorporate the decision either way in the PR. I can get back to it early this coming week.
A bit of feedback as a user; the fact that the return_inferencedata
argument behaves differently in different methods (ie False
to the mcmc sampler returns multitrace, False
to the predictive samplers returns dict) was a bit confusing at first. It seems like really they're different things, or at least the behavior of "don't return inferencedata" if kept could be made more clear at the argument level.
Also, not sure how important this is, but it seems that nearly all of the tests that do posterior predictive sampling currently use return_inferencedata=False
. Is this fine left as-is or should any be updated to use idata?
Since return_inferencedata=True
is now the default, I agree that the naming could be improved.
Something like:
-
pm.sample(return_mtrace=False)
-
pm.sample_p*_predictive(return_dict=False)
w.r.t. to the original description of the issue, I'm also in favor of removing samples, keep_size
arguments from sample_posterior_predictive
.
Then thinned sampling will only be supported with xarray
inputs, as shown in the docstring:
https://github.com/pymc-devs/pymc/blob/3f2afb24df9351e0881eeae7994603f2ec16e31d/pymc/sampling.py#L1812-L1819
Most tests use return_inferencedata=False
simply because they were not migrated.
But skipping the InferenceData
conversion is also slightly faster, so I think we should stick to return_inferencedata=False
/return_dict=True
in the test suite.
Likewise many tests use the pm.sample(return_inferencedata=False)
option because working with the MultiTrace
representation is less verbose and skips the conversion step.