pymc
Imputation does not work in combination with pm.Data
Description
Even after https://github.com/pymc-devs/pymc3/pull/4439 imputations don't work in combination with pm.Data.
This is because pm.Data creates a SharedVariable that currently does not support a np.ma.MaskedArray.
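To illustrate the failure mode, here is a minimal numpy-only sketch (no PyMC3 involved) of what happens when a masked array is coerced into the plain ndarray that a SharedVariable stores — the mask is lost and the NaN survives:

```python
import numpy as np

# Data with one missing entry, as in the example below.
data = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, float("nan")],
    [7.0, 8.0, 9.0],
])

# PyMC3's imputation triggers on a np.ma.MaskedArray: masked entries
# become free variables that are imputed during sampling.
masked = np.ma.masked_invalid(data)
assert masked.mask.sum() == 1

# A SharedVariable can only hold a plain ndarray. Coercing the masked
# array (roughly what happens when it goes through pm.Data) silently
# drops the mask, leaving a literal NaN in the observed data ...
coerced = np.asarray(masked)
assert not isinstance(coerced, np.ma.MaskedArray)

# ... so the model's log-probability evaluates to NaN.
assert np.isnan(coerced).any()
```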
Almost identical to the example from #4437:
import numpy
import pymc3

data = numpy.array([
    [1, 2, 3],
    [4, 5, float("nan")],
    [7, 8, 9],
])
print(data)

with pymc3.Model():
    pymc3.Normal(
        "L",
        mu=pymc3.Normal("x", shape=data.shape),
        sd=10,
        observed=pymc3.Data("D", data),
        shape=data.shape,
    )
    pymc3.sample()
Full traceback:
SamplingError Traceback (most recent call last)
<ipython-input-32-e837139c32a4> in <module>
13 shape=data.shape
14 )
---> 15 pymc3.sample()
c:\users\osthege\repos\pymc3-dev\pymc3\sampling.py in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, callback, jitter_max_retries, return_inferencedata, idata_kwargs, mp_ctx, pickle_backend, **kwargs)
425 model = modelcontext(model)
426 if start is None:
--> 427 check_start_vals(model.test_point, model)
428 else:
429 if isinstance(start, dict):
c:\users\osthege\repos\pymc3-dev\pymc3\util.py in check_start_vals(start, model)
236 "Initial evaluation of model at starting point failed!\n"
237 "Starting values:\n{}\n\n"
--> 238 "Initial evaluation results:\n{}".format(elem, str(initial_eval))
239 )
240
SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'x': array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]])}
Initial evaluation results:
x -8.27
L NaN
Name: Log-probability of test_point, dtype: float64
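The `L NaN` row above follows directly from evaluating a normal log-density at a NaN observation. A small numpy sketch (this is the standard closed-form logpdf, not PyMC3 code):

```python
import numpy as np

def normal_logp(x, mu, sd):
    """Log-density of Normal(mu, sd) at x (standard closed form)."""
    return -0.5 * np.log(2 * np.pi * sd ** 2) - 0.5 * ((x - mu) / sd) ** 2

# The mask was dropped, so the NaN stays in the observed array and
# propagates straight through the log-probability ...
logp_at_nan = normal_logp(np.nan, 0.0, 10.0)
assert np.isnan(logp_at_nan)

# ... which is the `L NaN` entry that makes check_start_vals abort.
# A fully observed entry evaluates fine:
assert np.isfinite(normal_logp(1.0, 0.0, 10.0))
```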
Versions and main components
- PyMC3 Version: master (with #4439)
- Theano Version: 1.1.0
Is this something that might be problematic if automatically imputed (setting new Data with more/fewer missing values than the initial data)?
The whole implementation for switching the data in an existing model is broken. Shape issues are one thing, but @ricardoV94 is right that with imputation it could get even worse. Imputation is realized by automatically changing the model graph. Switching out the data afterwards will almost certainly break it unless the mask is identical.
I see pm.Data primarily as a tool to get the data & coords nicely represented in the model graph and resulting InferenceData.
We should probably separate those two use cases into something like pm.Data and pm.MutableData.
I have a hunch that we'll have to revisit the whole imputation feature under the new RandomVariable paradigm. After merging #4439 I'm fine with doing observed=pm.Data(...).container.data as a workaround.
Let's label this "wontfix" and revisit it for PyMC3 >=4.0.
Agreed 👌 Out of curiosity, what do you mean by "doing observed=pm.Data(...).container.data as a workaround" ?
This way I can use the pm.Data to include my data in the InferenceData, but also use it for imputation. The only downside is that the graph I get from pm.model_to_graphviz now shows the Data node as disconnected, but I can live with that.
After learning about imputed variables, I think this feature would require a considerable change to the internals, since all the imputation logic happens during model definition.
Yes, we can't support the combination of SharedVariable + imputation.
That brings me back to the proposal of distinguishing between pm.ConstantData and pm.MutableData or something like that.
Isn't the default observed ConstantData?
Observed data are not automatically tracked with dims/coords and don't show up in model_to_graphviz. Also, it is not always the case that these arrays become observed: sometimes you need a vector of x values as an input to the regression, and so far only by making it a pm.Data can you get it into the InferenceData.
I see... should we check for NaNs in pm.set_data and just prohibit them? Or is it too much of an edge case to be worth bothering?
Let's implement pm.ConstantData and pm.MutableData. Then we can check and warn on NaN and point users to ConstantData, which creates a TensorConstant and registers it for dims/coords with the model.
Could give us some speed-up too, because with constant data the shape is known.
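The NaN check discussed above could look roughly like this — a user-side sketch with a hypothetical helper name, not actual PyMC3 API:

```python
import numpy as np

def assert_no_missing(name, values):
    """Hypothetical guard mirroring the proposed check in pm.set_data:
    reject new data containing NaN/masked entries, since swapping data
    on a model that was built with imputation breaks the graph."""
    masked = np.ma.masked_invalid(values)
    n_missing = int(np.sum(masked.mask))
    if n_missing > 0:
        raise ValueError(
            f"New data for '{name}' contains {n_missing} missing value(s). "
            "Mutable data cannot be combined with imputation; "
            "pass a masked array at model creation time instead."
        )
    return np.asarray(values)
```

With a check like this, pm.set_data could fail loudly at assignment time instead of producing a NaN log-probability during sampling.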
I think this can be closed since one can now use pm.ConstantData
I think imputation will still fail with ConstantData
I think imputation will still fail with ConstantData
Oh, because we're not yet checking for NaNs in tensors? Yeah, sorry, I forgot about that. Good call to reopen it 🙏