
Imputation does not work in combination with pm.Data

Open michaelosthege opened this issue 4 years ago • 14 comments

Description

Even after https://github.com/pymc-devs/pymc3/pull/4439 imputations don't work in combination with pm.Data. This is because pm.Data creates a SharedVariable that currently does not support a np.ma.MaskedArray.

Almost identical to the example from #4437:

import numpy
import pymc3

data = numpy.array([
    [1, 2, 3],
    [4, 5, float("nan")],  # the NaN entry is what should trigger automatic imputation
    [7, 8, 9],
])
print(data)
with pymc3.Model():
    pymc3.Normal(
        "L",
        mu=pymc3.Normal("x", shape=data.shape),
        sd=10,
        observed=pymc3.Data("D", data),
        shape=data.shape,
    )
    pymc3.sample()

Full traceback:

SamplingError                             Traceback (most recent call last)
<ipython-input-32-e837139c32a4> in <module>
     13         shape=data.shape
     14     )
---> 15     pymc3.sample()

c:\users\osthege\repos\pymc3-dev\pymc3\sampling.py in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, callback, jitter_max_retries, return_inferencedata, idata_kwargs, mp_ctx, pickle_backend, **kwargs)
    425     model = modelcontext(model)
    426     if start is None:
--> 427         check_start_vals(model.test_point, model)
    428     else:
    429         if isinstance(start, dict):

c:\users\osthege\repos\pymc3-dev\pymc3\util.py in check_start_vals(start, model)
    236                 "Initial evaluation of model at starting point failed!\n"
    237                 "Starting values:\n{}\n\n"
--> 238                 "Initial evaluation results:\n{}".format(elem, str(initial_eval))
    239             )
    240 

SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'x': array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])}

Initial evaluation results:
x   -8.27
L     NaN
Name: Log-probability of test_point, dtype: float64

Versions and main components

  • PyMC3 Version: master (with #4439)
  • Theano Version: 1.1.0

michaelosthege avatar Jan 25 '21 12:01 michaelosthege

Is this something that might be problematic if automatically imputed (setting new Data with more or fewer missing values than the initial data)?

ricardoV94 avatar Jan 25 '21 12:01 ricardoV94

The whole implementation for switching the data in an existing model is broken. Shape issues are one thing, but @ricardoV94 is right that with imputation it could get even worse. Imputation is realized by automatically changing the model graph. Switching out the data afterwards will almost certainly break it unless the mask is identical.

I see pm.Data primarily as a tool to get the data & coords nicely represented in the model graph and resulting InferenceData. We should probably separate those two use cases into something like pm.Data and pm.MutableData.

michaelosthege avatar Jan 25 '21 12:01 michaelosthege

I have a hunch that we'll have to revisit the whole imputation feature under the new RandomVariable paradigm. After merging #4439 I'm fine with doing observed=pm.Data(...).container.data as a workaround.

Let's label this "wontfix" and revisit it for PyMC3 >=4.0.

michaelosthege avatar Jan 25 '21 12:01 michaelosthege

Agreed 👌 Out of curiosity, what do you mean by "doing observed=pm.Data(...).container.data as a workaround"?

AlexAndorra avatar Jan 25 '21 14:01 AlexAndorra

> Agreed 👌 Out of curiosity, what do you mean by "doing observed=pm.Data(...).container.data as a workaround"?

This way I can use pm.Data to include my data in the InferenceData, but also use it for imputation (see the sketch below). The only downside is that the graph I get from pm.model_to_graphviz now has the Data node disconnected, but I can live with that.
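For reference, a minimal sketch of that workaround (assuming PyMC3 3.x with Theano, where pm.Data returns a Theano SharedVariable whose .container.data attribute holds the underlying NumPy array):

import numpy
import pymc3

data = numpy.array([
    [1, 2, 3],
    [4, 5, float("nan")],
    [7, 8, 9],
])

with pymc3.Model():
    D = pymc3.Data("D", data)  # registers "D" with the model and the InferenceData
    pymc3.Normal(
        "L",
        mu=pymc3.Normal("x", shape=data.shape),
        sd=10,
        # pass the raw NumPy array (with its NaNs) instead of the SharedVariable,
        # so the automatic imputation machinery kicks in
        observed=D.container.data,
        shape=data.shape,
    )
    pymc3.sample()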

michaelosthege avatar Jan 25 '21 14:01 michaelosthege

Having now learned how imputed variables work, I think this feature would require a considerable change to the internals, since all the imputation logic happens during model definition.

ricardoV94 avatar Dec 22 '21 07:12 ricardoV94

Yes, we can't have support for the combination of SharedVariable+imputation. That brings me back to the proposal of distinguishing between pm.ConstantData and pm.MutableData or something like that.

michaelosthege avatar Dec 22 '21 08:12 michaelosthege

Isn't the default observed ConstantData?

ricardoV94 avatar Dec 22 '21 08:12 ricardoV94

Observed values are not automatically tracked with dims/coords and don't show up in model_to_graphviz. Also, it is not always the case that these arrays become observed: sometimes you need a vector of x values as an input to a regression, and so far the only way to get it into the InferenceData is by making it a pm.Data (see the sketch below).
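For illustration, a small sketch of that use case (names are made up; assuming PyMC3, where pm.Data variables end up in the constant_data group of the resulting InferenceData):

import numpy
import pymc3

x = numpy.linspace(0.0, 1.0, 10)
y = 2.0 * x + numpy.random.normal(0.0, 0.1, size=10)

with pymc3.Model():
    # x is never observed, but wrapping it in pm.Data puts it into the
    # model graph and the resulting InferenceData
    x_data = pymc3.Data("x", x)
    slope = pymc3.Normal("slope", mu=0.0, sd=1.0)
    pymc3.Normal("y", mu=slope * x_data, sd=0.1, observed=y)
    idata = pymc3.sample(return_inferencedata=True)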

michaelosthege avatar Dec 22 '21 08:12 michaelosthege

I see... should we check for NaNs in pm.set_data and just prohibit them? Or is it too much of an edge case to be worth bothering with?

ricardoV94 avatar Dec 22 '21 08:12 ricardoV94

Let's implement pm.ConstantData and pm.MutableData. Then we can check for NaN, warn, and point users to ConstantData, which creates a TensorConstant and registers it for dims/coords with the model (see the sketch below).

Could give us some speed-up too, because with constant data the shape is known.
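For context, a rough sketch of how that split could look (this is the API that later shipped in PyMC >= 4, not PyMC3 syntax; ConstantData wraps the value in a TensorConstant, MutableData in a SharedVariable):

import numpy as np
import pymc as pm

x = np.array([1.0, 2.0, 3.0])

with pm.Model(coords={"obs_id": np.arange(3)}) as model:
    # fixed value, static shape, registered for dims/coords
    x_const = pm.ConstantData("x_const", x, dims="obs_id")
    # swappable value, can be replaced later via pm.set_data
    x_mut = pm.MutableData("x_mut", x, dims="obs_id")

with model:
    pm.set_data({"x_mut": np.array([4.0, 5.0, 6.0])})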

michaelosthege avatar Dec 22 '21 09:12 michaelosthege

I think this can be closed since one can now use pm.ConstantData

michaelosthege avatar Jun 18 '22 10:06 michaelosthege

I think imputation will still fail with ConstantData

ricardoV94 avatar Jun 18 '22 11:06 ricardoV94

> I think imputation will still fail with ConstantData

Oh, because we're not yet checking for NaN in tensors? Yeah, sorry, I forgot about that. Good call to open it again 🙏

michaelosthege avatar Jun 18 '22 11:06 michaelosthege