Dims is not respected when broadcasting parameters
The dims argument is not respected when a random variable is created with parameters that broadcast to the desired output dimensions:
import numpy as np
import pandas as pd
import pymc as pm

coords = {
    "dim1": pd.RangeIndex(10),
    "dim2": pd.RangeIndex(7),
}
with pm.Model(coords=coords) as model:
    mu = np.zeros((10, 1))
    x = pm.Normal("x", mu=mu, dims=("dim1", "dim2"))
    y = pm.Normal("y", mu=mu, dims=("dim1", "dim2"), shape=(10, 7))

x.eval().shape
# Returns (10, 1), when it should be (10, 7)
y.eval().shape
# Correctly returns (10, 7)
- PyMC/PyMC3 Version: latest release 4.1.3
- Aesara/Theano Version:
- Python Version:
- Operating system:
- How did you install PyMC/PyMC3: conda
Looks like shape is not checked against dims when the random variable is created. However, when sampling, this results in an error in xarray when the inference data is created:
with pm.Model(coords=coords) as model:
    mu = np.zeros((10, 1))
    # mu = np.zeros((10, 7))  # this works
    x = pm.Normal("x", mu=mu, dims=("dim1", "dim2"))
    samples = pm.sample_prior_predictive()
fails with
ValueError: conflicting sizes for dimension 'dim2': length 1 on the data but length 7 on coordinate 'dim2'
Traceback
Traceback (most recent call last):
  File "/Users/benedikt/code/pymc/nb_examples/issue_5993.py", line 15, in <module>
    samples = pm.sample_prior_predictive()
  File "/Users/benedikt/code/pymc/pymc/sampling.py", line 2267, in sample_prior_predictive
    return pm.to_inference_data(prior=prior, **ikwargs)
  File "/Users/benedikt/code/pymc/pymc/backends/arviz.py", line 591, in to_inference_data
    ).to_inference_data()
  File "/Users/benedikt/code/pymc/pymc/backends/arviz.py", line 522, in to_inference_data
    **self.priors_to_xarray(),
  File "/Users/benedikt/code/pymc/pymc/backends/arviz.py", line 471, in priors_to_xarray
    else dict_to_dataset(
  File "/Users/benedikt/miniforge3/envs/pymc-dev/lib/python3.10/site-packages/arviz/data/base.py", line 307, in dict_to_dataset
    data_vars[key] = numpy_to_data_array(
  File "/Users/benedikt/miniforge3/envs/pymc-dev/lib/python3.10/site-packages/arviz/data/base.py", line 254, in numpy_to_data_array
    return xr.DataArray(ary, coords=coords, dims=dims)
  File "/Users/benedikt/miniforge3/envs/pymc-dev/lib/python3.10/site-packages/xarray/core/dataarray.py", line 402, in __init__
    coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
  File "/Users/benedikt/miniforge3/envs/pymc-dev/lib/python3.10/site-packages/xarray/core/dataarray.py", line 152, in _infer_coords_and_dims
    raise ValueError(
ValueError: conflicting sizes for dimension 'dim2': length 1 on the data but length 7 on coordinate 'dim2'
I can also create a random variable with a shape that's inconsistent with the lengths implied by dims, and I receive a similar error when sampling:
x = pm.Normal("x", dims=("dim1", "dim2"), shape=(2, 2))
How about resolving this by using the following logic if dims is provided?
- If shape is not given, set shape to the value inferred from dims.
- If shape is given, assert that shape is consistent with dims (see the sketch below).
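A rough sketch of that consistency check (hypothetical, assuming it runs somewhere with access to model.dim_lengths; dim lengths can be symbolic, so the naive comparison below is only illustrative):

# Hypothetical sketch of the consistency check; model.dim_lengths values
# can be symbolic Variables, so a real implementation may need to defer
# or evaluate this comparison rather than compare tuples directly.
inferred_shape = tuple(model.dim_lengths[d] for d in dims)
if "shape" in kwargs and tuple(kwargs["shape"]) != inferred_shape:
    raise ValueError(
        f"shape {kwargs['shape']} is inconsistent with dims {dims}"
    )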
Perhaps this could be implemented in Distribution.__new__?
@bherwerth I like your idea for the fix.
I had a closer look at the code and I see that there is already logic in place for resizing RVs based on dims (and observed):
https://github.com/pymc-devs/pymc/blob/c8db06b125b1bde286cc00298239980a807d26fa/pymc/distributions/distribution.py#L263-L269
However, it seems dims only triggers a resize when the number of dimensions changes:
https://github.com/pymc-devs/pymc/blob/c8db06b125b1bde286cc00298239980a807d26fa/pymc/distributions/shape_utils.py#L525
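Concretely, with the coords from the top of the issue (if I read the resize logic right), a dims entry that adds a new dimension on the left triggers a resize, while one that only broadcasts an existing dimension does not:

with pm.Model(coords=coords) as model:
    # mu implies shape (7,); dims add "dim1" on the left, ndim changes, resize happens
    a = pm.Normal("a", mu=np.zeros(7), dims=("dim1", "dim2"))
    # mu already implies ndim 2, so no resize happens and the length of "dim2" is ignored
    b = pm.Normal("b", mu=np.zeros((10, 1)), dims=("dim1", "dim2"))
print(a.eval().shape)  # (10, 7)
print(b.eval().shape)  # (10, 1) -- the bug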
To address this issue, we would need a resize also when shape changes and ndim remains the same. Given the existing mechanism of resizes, I wonder what's best:
- Modify the existing logic after Distribution.dist is called so that it also works for shape changes with constant ndim.
- Or: set shape based on dims before the RV is created in Distribution.dist in https://github.com/pymc-devs/pymc/blob/c8db06b125b1bde286cc00298239980a807d26fa/pymc/distributions/distribution.py#L165
When reading the code in distribution.Distribution, I was wondering why the RV is first created with Distribution.dist and then resized later based on dims (or observed). Why is this done in a second, separate step and not together with the resize in Distribution.dist? I am referring to this resize: https://github.com/pymc-devs/pymc/blob/c8db06b125b1bde286cc00298239980a807d26fa/pymc/distributions/distribution.py#L356
CC @michaelosthege
This is not an easy problem.
The test case should also cover the case where the dims are initialized from MutableData:
with pm.Model() as pmodel:
    dat = pm.MutableData(
        "dat",
        np.ones((10, 7)),
        dims=("dim1", "dim2"),
    )
    dlength_1 = pmodel.dim_lengths["dim1"]
    dlength_2 = pmodel.dim_lengths["dim2"]
    mu = np.zeros((10, 1))
    x = pm.Normal("x", mu=mu, dims=("dim1", "dim2"))
    y = pm.Normal("y", mu=mu, shape=(dlength_1, dlength_2))
print(x.eval().shape)
print(y.eval().shape)
This indeed looks like a more involved task. The current resize logic extends dimensions to the left, and I don't see a quick way to adapt it to cover broadcasting from dims. There are also edge cases to consider, like specifying dims that are not in model.dim_lengths.
Would you agree in principle that it would make sense to determine shape from dims or observed before the call to Distribution.dist and to do all resizing in Distribution.dist? Could you explain why there is an additional resize after Distribution.dist in the current implementation? This looks a bit convoluted to me and perhaps the logic could be restructured by resizing only in one place.
At this point the main reason for the additional resize is the Ellipsis feature by which the dimensionality of a variable can be extended.
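For reference, my understanding of the feature: a trailing Ellipsis in shape (or dims) keeps the dimensions implied by the parameters and only prepends the new ones, e.g.:

mu = np.zeros(3)
# the trailing ... keeps the (3,) implied by mu; only a leading batch dimension is added
x = pm.Normal.dist(mu=mu, shape=(5, ...))
x.eval().shape  # (5, 3)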
It's cool, but I never actually used it in practice. Maybe we should deprecate it again? Then one would have to specify all shape elements or dims elements every time. For shape that's annoying, but for dims that's probably the default use case anyhow.
There might be additional reasons for the resize related to observed.
I think it's reasonable to take the shape from dims (if shape isn't provided). Happy to remove ellipsis as well.
I don't see why observed would conflict with any of this.
I think something similar happens with observed, as brought up in #5992 (but that is a bit unrelated to the original issue).
Thanks for explaining and clarifying the reason for the additional resize.
> I think it's reasonable to take the shape from dims (if shape isn't provided). Happy to remove ellipsis as well.
I understood @michaelosthege's suggestion to drop the ellipsis as applying only if we wanted to condense the resize steps into a single one. I raised the question of why there are two resize steps when reading the code in Distribution. However, this is not directly related to the original problem. Maybe it's better to address this, if at all, separately from this issue.
Coming back to the original suggestion of inferring shape from dims, perhaps the following would be a minimal change in Distribution.__new__ that solves the issue while not interfering with the ellipsis feature:
if dims is not None and "shape" not in kwargs:
    dims = convert_dims(dims)
    if len(dims) > 0 and dims[-1] != Ellipsis and all(d in model.dim_lengths for d in dims):
        kwargs["shape"] = tuple(model.dim_lengths[d] for d in dims)
That is, we would only infer shape from dims if:
- Ellipsis is not used (in this case the user's intent would anyway be to 'extend' implied dimensions to the left rather than to broadcast),
- and all specified dimensions are available in model.dim_lengths.
If you agree with this proposal, I can add a test case and create a PR.
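For concreteness, the test could assert the fixed behavior, something like this (a sketch, reusing the coords from the top of the issue):

coords = {"dim1": pd.RangeIndex(10), "dim2": pd.RangeIndex(7)}
with pm.Model(coords=coords) as model:
    # mu broadcasts along dim2, so the shape has to come from dims
    x = pm.Normal("x", mu=np.zeros((10, 1)), dims=("dim1", "dim2"))
# with the proposed fix, shape is inferred from dims
assert x.eval().shape == (10, 7)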
Just to give the big picture, the problem is that we wrongly assumed that dims cannot provide new shape information for dimensions that are already implied by the parameters. So we simply created an RV and then only looked at dims for dimensions that were missing on the left for resizing. However, we need to resize all dimensions because dims may broadcast the dimensions implied by the parameters as well.
This is equally true for resizing from observed, I think.
> [...] the problem is that we wrongly assumed that dims cannot provide new shape information for dimensions that are already implied by the parameters.
this ☝️
The code snippet @bherwerth posted looks fine at first glance, but to systematically fix this we also have to consider the case where there's a mix of already-known and not-yet-known dims.
And ultimately we have to update our decision about what the intended behavior is:
- Should dims take priority over shape?
- Should dims take priority over size?
- What should happen if an already-known dim has a different length than the corresponding entry in shape or size? How do we detect that? Do we detect this at all, or will dims just silently override whatever shape/size may say?
One solution to clean up this behavior mess could be to replace shape, size, and dims with one kwarg typed Sequence[Optional[Union[str, int, Variable]]].
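To illustrate (purely hypothetical, not an existing API), such a kwarg could mix dim names, fixed lengths, and None for "infer from parameters":

# Hypothetical unified kwarg; none of this exists in PyMC today.
x = pm.Normal("x", mu=mu, shape=("dim1", 7))     # dim name + fixed length
y = pm.Normal("y", mu=mu, shape=("dim1", None))  # None = infer from parameters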
IMO we should thoroughly think this through before touching it.
As general advice to everybody: the best way to avoid shape issues (like the one this issue was originally about) is to not rely on "smart" broadcasting, but to do all broadcasts/resizes intentionally.
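For example, instead of relying on dims to broadcast the parameter, one can broadcast it explicitly so the intended shape is unambiguous:

with pm.Model(coords=coords) as model:
    # broadcast mu to the full target shape up front
    mu = np.broadcast_to(np.zeros((10, 1)), (10, 7))
    x = pm.Normal("x", mu=mu, dims=("dim1", "dim2"))  # parameters already imply (10, 7)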
About precedence: size/shape > dims > observed
Because they go from more flexible to more inflexible. Size/shape can be an arbitrary symbolic expression, dims and observed are usually static but can be resized, and observed is a bit more implicit and restricted to likelihoods.
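To illustrate the first point, shape can be an arbitrary symbolic expression that tracks the data (a sketch):

with pm.Model() as model:
    dat = pm.MutableData("dat", np.ones(5))
    # the shape is a symbolic expression following the data length
    x = pm.Normal("x", shape=(dat.shape[0],))
print(x.eval().shape)  # (5,)
# after pm.set_data({"dat": np.ones(8)}), x would have length 8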