Dims is not respected when broadcasting parameters
The dims argument is not respected when a random variable is created with parameters that broadcast to the desired output dimensions:
import numpy as np
import pandas as pd
import pymc as pm

coords = {
    "dim1": pd.RangeIndex(10),
    "dim2": pd.RangeIndex(7),
}
with pm.Model(coords=coords) as model:
    mu = np.zeros((10, 1))
    x = pm.Normal("x", mu=mu, dims=("dim1", "dim2"))
    y = pm.Normal("y", mu=mu, dims=("dim1", "dim2"), shape=(10, 7))

x.eval().shape
# Returns (10, 1), when it should be (10, 7)
y.eval().shape
# Correctly returns (10, 7)
- PyMC/PyMC3 Version: latest release 4.1.3
- Aesara/Theano Version:
- Python Version:
- Operating system:
- How did you install PyMC/PyMC3: conda
Looks like shape is not checked against dims when the random variable is created. However, when sampling, this results in an error in xarray when the inference data is created:
with pm.Model(coords=coords) as model:
    mu = np.zeros((10, 1))
    # mu = np.zeros((10, 7))  # this works
    x = pm.Normal("x", mu=mu, dims=("dim1", "dim2"))
    samples = pm.sample_prior_predictive()
fails with
ValueError: conflicting sizes for dimension 'dim2': length 1 on the data but length 7 on coordinate 'dim2'
Traceback
Traceback (most recent call last):
  File "/Users/benedikt/code/pymc/nb_examples/issue_5993.py", line 15, in <module>
    samples = pm.sample_prior_predictive()
  File "/Users/benedikt/code/pymc/pymc/sampling.py", line 2267, in sample_prior_predictive
    return pm.to_inference_data(prior=prior, **ikwargs)
  File "/Users/benedikt/code/pymc/pymc/backends/arviz.py", line 591, in to_inference_data
    ).to_inference_data()
  File "/Users/benedikt/code/pymc/pymc/backends/arviz.py", line 522, in to_inference_data
    **self.priors_to_xarray(),
  File "/Users/benedikt/code/pymc/pymc/backends/arviz.py", line 471, in priors_to_xarray
    else dict_to_dataset(
  File "/Users/benedikt/miniforge3/envs/pymc-dev/lib/python3.10/site-packages/arviz/data/base.py", line 307, in dict_to_dataset
    data_vars[key] = numpy_to_data_array(
  File "/Users/benedikt/miniforge3/envs/pymc-dev/lib/python3.10/site-packages/arviz/data/base.py", line 254, in numpy_to_data_array
    return xr.DataArray(ary, coords=coords, dims=dims)
  File "/Users/benedikt/miniforge3/envs/pymc-dev/lib/python3.10/site-packages/xarray/core/dataarray.py", line 402, in __init__
    coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
  File "/Users/benedikt/miniforge3/envs/pymc-dev/lib/python3.10/site-packages/xarray/core/dataarray.py", line 152, in _infer_coords_and_dims
    raise ValueError(
ValueError: conflicting sizes for dimension 'dim2': length 1 on the data but length 7 on coordinate 'dim2'
I can also create a random variable with a shape that's inconsistent with the lengths implied by dims, and I receive a similar error when sampling:
x = pm.Normal("x", dims=("dim1", "dim2"), shape=(2, 2))
How about resolving this by using the following logic if dims is provided?
- If shape is not given, set shape to the value inferred from dims.
- If shape is given, assert that shape is consistent with dims (see the sketch below).
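A rough sketch of that consistency check (hypothetical, assuming it runs somewhere with access to model.dim_lengths; dim lengths can be symbolic, so the naive comparison below is only illustrative):

# Hypothetical sketch of the consistency check; model.dim_lengths values
# can be symbolic Variables, so a real implementation may need to defer
# or evaluate this comparison rather than compare tuples directly.
inferred_shape = tuple(model.dim_lengths[d] for d in dims)
if "shape" in kwargs and tuple(kwargs["shape"]) != inferred_shape:
    raise ValueError(
        f"shape {kwargs['shape']} is inconsistent with dims {dims}"
    )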
Perhaps this could be implemented in Distribution.__new__?
@bherwerth I like your idea for the fix.
I had a closer look at the code and I see that there is already logic in place for resizing RVs based on dims (and observed):
https://github.com/pymc-devs/pymc/blob/c8db06b125b1bde286cc00298239980a807d26fa/pymc/distributions/distribution.py#L263-L269
However, it seems dims only triggers a resize when the number of dimensions changes:
https://github.com/pymc-devs/pymc/blob/c8db06b125b1bde286cc00298239980a807d26fa/pymc/distributions/shape_utils.py#L525
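Concretely, with the coords from the top of the issue (if I read the resize logic right), a dims entry that adds a new dimension on the left triggers a resize, while one that only broadcasts an existing dimension does not:

with pm.Model(coords=coords) as model:
    # mu implies shape (7,); dims add "dim1" on the left, ndim changes, resize happens
    a = pm.Normal("a", mu=np.zeros(7), dims=("dim1", "dim2"))
    # mu already implies ndim 2, so no resize happens and the length of "dim2" is ignored
    b = pm.Normal("b", mu=np.zeros((10, 1)), dims=("dim1", "dim2"))
print(a.eval().shape)  # (10, 7)
print(b.eval().shape)  # (10, 1) -- the bug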
To address this issue, we would need a resize also when shape changes and ndim remains the same. Given the existing mechanism of resizes, I wonder what's best:
- Modify the existing logic after Distribution.dist is called so that it also works for shape changes with constant ndim.
- Or: set shape based on dims before the RV is created in Distribution.dist in https://github.com/pymc-devs/pymc/blob/c8db06b125b1bde286cc00298239980a807d26fa/pymc/distributions/distribution.py#L165
When reading the code in distribution.Distribution, I was wondering why the RV is first created with Distribution.dist and then resized later based on dims (or observed). Why is this done in a second, separate step and not together with the resize in Distribution.dist? I am referring to this resize: https://github.com/pymc-devs/pymc/blob/c8db06b125b1bde286cc00298239980a807d26fa/pymc/distributions/distribution.py#L356
CC @michaelosthege
This is not an easy problem.
The test case should also cover the case where the dims are initialized from MutableData:
with pm.Model() as pmodel:
    dat = pm.MutableData(
        "dat",
        np.ones((10, 7)),
        dims=("dim1", "dim2"),
    )
    dlength_1 = pmodel.dim_lengths["dim1"]
    dlength_2 = pmodel.dim_lengths["dim2"]
    mu = np.zeros((10, 1))
    x = pm.Normal("x", mu=mu, dims=("dim1", "dim2"))
    y = pm.Normal("y", mu=mu, shape=(dlength_1, dlength_2))
print(x.eval().shape)
print(y.eval().shape)
This indeed looks like a more involved task. The current resize logic extends dimensions to the left, and I don't see a quick way to adapt it to cover broadcasting from dims. There are also edge cases to consider, like specifying dims that are not in model.dim_lengths.
Would you agree in principle that it would make sense to determine shape from dims or observed before the call to Distribution.dist and to do all resizing in Distribution.dist? Could you explain why there is an additional resize after Distribution.dist in the current implementation? This looks a bit convoluted to me and perhaps the logic could be restructured by resizing only in one place.
At this point the main reason for the additional resize is the Ellipsis feature by which the dimensionality of a variable can be extended.
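For reference, my understanding of the feature: a trailing Ellipsis in shape (or dims) keeps the dimensions implied by the parameters and only prepends the new ones, e.g.:

mu = np.zeros(3)
# the trailing ... keeps the (3,) implied by mu; only a leading batch dimension is added
x = pm.Normal.dist(mu=mu, shape=(5, ...))
x.eval().shape  # (5, 3)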
It's cool, but I never actually used it in practice. Maybe we should deprecate it again? Then one would have to specify all shape elements or dims elements every time. For shape that's annoying, but for dims that's probably the default use case anyhow.
There might be additional reasons for the resize related to observed.
I think it's reasonable to take the shape from dims (if shape isn't provided). Happy to remove ellipsis as well.
I don't see why observed would conflict with any of this.
I think something similar happens with observed, as brought up in #5992 (but that is a bit unrelated to the original issue).
Thanks for explaining and clarifying the reason for the additional resize.
> I think it's reasonable to take the shape from dims (if shape isn't provided). Happy to remove ellipsis as well.
I understood @michaelosthege's suggestion to drop the ellipsis as applying only if we wanted to condense the resize steps into a single one. I raised the question of why there are two resize steps when reading the code in Distribution. However, this is not directly related to the original problem. Maybe it's better to address this, if at all, separately from this issue.
Coming back to the original suggestion of inferring shape from dims, perhaps the following would be a minimal change in Distribution.__new__ that solves the issue while not interfering with the ellipsis feature:
if dims is not None and "shape" not in kwargs:
    dims = convert_dims(dims)
    if len(dims) > 0 and dims[-1] != Ellipsis and all(d in model.dim_lengths for d in dims):
        kwargs["shape"] = tuple(model.dim_lengths[d] for d in dims)
That is, we would only infer shape from dims if:
- Ellipsis is not used (in this case the user's intent would anyway be to 'extend' implied dimensions to the left rather than to broadcast),
- and all specified dimensions are available in model.dim_lengths.
If you agree with this proposal, I can add a test case and create a PR.
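For concreteness, the test could assert the fixed behavior, something like this (a sketch, reusing the coords from the top of the issue):

coords = {"dim1": pd.RangeIndex(10), "dim2": pd.RangeIndex(7)}
with pm.Model(coords=coords) as model:
    # mu broadcasts along dim2, so the shape has to come from dims
    x = pm.Normal("x", mu=np.zeros((10, 1)), dims=("dim1", "dim2"))
# with the proposed fix, shape is inferred from dims
assert x.eval().shape == (10, 7)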
Just to give the big picture, the problem is that we wrongly assumed that dims cannot provide new shape information for dimensions that are already implied by the parameters. So we simply created an RV and then only looked at dims for dimensions that were missing on the left for resizing. However, we need to resize all dimensions because dims may broadcast the dimensions implied by the parameters as well.
This is equally true for resizing from observed, I think.
> [...] the problem is that we wrongly assumed that dims cannot provide new shape information for dimensions that are already implied by the parameters.
this ☝️
The code snippet @bherwerth posted looks fine at first glance, but to systematically fix this we also have to consider the case where there's a mix of already-known and not-yet-known dims.
And ultimately we have to update our decision about what the intended behavior is:
- Should dims take priority over shape?
- Should dims take priority over size?
- What should happen if an already-known dim has a different length than the corresponding entry in shape or size? How do we detect that? Do we detect this at all, or will dims just silently override whatever shape/size may say?
One solution to clean up this behavior mess could be to replace shape, size, and dims with one kwarg typed Sequence[Optional[Union[str, int, Variable]]].
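To illustrate (purely hypothetical, not an existing API), such a kwarg could mix dim names, fixed lengths, and None for "infer from parameters":

# Hypothetical unified kwarg; none of this exists in PyMC today.
x = pm.Normal("x", mu=mu, shape=("dim1", 7))     # dim name + fixed length
y = pm.Normal("y", mu=mu, shape=("dim1", None))  # None = infer from parameters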
IMO we should thoroughly think this through before touching it.
As general advice to everybody: the best way to avoid shape issues (like the one this issue was originally about) is to not rely on "smart" broadcasting, but to do all broadcasts/resizes intentionally.
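For example, instead of relying on dims to broadcast the parameter, one can broadcast it explicitly so the intended shape is unambiguous:

with pm.Model(coords=coords) as model:
    # broadcast mu to the full target shape up front
    mu = np.broadcast_to(np.zeros((10, 1)), (10, 7))
    x = pm.Normal("x", mu=mu, dims=("dim1", "dim2"))  # parameters already imply (10, 7)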
About precedence: size/shape > dims > observed
Because they go from more flexible to more inflexible. Size/shape can be an arbitrary symbolic expression, dims and observed are usually static but can be resized, and observed is a bit more implicit and restricted to likelihoods.
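To illustrate the first point, shape can be an arbitrary symbolic expression that tracks the data (a sketch):

with pm.Model() as model:
    dat = pm.MutableData("dat", np.ones(5))
    # the shape is a symbolic expression following the data length
    x = pm.Normal("x", shape=(dat.shape[0],))
print(x.eval().shape)  # (5,)
# after pm.set_data({"dat": np.ones(8)}), x would have length 8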