pymc
pymc copied to clipboard
Infer dims & coords from `xarray` variables passed to `pm.Data`
Description
The internal function pm.data.determine_coords is called when pm.Data(..., export_index_as_coords=True) is invoked.
Right now it can only infer coords from pd.Series, or pd.DataFrame, but it would be nice (and actually quite simple) to do the same for xarray.DataArray variables.
After all, the pm.Data variables become xarray.DataArray again, when they're stored in the InferenceData.
Deliverables
- Inferring
dimsfrom named pandas indexes - Inferring
dimsfrom DataArrays - Inferring
coordsfrom DataArrays - A unit test
- Possibly a (backwards compatible with deprecation warning) rename of the
export_index_as_coordstoinfer_dims_and_coords.
dim_name = value.index.name already gets the dimension names from index names, so would it suffice to just equate that to dims before returning, because this dims value name is already there in the coords that are being returned? (I might be missing stuff I've been away a while)
edit: this comment is only for the df part of the function
I suppose you're referring to these lines, which I recently edited in #5763 exactly to prepare for this issue.
https://github.com/pymc-devs/pymc/blob/b5a5b569779673914c9420c1cc0135b118505ff5/pymc/data.py#L511-L515
But yes, you captured the gist of it.
We'll just have to remember to not override non-None elements in the user-provided dims.
The xarray support will be just another if block with hasattr or isinstance checks.
is this still open? I'd like to take a look at this
is this still open? I'd like to take a look at this
Yes, feel free to take over
@michaelosthege , could you briefly explain the naming convention here? Because in data.py it seems like we have a bunch of classes, named conventionally with CamelCase, but then we have standalone methods named with nouns, that initialize the "Data" objects. I know it's not meant to be OO but I can't seem to get a hold of which style it goes for
Yes, so the pm.Data container was introduced with CamelCase naming to make it look like other PyMC model variables/distributions like pm.Normal.
pm.ConstantData and pm.MutableData reproduced that.
This might also be a relict of v3 designs that actually had a "container" object.
Technically, since v4 the pm.Normal("n") is also not a pm.Normal object:
>>> import pymc as pm
>>> with pm.Model():
... n = pm.Normal("n")
...
>>> isinstance(n, pm.Normal)
False
>>> type(n).__mro__
(<class 'pytensor.tensor.var.TensorVariable'>, <class 'pytensor.tensor.var._tensor_py_operators'>,
<class 'pytensor.graph.basic.Variable'>, <class 'pytensor.graph.basic.Node'>,
<class 'pytensor.graph.utils.MetaObject'>, <class 'typing.Generic'>,
<class 'object'>)
>>>
This is because of some pm.Distribution.__new__ magic that I don't understand.
There's a still open discussion about refactoring this to actual functions (like pm.Data), but the naming of the user-facing API will probably remain CamelCase for the foreseable future.
Thanks! Also since the data.py has changed quite a lot since last updates on this issue I'll just re-write the whole functionality, from what I can see we're down to 200 lines of code less than what it was back then, and I guess it's just easier than adapting the legacy version of the file