pymc icon indicating copy to clipboard operation
pymc copied to clipboard

Infer dims & coords from `xarray` variables passed to `pm.Data`

Open michaelosthege opened this issue 3 years ago • 2 comments

Description

The internal function pm.data.determine_coords is called when pm.Data(..., export_index_as_coords=True) is invoked.

Right now it can only infer coords from pd.Series, or pd.DataFrame, but it would be nice (and actually quite simple) to do the same for xarray.DataArray variables.

After all, the pm.Data variables become xarray.DataArray again, when they're stored in the InferenceData.

Deliverables

  • Inferring dims from named pandas indexes
  • Inferring dims from DataArrays
  • Inferring coords from DataArrays
  • A unit test
  • Possibly a (backwards compatible with deprecation warning) rename of the export_index_as_coords to infer_dims_and_coords.

michaelosthege avatar May 21 '22 19:05 michaelosthege

dim_name = value.index.name already gets the dimension names from index names, so would it suffice to just equate that to dims before returning, because this dims value name is already there in the coords that are being returned? (I might be missing stuff I've been away a while)

edit: this comment is only for the df part of the function

mjhajharia avatar May 23 '22 16:05 mjhajharia

I suppose you're referring to these lines, which I recently edited in #5763 exactly to prepare for this issue.

https://github.com/pymc-devs/pymc/blob/b5a5b569779673914c9420c1cc0135b118505ff5/pymc/data.py#L511-L515

But yes, you captured the gist of it.

We'll just have to remember to not override non-None elements in the user-provided dims.

The xarray support will be just another if block with hasattr or isinstance checks.

michaelosthege avatar May 23 '22 16:05 michaelosthege

is this still open? I'd like to take a look at this

michaelraczycki avatar Feb 08 '23 10:02 michaelraczycki

is this still open? I'd like to take a look at this

Yes, feel free to take over

michaelosthege avatar Feb 08 '23 12:02 michaelosthege

@michaelosthege , could you briefly explain the naming convention here? Because in data.py it seems like we have a bunch of classes, named conventionally with CamelCase, but then we have standalone methods named with nouns, that initialize the "Data" objects. I know it's not meant to be OO but I can't seem to get a hold of which style it goes for

michaelraczycki avatar Feb 10 '23 11:02 michaelraczycki

Yes, so the pm.Data container was introduced with CamelCase naming to make it look like other PyMC model variables/distributions like pm.Normal. pm.ConstantData and pm.MutableData reproduced that. This might also be a relict of v3 designs that actually had a "container" object.

Technically, since v4 the pm.Normal("n") is also not a pm.Normal object:

>>> import pymc as pm
>>> with pm.Model():
...     n = pm.Normal("n")
...
>>> isinstance(n, pm.Normal)
False
>>> type(n).__mro__
(<class 'pytensor.tensor.var.TensorVariable'>, <class 'pytensor.tensor.var._tensor_py_operators'>,
<class 'pytensor.graph.basic.Variable'>, <class 'pytensor.graph.basic.Node'>,
<class 'pytensor.graph.utils.MetaObject'>, <class 'typing.Generic'>,
<class 'object'>)
>>>

This is because of some pm.Distribution.__new__ magic that I don't understand.

There's a still open discussion about refactoring this to actual functions (like pm.Data), but the naming of the user-facing API will probably remain CamelCase for the foreseable future.

michaelosthege avatar Feb 10 '23 12:02 michaelosthege

Thanks! Also since the data.py has changed quite a lot since last updates on this issue I'll just re-write the whole functionality, from what I can see we're down to 200 lines of code less than what it was back then, and I guess it's just easier than adapting the legacy version of the file

michaelraczycki avatar Feb 10 '23 12:02 michaelraczycki