xarray
Enhancement of xarray.Dataset.from_dataframe
Is your feature request related to a problem?
The current xarray.Dataset.from_dataframe method turns DataFrame columns that correspond to non-index coordinates into data variables, as explained in the user guide.
This is not optimal: the round trip does not recover the structure of the initial data, and because those columns are broadcast over every dimension, the resulting dataset is larger than the original.
The user-guide example is below:
In [1]: ds = xr.Dataset(
    {"foo": (("x", "y"), np.random.randn(2, 3))},
    coords={
        "x": [10, 20],
        "y": ["a", "b", "c"],
        "along_x": ("x", np.random.randn(2)),
        "scalar": 123,
    },
)
ds
Out[1]:
<xarray.Dataset> Size: 88B
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) int32 8B 10 20
* y (y) <U1 12B 'a' 'b' 'c'
along_x (x) float64 16B -0.03376 0.8059
scalar int32 4B 123
Data variables:
foo (x, y) float64 48B -1.811 0.4769 0.1201 -0.0352 -1.61 -1.287
In [2]: df = ds.to_dataframe()
xr.Dataset.from_dataframe(df)
Out[2]:
<xarray.Dataset> Size: 152B
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) int32 8B 10 20
* y (y) object 24B 'a' 'b' 'c'
Data variables:
foo (x, y) float64 48B -1.811 0.4769 0.1201 -0.0352 -1.61 -1.287
along_x (x, y) float64 48B -0.03376 -0.03376 -0.03376 0.8059 0.8059 0.8059
scalar (x, y) int32 24B 123 123 123 123 123 123
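The size difference between the two reprs above can be checked by hand. The short sketch below only adds up the per-variable byte counts printed in Out[1] and Out[2]; it does not call xarray, and the dictionary names are chosen for illustration.

```python
# Byte counts per variable, copied from the Out[1] and Out[2] reprs above.
original = {"x": 8, "y": 12, "along_x": 16, "scalar": 4, "foo": 48}
round_trip = {"x": 8, "y": 24, "foo": 48, "along_x": 48, "scalar": 24}

total_original = sum(original.values())
total_round_trip = sum(round_trip.values())

# Extra bytes for each variable whose size changed in the round trip.
growth = {
    name: round_trip[name] - original[name]
    for name in original
    if round_trip[name] != original[name]
}

print(total_original, total_round_trip)  # 88 152
print(growth)  # {'y': 12, 'along_x': 32, 'scalar': 20}
```

Most of the growth comes from along_x (2 values broadcast to 2 x 3 = 6) and scalar (1 value broadcast to 6). The rest comes from y changing from a fixed-width string dtype to object.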
Describe the solution you'd like
By analysing the relationships between columns, we can distinguish data variables, dimension coordinates and non-dimension coordinates.
For the example above, the round-trip conversion with npd returns the original dataset:
In [3]: df.npd.to_xarray()
Out[3]:
<xarray.Dataset> Size: 88B
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) int32 8B 10 20
* y (y) <U1 12B 'a' 'b' 'c'
along_x (x) float64 16B -0.03376 0.8059
scalar int32 4B 123
Data variables:
foo (x, y) float64 48B -1.811 0.4769 0.1201 -0.0352 -1.61 -1.287
Note:
npd is the ntv_pandas package (listed in the pandas ecosystem). It can convert complex DataFrames (see examples).
Describe alternatives you've considered
Three options are available for an efficient converter:
- option 1: keep the current xarray.Dataset.from_dataframe and use the npd third-party solution as the optimized converter,
- option 2: reuse the analysis package to find dims, coordinates and variables, then modify the xarray.Dataset.from_dataframe method to generate the dataset,
- option 3: include the analysis functions in the xarray.Dataset.from_dataframe method.
Option 3 seems complex to me. Options 1 and 2 are both feasible.
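To make option 2 more concrete, here is a rough pure-Python sketch of how a converter could use the result of an analysis step. It assumes the analysis has already produced a mapping from each column to the dimensions it depends on. The `compact` helper and the `relations` mapping are hypothetical names for illustration and are not part of xarray or ntv_pandas. It also assumes the rows cover the full grid of dimension labels.

```python
# Long-format rows, as the DataFrame would hold them after reset_index()
# (values copied from the example in this issue).
rows = [
    {"x": 10, "y": "a", "foo": -1.811, "along_x": -0.03376, "scalar": 123},
    {"x": 10, "y": "b", "foo": 0.4769, "along_x": -0.03376, "scalar": 123},
    {"x": 10, "y": "c", "foo": 0.1201, "along_x": -0.03376, "scalar": 123},
    {"x": 20, "y": "a", "foo": -0.0352, "along_x": 0.8059, "scalar": 123},
    {"x": 20, "y": "b", "foo": -1.61, "along_x": 0.8059, "scalar": 123},
    {"x": 20, "y": "c", "foo": -1.287, "along_x": 0.8059, "scalar": 123},
]

# Assumed output of an analysis step: the dimensions each column depends on.
relations = {"x": ["x"], "y": ["y"], "along_x": ["x"], "scalar": [], "foo": ["x", "y"]}


def compact(rows, relations, dims):
    """Hypothetical helper: collapse each column onto the dims it depends on."""
    # Distinct labels of each dimension, in sorted order.
    labels = {d: sorted({row[d] for row in rows}) for d in dims}
    fields = {}
    for name, name_dims in relations.items():
        if name in dims:
            continue  # dimension columns are already described by `labels`
        # One value per distinct combination of the column's own dims.
        lookup = {tuple(row[d] for d in name_dims): row[name] for row in rows}

        def nest(prefix, remaining):
            # Build a nested list with one nesting level per dimension.
            if not remaining:
                return lookup[prefix]
            return [nest(prefix + (label,), remaining[1:]) for label in labels[remaining[0]]]

        fields[name] = (tuple(name_dims), nest((), name_dims))
    return labels, fields


labels, fields = compact(rows, relations, dims=["x", "y"])
print(fields["along_x"])  # (('x',), [-0.03376, 0.8059])
print(fields["scalar"])   # ((), 123)
```

Each entry of `fields` is a (dims, values) pair, the same shape used for "foo" and "along_x" in the Dataset constructor call of the user-guide example, so along_x keeps 2 values and scalar keeps 1 instead of 6 each.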
Additional context
The analysis (package tab_analysis) applied to the example above gives the results below:
In [4]: analys = df.reset_index().npd.analysis(distr=True)
analys.partitions()
Out[4]: [['x', 'y'], ['foo']] # two partitions (dims) are found
In [5]: analys.field_partition() # use the first partition : ['x', 'y']
Out[5]:
{'primary': ['x', 'y'],
'secondary': ['along_x'],
'mixte': [],
'unique': ['scalar'],
'variable': ['foo']}
In [6]: analys.relation_partition()
Out[6]: {'x': ['x'], 'y': ['y'], 'along_x': ['x'], 'scalar': [], 'foo': ['x', 'y']}
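For readers who want an intuition for where a result like Out[6] can come from, the sketch below runs a brute-force functional-dependency check in pure Python. It is not the tab_analysis algorithm. It assumes the dimensions ['x', 'y'] are already known (the first partition above), and `depends_on` is a hypothetical helper named for this illustration.

```python
from itertools import combinations

# Long-format rows of the example (values copied from this issue).
rows = [
    {"x": 10, "y": "a", "foo": -1.811, "along_x": -0.03376, "scalar": 123},
    {"x": 10, "y": "b", "foo": 0.4769, "along_x": -0.03376, "scalar": 123},
    {"x": 10, "y": "c", "foo": 0.1201, "along_x": -0.03376, "scalar": 123},
    {"x": 20, "y": "a", "foo": -0.0352, "along_x": 0.8059, "scalar": 123},
    {"x": 20, "y": "b", "foo": -1.61, "along_x": 0.8059, "scalar": 123},
    {"x": 20, "y": "c", "foo": -1.287, "along_x": 0.8059, "scalar": 123},
]


def depends_on(rows, column, dims):
    """Return the smallest subset of dims whose values determine `column`."""
    for size in range(len(dims) + 1):
        for subset in combinations(dims, size):
            seen = {}
            # The subset determines the column if each key maps to one value only.
            if all(
                seen.setdefault(tuple(row[d] for d in subset), row[column]) == row[column]
                for row in rows
            ):
                return list(subset)
    return None  # not determined by the given dims


dims = ["x", "y"]
relations = {
    col: depends_on(rows, col, dims)
    for col in ["x", "y", "along_x", "scalar", "foo"]
}
print(relations)
# {'x': ['x'], 'y': ['y'], 'along_x': ['x'], 'scalar': [], 'foo': ['x', 'y']}
```

A column that depends on no dimension is a scalar coordinate, one that depends on a strict subset of the dimensions is a non-dimension coordinate, and one that needs all of them behaves like a data variable.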
This looks very cool!
I think the first thing we could do is add a link to the library from the documentation — at least the from_dataframe method...
@max-sixty
Thank you, Maximilian, for your quick response!
Yes, it's a good idea. Do you need any additional information for this?
By the way, I'm looking for other theories of tabular structure analysis (see presentation) but I can't find any references. Do you have any contacts or references on this?
> Yes, it's a good idea. Do you need any additional information for this?
This would be a PR you could make to the docs!
OK, that's perfect!
I will prepare a modification of the 'doc/user-guide/pandas.rst' file and then include it in a PR.
Can you confirm that it is not necessary to create a development environment?
> Can you confirm that it is not necessary to create a development environment?
No, it shouldn't be required!