
Enhancement of xarray.Dataset.from_dataframe

Open loco-philippe opened this issue 1 year ago • 5 comments

Is your feature request related to a problem?

The current xarray.Dataset.from_dataframe method converts DataFrame columns corresponding to non-index coordinates into variables as explained in the user-guide.

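This solution is not optimal because it does not recover the structure of the initial data: non-index coordinates come back as data variables broadcast over every dimension. It also inflates the dataset (88 B becomes 152 B in the example below).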

The user-guide example is below:

In [1]: ds = xr.Dataset(
              {"foo": (("x", "y"), np.random.randn(2, 3))},
              coords={
                  "x": [10, 20],
                  "y": ["a", "b", "c"],
                  "along_x": ("x", np.random.randn(2)),
                  "scalar": 123,
              },
         )
         ds
Out[1]: 
<xarray.Dataset> Size: 88B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 8B 10 20
  * y        (y) <U1 12B 'a' 'b' 'c'
    along_x  (x) float64 16B -0.03376 0.8059
    scalar   int32 4B 123
Data variables:
    foo      (x, y) float64 48B -1.811 0.4769 0.1201 -0.0352 -1.61 -1.287

In [2]: df = ds.to_dataframe()
        xr.Dataset.from_dataframe(df)
Out[2]:
<xarray.Dataset> Size: 152B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 8B 10 20
  * y        (y) object 24B 'a' 'b' 'c'
Data variables:
    foo      (x, y) float64 48B -1.811 0.4769 0.1201 -0.0352 -1.61 -1.287
    along_x  (x, y) float64 48B -0.03376 -0.03376 -0.03376 0.8059 0.8059 0.8059
    scalar   (x, y) int32 24B 123 123 123 123 123 123
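
The broadcast shown in Out[2] can be undone by hand when the structure is already known. The following is a minimal sketch using fixed values instead of random ones; it assumes the user knows in advance that along_x depends only on x and that scalar is constant, which is exactly the information the conversion loses.

```python
import numpy as np
import xarray as xr

# Same layout as the user-guide example, with deterministic values.
ds = xr.Dataset(
    {"foo": (("x", "y"), np.arange(6.0).reshape(2, 3))},
    coords={
        "x": [10, 20],
        "y": ["a", "b", "c"],
        "along_x": ("x", [0.5, 1.5]),
        "scalar": 123,
    },
)

round_trip = xr.Dataset.from_dataframe(ds.to_dataframe())

# After the round trip, along_x and scalar are data variables spanning (x, y).
# Undo the broadcast by keeping a single slice along each dimension
# the column does not depend on, then mark both as coordinates again.
restored = round_trip.assign(
    along_x=round_trip["along_x"].isel(y=0, drop=True),
    scalar=round_trip["scalar"].isel(x=0, y=0, drop=True),
).set_coords(["along_x", "scalar"])

print(restored)
```

This only works because the dependencies are supplied manually; the request in this issue is about finding them automatically.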


Describe the solution you'd like

If we analyse the relationships between columns, we can distinguish between data variables, dimension coordinates and non-dimension coordinates.

In the example above, the round-trip conversion with npd returns the original dataset:

In [3]: df.npd.to_xarray()
Out[3]: 
<xarray.Dataset> Size: 88B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 8B 10 20
  * y        (y) <U1 12B 'a' 'b' 'c'
    along_x  (x) float64 16B -0.03376 0.8059
    scalar   int32 4B 123
Data variables:
    foo      (x, y) float64 48B -1.811 0.4769 0.1201 -0.0352 -1.61 -1.287

Note:

  • npd is the ntv_pandas package (present in the pandas ecosystem). This package can convert complex DataFrames (see examples).

Describe alternatives you've considered

Three options are available for obtaining an efficient converter:

  • option 1: keep the current xarray.Dataset.from_dataframe and use the third-party npd solution as the optimized converter,
  • option 2: reuse the analysis package to find dims, coordinates and variables, then modify the xarray.Dataset.from_dataframe method to generate a dataset,
  • option 3: include the analysis functions in the xarray.Dataset.from_dataframe method.

Option 3 seems complex to me. Options 1 and 2 are both feasible.
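
To illustrate what option 2 could look like, here is a rough sketch that builds a dataset from an already computed column analysis. It is not a proposed implementation of from_dataframe: the function name dataset_from_relations is hypothetical, the relations dictionary is copied by hand from the relation_partition() output shown under "Additional context", and the DataFrame uses fixed values.

```python
import pandas as pd
import xarray as xr

# Flat table equivalent to df.reset_index() in the example (fixed values).
df = pd.DataFrame(
    {
        "x": [10, 10, 10, 20, 20, 20],
        "y": ["a", "b", "c", "a", "b", "c"],
        "along_x": [0.5, 0.5, 0.5, 1.5, 1.5, 1.5],
        "scalar": [123] * 6,
        "foo": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)

# Assumed result of the analysis step: the dimensions each column depends on.
relations = {"along_x": ["x"], "scalar": [], "foo": ["x", "y"]}
variables = ["foo"]  # columns to keep as data variables


def dataset_from_relations(df, relations, variables):
    """Build a Dataset in which each column spans only the dims it depends on."""
    data_vars, coords = {}, {}
    for name, deps in relations.items():
        if deps:
            # One value per combination of the dimensions the column depends on.
            values = df.groupby(deps)[name].first().to_xarray()
        else:
            # Constant column: keep a single scalar value.
            values = df[name].iloc[0]
        target = data_vars if name in variables else coords
        target[name] = values
    return xr.Dataset(data_vars=data_vars, coords=coords)


ds = dataset_from_relations(df, relations, variables)
print(ds)
```

The sketch relies on groupby sorting the dimension values, so the original order of x and y is not preserved if the input was not sorted.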

Additional context

The analysis (package tab_analysis) applied to the example above gives the results below:

In [4]: analys = df.reset_index().npd.analysis(distr=True)
        analys.partitions()
Out[4]: [['x', 'y'], ['foo']] # two partitions (dims) are found

In [5]: analys.field_partition() # use the first partition : ['x', 'y']
Out[5]: 
{'primary': ['x', 'y'],
 'secondary': ['along_x'],
 'mixte': [],
 'unique': ['scalar'],
 'variable': ['foo']}

In [6]: analys.relation_partition()
Out[6]: {'x': ['x'], 'y': ['y'], 'along_x': ['x'], 'scalar': [], 'foo': ['x', 'y']}
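
For readers without ntv_pandas installed, the relations in Out[6] can be approximated in plain pandas. The sketch below is not the algorithm used by tab_analysis; the helper minimal_dependencies is a hypothetical name, the dimensions are assumed to be known already (the analysis above discovers them itself), and the data uses fixed values. A column is taken to depend on a set of dimensions when every combination of their values maps to a single value of that column.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame(
    {
        "x": [10, 10, 10, 20, 20, 20],
        "y": ["a", "b", "c", "a", "b", "c"],
        "along_x": [0.5, 0.5, 0.5, 1.5, 1.5, 1.5],
        "scalar": [123] * 6,
        "foo": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)


def minimal_dependencies(df, column, dims):
    """Return the smallest subset of dims that determines the column."""
    if column in dims:
        return [column]
    # A constant column depends on no dimension at all.
    if df[column].nunique(dropna=False) <= 1:
        return []
    # Try subsets of increasing size and stop at the first one that works.
    for size in range(1, len(dims) + 1):
        for subset in combinations(dims, size):
            per_group = df.groupby(list(subset))[column].nunique(dropna=False)
            if (per_group <= 1).all():
                return list(subset)
    return list(dims)


dims = ["x", "y"]
relations = {col: minimal_dependencies(df, col, dims) for col in df.columns}
print(relations)
```

The brute-force search over subsets is exponential in the number of dimensions, which is acceptable for a handful of dimensions but is one reason a dedicated analysis package may be preferable.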

loco-philippe, May 07 '24 21:05

This looks very cool!

I think the first thing we could do is add a link to the library from the documentation, at least from the from_dataframe method...

max-sixty, May 07 '24 22:05

@max-sixty

Thank you, Maximilian, for your quick response!

Yes, it's a good idea. Do you need any additional information for this?

By the way, I'm trying to find out whether other theories of tabular structure analysis exist (see presentation), but I can't find any references. Do you have any contacts or references on this topic?

loco-philippe, May 08 '24 08:05

Yes, it's a good idea. Do you need any additional information for this?

This would be a PR you could make to the docs!

max-sixty, May 08 '24 16:05

OK, that's perfect!

I will prepare a modification of the 'doc/user-guide/pandas.rst' file and then include it in a PR.

Can you confirm that it is not necessary to create a development environment?

loco-philippe, May 08 '24 20:05

Can you confirm that it is not necessary to create a development environment?

No, it shouldn't be required!

max-sixty, May 08 '24 21:05