xarray User-guide - pandas : Add alternative to xarray.Dataset.from

trafficstars

This PR follow the issue #9015 as proposed by @max-sixty.

I added an additional section in the pandas.rst file to provide a third-party pandas interface alternative that is lossless and reversible.

The main contribution is the ability to find the multidimensional structure hidden by the tabular structure.

May 10 '24 08:05 loco-philippe

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient. If you have questions, some answers may be found in our contributing guidelines.

May 10 '24 08:05 welcome[bot]

Overall I think this looks reasonable! I don't have strong views on the substance, others should comment if they have views...

We'll need to add ntv_pandas to the docs dependencies for the docs tests to pass.

May 10 '24 16:05 max-sixty

@max-sixty

Thanks for the answer to the question I was going to ask!

Would it also be useful to add ntv_pandas in the ecosystem.rst file (for example in the 'Extend xarray capabilities' category) ?

May 10 '24 20:05 loco-philippe

Would it also be useful to add ntv_pandas in the ecosystem.rst file (for example in the 'Extend xarray capabilities' category) ?

Yes, please go ahead.

May 10 '24 20:05 mathause

Error Read the Docs :

ModuleNotFoundError                       Traceback (most recent call last)
Cell In[11], line 1
----> 1 import ntv_pandas as npd

ModuleNotFoundError: No module named 'ntv_pandas'

<<<-------------------------------------------------------------------------

Is it because the package name (ntv-pandas) is different from the module name (ntv_pandas) ?

May 13 '24 22:05 loco-philippe

ModuleNotFoundError: No module named 'ntv_pandas'

Is it available on conda? Otherwise if only on pip then place it at the bottom in the pip section

May 13 '24 23:05 max-sixty

Now you get a syntax error because arr[*idx] is only available in python 3.11 while our docs are in python 3.10 (and your ntv-numpy package is python 3.9+ - you should add tests for your minimal python version and dependencies).

https://github.com/loco-philippe/ntv-numpy/blob/4b57b8cc1bfab749c01ddf7edbc38a9ef53623df/ntv_numpy/xconnector.py#L268-L269

May 14 '24 09:05 mathause

Checking again - you absolutely need to start testing your packages continuously. I am reluctant to 'endorse' a package that does not have a CI pipeline. Let us know if you need support with that.

May 14 '24 09:05 mathause

@mathause

This bug is clearly unacceptable (I'm ashamed) !!

This demonstrates that I now need to take the time to have a robust CI process. I will first use GitHub Actions to have a build and test pipeline (if you have minimum requirements to respect to endorse packages in Xarray or other advice, I'm interested!)

For the current PR, two solutions:

solution 1: I correct the identified bug and I check that all the tests are validated on each of the python versions before making a new commit,
solution 2: I integrate the above actions into a CI pipeline before making a new commit

It seems to me that solution 2 is preferable (unless you want to go quickly and in which case I will follow solution 1).

May 14 '24 12:05 loco-philippe

No worries - I made a similar mistake before (CI pipeline not installing the correct python version/ environment). We have some guidelines for projects wanting to move to the xarray-contrib organisation, see https://github.com/xarray-contrib/xarray-contrib (not exactly the same but close enough).

For this PR I suggest to convert it to draft mode and we can revisit once you fix the issues. For the GHA pipelines - let me know (it depends a bit if you use conda or pip to install your packages).

May 14 '24 13:05 mathause

I wonder if, instead of describing code from a third-party package, we should add a "see also" or similar to the section? I.e. more prominent than an entry in the ecosystem page, but without putting actual code into the documentation. That way, we don't need to change anything if the third-party package changes (reducing our maintenance burden), and we also don't need to install even more things into the docs environment.

(To be clear, in posting this I'm trying to find general guidelines for what to include in the docs of xarray – it's definitely important to have this somewhere, my question is just where to put it, and how to make it easy to find)

May 14 '24 16:05 keewis

@mathause

For the GHA pipeline, I did some tests on the tabular structure analysis package (with pip) and it seems to work well (ci.yml):

tests are ok with 3.11,
with 3.10, I find the problem arr[*idx]
in 3.9 I also have 'match / case'

Initially, I will implement this simple solution (if I do not encounter dependency problems) with possibly some adaptations but my objective is then to move on to conda (are there any particular points to look at for that ?)

I will then look to integrate the packaging.

@keewis

I understand the difficulty of both including third-party packages and reducing the maintenance burden.

I will therefore adapt to the choices you make on this subject.

May 14 '24 21:05 loco-philippe

The issue is fixed (new version of the ntv_numpy package) and the CI GHA pipeline integrates the tests with python versions 3.10 and 3.11.

I will then add xarray accessors as defined and build a new version.

May 15 '24 22:05 loco-philippe

Cool great to hear!

Thinking about this - I would second @keewis idea. Featuring a prominent section without code would give it visibility, limit maintenance burden, and keep the docs environment smaller. (I maintain another smaller package where a third party package is featured in the docs. While I find it super cool that someone created an extension, it has caused above issues for me.)

Maybe this could be something along the lines:

Lossless and reversible conversion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The previous example shows that the conversion is not reversible (lossy roundtrip) and
that the size of the ``datasets`` increases. To avoid these problems, the third-party 
`ntv-pandas`__ library offers lossless and reversible conversions between 
``Dataset``/ ``DataArray`` and pandas ``DataFrame`` objects.

__ https://github.com/loco-philippe/ntv-pandas

If you have not done so yet, you can showcase the examples in the docs of ntv-pandas.

May 15 '24 23:05 mathause

The proposal to have Xarray documentation without code linked to a third party package seems logical to me.

So I modified the PR taking into account @mathause's proposal (with some modifications).

I also added the Xarray accessors in the ntv_numpy package. We now have symmetric methods Dataset.nxr.to_dataframe and DataFrame.npd.to_xarray).

Another question: I haven't found any other tools or methodologies that analyze the structure of a tabular data to extract the multidimensional structure. Do you know any?

May 19 '24 22:05 loco-philippe

Thank you @max-sixty and @mathause for your approval of this PR.

I finally added another example (use case) which I think should address new needs (but I don't know Xarray well enough and I'm interested in your opinion).

Note: JSON, Pandas and Xarray interfaces are built on a neutral format as defined in this notebook. I think this structure is consistent with your roadmap (I'm also interested in your opinion).

May 26 '24 20:05 loco-philippe

close/ open to trigger the tests again

May 29 '24 18:05 mathause

No idea why this does not work, but I don't see this could have something to do with the PR itself. I'll merge manually.

Thanks for your PR and willingness to adopt to our changes!

May 30 '24 07:05 mathause

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! celebration gif

May 30 '24 07:05 welcome[bot]

Thanks also for spending time on this PR.

The changes were useful and relevant, so it was normal to take them into account !

May 30 '24 13:05 loco-philippe

xarray
xarray copied to clipboard

User-guide - pandas : Add alternative to xarray.Dataset.from_dataframe

xarray xarray copied to clipboard

User-guide - pandas : Add alternative to xarray.Dataset.from_dataframe

xarray
xarray copied to clipboard