xarray
xarray copied to clipboard
User-guide - pandas : Add alternative to xarray.Dataset.from_dataframe
This PR follow the issue #9015 as proposed by @max-sixty.
I added an additional section in the pandas.rst file to provide a third-party pandas interface alternative that is lossless and reversible.
The main contribution is the ability to find the multidimensional structure hidden by the tabular structure.
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient. If you have questions, some answers may be found in our contributing guidelines.
Overall I think this looks reasonable! I don't have strong views on the substance, others should comment if they have views...
We'll need to add ntv_pandas to the docs dependencies for the docs tests to pass.
@max-sixty
Thanks for the answer to the question I was going to ask!
Would it also be useful to add ntv_pandas in the ecosystem.rst file (for example in the 'Extend xarray capabilities' category) ?
Would it also be useful to add
ntv_pandasin the ecosystem.rst file (for example in the 'Extend xarray capabilities' category) ?
Yes, please go ahead.
Error Read the Docs :
ModuleNotFoundError Traceback (most recent call last)
Cell In[11], line 1
----> 1 import ntv_pandas as npd
ModuleNotFoundError: No module named 'ntv_pandas'
<<<-------------------------------------------------------------------------
Is it because the package name (ntv-pandas) is different from the module name (ntv_pandas) ?
ModuleNotFoundError: No module named 'ntv_pandas'
Is it available on conda? Otherwise if only on pip then place it at the bottom in the pip section
Now you get a syntax error because arr[*idx] is only available in python 3.11 while our docs are in python 3.10 (and your ntv-numpy package is python 3.9+ - you should add tests for your minimal python version and dependencies).
https://github.com/loco-philippe/ntv-numpy/blob/4b57b8cc1bfab749c01ddf7edbc38a9ef53623df/ntv_numpy/xconnector.py#L268-L269
Checking again - you absolutely need to start testing your packages continuously. I am reluctant to 'endorse' a package that does not have a CI pipeline. Let us know if you need support with that.
@mathause
This bug is clearly unacceptable (I'm ashamed) !!
This demonstrates that I now need to take the time to have a robust CI process. I will first use GitHub Actions to have a build and test pipeline (if you have minimum requirements to respect to endorse packages in Xarray or other advice, I'm interested!)
For the current PR, two solutions:
- solution 1: I correct the identified bug and I check that all the tests are validated on each of the python versions before making a new commit,
- solution 2: I integrate the above actions into a CI pipeline before making a new commit
It seems to me that solution 2 is preferable (unless you want to go quickly and in which case I will follow solution 1).
No worries - I made a similar mistake before (CI pipeline not installing the correct python version/ environment). We have some guidelines for projects wanting to move to the xarray-contrib organisation, see https://github.com/xarray-contrib/xarray-contrib (not exactly the same but close enough).
For this PR I suggest to convert it to draft mode and we can revisit once you fix the issues. For the GHA pipelines - let me know (it depends a bit if you use conda or pip to install your packages).
I wonder if, instead of describing code from a third-party package, we should add a "see also" or similar to the section? I.e. more prominent than an entry in the ecosystem page, but without putting actual code into the documentation. That way, we don't need to change anything if the third-party package changes (reducing our maintenance burden), and we also don't need to install even more things into the docs environment.
(To be clear, in posting this I'm trying to find general guidelines for what to include in the docs of xarray – it's definitely important to have this somewhere, my question is just where to put it, and how to make it easy to find)
@mathause
For the GHA pipeline, I did some tests on the tabular structure analysis package (with pip) and it seems to work well (ci.yml):
- tests are ok with 3.11,
- with 3.10, I find the problem arr[*idx]
- in 3.9 I also have 'match / case'
Initially, I will implement this simple solution (if I do not encounter dependency problems) with possibly some adaptations but my objective is then to move on to conda (are there any particular points to look at for that ?)
I will then look to integrate the packaging.
@keewis
I understand the difficulty of both including third-party packages and reducing the maintenance burden.
I will therefore adapt to the choices you make on this subject.
The issue is fixed (new version of the ntv_numpy package) and the CI GHA pipeline integrates the tests with python versions 3.10 and 3.11.
I will then add xarray accessors as defined and build a new version.
Cool great to hear!
Thinking about this - I would second @keewis idea. Featuring a prominent section without code would give it visibility, limit maintenance burden, and keep the docs environment smaller. (I maintain another smaller package where a third party package is featured in the docs. While I find it super cool that someone created an extension, it has caused above issues for me.)
Maybe this could be something along the lines:
Lossless and reversible conversion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The previous example shows that the conversion is not reversible (lossy roundtrip) and
that the size of the ``datasets`` increases. To avoid these problems, the third-party
`ntv-pandas`__ library offers lossless and reversible conversions between
``Dataset``/ ``DataArray`` and pandas ``DataFrame`` objects.
__ https://github.com/loco-philippe/ntv-pandas
If you have not done so yet, you can showcase the examples in the docs of ntv-pandas.
The proposal to have Xarray documentation without code linked to a third party package seems logical to me.
So I modified the PR taking into account @mathause's proposal (with some modifications).
I also added the Xarray accessors in the ntv_numpy package. We now have symmetric methods Dataset.nxr.to_dataframe and DataFrame.npd.to_xarray).
Another question: I haven't found any other tools or methodologies that analyze the structure of a tabular data to extract the multidimensional structure. Do you know any?
Thank you @max-sixty and @mathause for your approval of this PR.
I finally added another example (use case) which I think should address new needs (but I don't know Xarray well enough and I'm interested in your opinion).
Note: JSON, Pandas and Xarray interfaces are built on a neutral format as defined in this notebook. I think this structure is consistent with your roadmap (I'm also interested in your opinion).
close/ open to trigger the tests again
No idea why this does not work, but I don't see this could have something to do with the PR itself. I'll merge manually.
Thanks for your PR and willingness to adopt to our changes!
Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! 
Thanks also for spending time on this PR.
The changes were useful and relevant, so it was normal to take them into account !