Revise the version of Pandas being targeted

Open TinyMarsh opened this issue 2 years ago • 7 comments

While experimenting with extending the CI workflow (#119) to test on a matrix of Python versions (3.8 to 3.11) and platforms (Linux, Windows and MacOS), tests failed for Python versions 3.10 and 3.11 on all platforms.

The cause of failure in each of these combinations was the targeting of pandas<=1.3. As per the documentation, Pandas 1.3 does not support Python versions 3.10+. The minimum version of Pandas which supports up to Python 3.11 is Pandas 1.5.

Suggest investigating whether MUSE's dependency on pandas<=1.3 is still relevant and, if possible, switching to targeting pandas>=1.5.
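To make the scope of the failure concrete, here is a minimal sketch that enumerates the CI matrix described above and flags the jobs affected by the pin. It only encodes the claim made in this issue (that pandas<=1.3 cannot be used on Python 3.10+); the helper names `parse` and `pin_blocks` are hypothetical and not part of MUSE or the CI workflow.

```python
from itertools import product

# Matrix taken from the issue text: 4 Python versions x 3 platforms.
PYTHON_VERSIONS = ["3.8", "3.9", "3.10", "3.11"]
PLATFORMS = ["Linux", "Windows", "MacOS"]


def parse(version):
    """Turn '3.10' into (3, 10) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def pin_blocks(python_version, first_unsupported="3.10"):
    """Return True if, per this issue, pandas<=1.3 is unusable on this Python.

    Assumption: the cut-off of Python 3.10 comes from the issue text,
    not from querying any package index.
    """
    return parse(python_version) >= parse(first_unsupported)


matrix = list(product(PYTHON_VERSIONS, PLATFORMS))
failing = [(py, os_name) for py, os_name in matrix if pin_blocks(py)]

print(f"{len(failing)} of {len(matrix)} jobs affected")
for py, os_name in failing:
    print(f"  Python {py} on {os_name}")
```

Comparing parsed tuples rather than raw strings matters here: as text, "3.10" sorts before "3.9", which would wrongly mark Python 3.10 as supported.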

TinyMarsh avatar May 19 '23 13:05 TinyMarsh

I've been investigating the removal of the pinning of some of the dependencies, in particular, pandas, to try to support Python 3.10 and 3.11. These are my findings:

  • It seems pandas version can be increased to <1.5 without any issues - all tests pass in Python 3.8 and 3.9.
  • In more modern versions of pandas (>=1.5), using iteritems is deprecated, which causes an error in xarray==2022.3.0. There might be other issues, but that's the one that pops up first and makes things fail.
  • In Python 3.10, while keeping pandas==1.4.4, xarray==2022.3.0 fails on its own, too, due to a change in importlib. The fix was implemented in xarray==2022.6.0
  • If we increase the xarray version to >=2022.6.0, everything breaks down in MUSE due to a massive refactoring of how xarray works. As a result, pretty much all in-place modifications of DataArrays and Datasets affecting the indexes are no longer possible. MUSE relies heavily on these in-place modifications. As an example of a change that would be needed, the following type of assignment:
finest.coords["finest_timeslice"] = index

will need to become:

finest = finest.drop_vars({"month", "finest_timeslice", "day", "hour"}).assign_coords(finest_timeslice=index)

as finest cannot have its indexing coordinate finest_timeslice modified in-place.

In summary, even though we can increase the pandas version to 1.4.4, it doesn't help MUSE support more modern Python versions because the xarray refactoring then becomes the limiting factor.

Adapting MUSE to use the newest versions of xarray and pandas is essential to ensure the sustainability of the tool. However, it will also be a major undertaking, lasting several weeks of dedicated (and painful) effort, and it will result in the modification of large portions of the codebase.

@sgiarols @ahawkes , up to you how you want to proceed.

dalonsoa avatar Jun 02 '23 13:06 dalonsoa

Thanks. I often regret that we moved to xarray, rather than just sticking to pandas, as xarray constantly undergoes massive updates. I think the tool should be sustained in the long term, and an upgrade to the newest packages seems crucial at this stage. I would leave the final considerations to @ahawkes, as the refactoring time, if we decide to go for it, may require a conversation on the engagement plans.

sgiarols avatar Jun 03 '23 10:06 sgiarols

I think xarray is the right tool for the job, but it is true that we picked it when it was in a very early state of development and it was not stable. That was possibly an error, indeed. The update in 2022.6.0 has annoyed a lot of people, but I think it has made the tool more mature and things should be more stable from now on.

Thinking about the future, I think it is important not to kick the can down the road: when incompatibilities arise with a new version of any package, we should fix them rather than pin to an older version, because pinning only makes the tool accumulate technical debt until it explodes.

Anyway, let's think about it and maybe we can have a chat about how best to proceed.

dalonsoa avatar Jun 03 '23 10:06 dalonsoa

I've already started to work on this. I've fixed a couple of issues, but I'm stuck on another. I've posted a question on Stack Overflow in case someone can point me in the right direction: https://stackoverflow.com/questions/76471238/why-i-cannot-add-a-dataarray-to-an-existing-dataset-with-a-multiindex

dalonsoa avatar Jun 14 '23 08:06 dalonsoa

@sgiarols , it seems I might have hit a bug in 'xarray'. See https://github.com/pydata/xarray/issues/7921 .

I'm not sure if I will be able to help fix it, but I'll keep an eye on it.

If this really is a bug and it gets fixed, then I have the feeling our work refactoring 'muse' to work with the latest versions of 'xarray' and 'pandas' will be much easier.

In the meantime, I'll work on other stuff.

dalonsoa avatar Jun 15 '23 16:06 dalonsoa

Following up on this, the minimum working example that I described in https://github.com/pydata/xarray/issues/7921, which was meant to reproduce our problem, is working fine. It is unclear why it is not working for us. I suspect the culprit is in the timeslices, which are a really complex structure and might have been created in a way that is not fully compatible with the current way of doing things. I'm trying to get to a minimum working example that works with our structure.

dalonsoa avatar Jun 29 '23 06:06 dalonsoa

@sgiarols I'm going to park this, as it has been identified as a proper bug and added to the list of index-related bugs to be sorted out (see https://github.com/pydata/xarray/projects/1#card-89778835). Until that is sorted out, there's really not much point in us trying to make MUSE work with modern versions of pandas and xarray. I will move on to other stuff.

dalonsoa avatar Jul 12 '23 11:07 dalonsoa