MUSE_OS
MUSE_OS copied to clipboard
Revise the version of Pandas being targeted
While experimenting with extending the CI workflow (#119) to test on a matrix of python versions (3.8 - 3.11) and platforms (Linux, Windows and MacOS), tests failed for Python versions 3.10 and 3.11 on all platforms.
The cause of failure in each of these combinations was the targeting of pandas<=1.3. As per the documentation, Pandas 1.3 does not support Python versions 3.10+. The minimum version of Pandas which supports up to Python 3.11 is Pandas 1.5.
Suggest investigating if MUSE's dependency of pandas<=1.3 is still relevant and, is possible, switch to targeting pandas>=1.5.
I've been investigating the removal of the pinning of some of the dependencies, in particular, pandas, to try to support Python 3.10 and 3.11. These are my findings:
- It seems
pandasversion can be increased to<1.5without any issues - all tests pass in Python3.8and3.9. - In more modern versions of pandas (
>=1.5), usingiteritemsis deprecated, which causes an error inxarray==2022.3.0. There might be other issues, but that's the one that pops first and makes things fail. - In Python
3.10, while keepingpandas==1.4.4,xarray==2022.3.0fails on its own, too, due to a change inimportlib. The fix was implemented inxarray==2022.6.0 - If we increase
xarrayversion to>=2022.6.0everything breaks down in MUSE due to a massive refactoring of howxarrayworks. As a result, pretty much all in-place modifications ofDataArraysandDatasetsaffecting the indexes are no longer possible. MUSE relies heavily in these in-place modifications. As an example of a change that would be needed, the following type of assignments:
finest.coords["finest_timeslice"] = index
will need to become:
finest = finest.drop_vars({"month", "finest_timeslice", "day", "hour"}).assign_coords(finest_timeslice=index)
as finest cannot have its indexing coordinate finest_timeslice modified in-place.
In summary, even though we can increase pandas version to 1.4.4, it does't help MUSE supporting more modern Python versions because then xarray refactoring becomes the limiting factor.
Adapting MUSE to use the newest versions of xarray and pandas is essential to ensure the sustainability of the tool, however, it will also be a major undertaking lasting several weeks of dedicated (and painful) effort and that will result in the modification of large portions of the codebase.
@sgiarols @ahawkes , up to you how you want to proceed.
Thanks. I often regret that we moved to xarray, rather than just sticking to pandas, as xarray constantly undergoes massive updates. I think we need the tool should be sustained in the long-term and that an upgrade to the newest packages seems crucial at this stage. I would leave to @ahawkes final considerations, as the refactoring time, if we decide to go for it, may require a conversation on the engagement plans.
I think xarray is the right tool for the job, but it is true that we picked it when it was in a very early state of development and it was not stable. That was possibly an error, indeed. The update in 2022.6.0 has annoyed a lot of people, but I think it has made the tool more mature and things should be more stable from now on.
Thinking in the future, I think it is important not to kick the can down the road and when incompatibilities arise with any new version of any package, we fix them rather than pinning to an older version because that only makes the tool to accumulate technical debt until it explodes.
Anyway, let's thing about it and maybe we can have a chat about how to best proceed.
I've already started to work on this. Fixed a couple of issues, but I'm stuck in another. I've posted a question in StackOverflow in case someone can point me in the right direction: https://stackoverflow.com/questions/76471238/why-i-cannot-add-a-dataarray-to-an-existing-dataset-with-a-multiindex
@sgiarols , it seems I might have hit a bug in 'xarray'. See https://github.com/pydata/xarray/issues/7921 .
I'm not sure if I will be able to help fixing it, but I'll keep an eye.
If this is really a bug and is fixed, then I've the feeling our work refactoring 'muse' to work with the latest versions of 'xarray' and 'pandas' will be much easier.
In the meantime, I'll work on other stuff.
Following on this, the minimum working example meant to reproduce our problem that I described in https://github.com/pydata/xarray/issues/7921 is working fine. It is unclear why it is not working for us. I suspect the culprit is in the timeslices, which are a really complex structure and might have been created in a way not fully compatible with the current way of doing things. I'm trying to get to a minimum working example that works with our structure.
@sgiarols I'm going to pack this as it has been identified as a proper bug and added to the list of bugs to be sorted out related to indexes (see https://github.com/pydata/xarray/projects/1#card-89778835). Until that is sorted out, there's really not much point on us trying to make MUSE to work with modern versions of pandas and xarray. I will move to other stuff.