PyBaMM Investigate alternatives to `xarray` to handle `ProcessedVariable` computations

Open agriyakhetarpal opened this issue 4 months ago • 4 comments

Recently, #3892 highlighted that pandas was being installed as an implicit required dependency for PyBaMM, because it was a required dependency for one of our required dependencies (xarray). pandas was otherwise listed as an optional dependency with the [pandas] extra and is currently used only for handling CSV files.
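One way to check for this kind of implicit requirement is to read a package's declared requirements and discard the ones gated behind an extra. The sketch below is only illustrative: `required_dependencies` and `hard_requires` are hypothetical helper names, and the marker handling is deliberately simplified (it treats any marker that mentions `extra` as optional).

```python
import re
from importlib.metadata import PackageNotFoundError, requires


def required_dependencies(requirement_strings):
    """Return the names of requirements that are not gated behind an extra."""
    names = []
    for req in requirement_strings:
        spec, _, marker = req.partition(";")
        if "extra" in marker:
            continue  # only installed when the extra is requested
        # Keep just the project name, dropping version specifiers and extras.
        name = re.split(r"[\s<>=!~\[(]", spec.strip(), maxsplit=1)[0]
        names.append(name.lower())
    return names


def hard_requires(package, dependency):
    """Return True/False, or None if `package` is not installed here."""
    try:
        declared = requires(package) or []
    except PackageNotFoundError:
        return None
    return dependency.lower() in required_dependencies(declared)


# Illustrative call; the result depends on what is installed locally.
print(hard_requires("xarray", "pandas"))
```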

This dependence on xarray is particularly concerning, because:

- If pandas decides to act upon PDEP-10 with v3, it would drastically increase the download size for PyBaMM (pyarrow wheels across platforms are 120+ megabytes in size at a minimum).
- This would cause complications if things like Pyodide support are considered, where running PyBaMM in the browser would require excess bandwidth and slow down usage workflows. It would also affect regular users somewhat in Google Colab, where Python virtual environments and dependencies are not saved or cached.

Prior to the use of xarray (see #2366) as a backend for the ProcessedVariable and the ProcessedVariableComputed classes, the scipy.interpolate module was being used – which could be an option to return to.
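As a rough idea of what a `scipy.interpolate`-based approach could look like, here is a minimal sketch that interpolates a time-only variable and a space-and-time variable without xarray. The arrays are made-up stand-ins for solution data, and `evaluate_voltage` / `evaluate_concentration` are hypothetical names; this is not how `ProcessedVariable` is (or was) implemented.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator, interp1d

# Made-up stand-ins for solution data (assumptions, not PyBaMM output).
t = np.linspace(0.0, 10.0, 11)      # time points
x = np.linspace(0.0, 1.0, 5)        # spatial points
voltage = 4.0 - 0.1 * t             # 0D variable, shape (11,)
concentration = np.outer(x, t)      # 1D variable, shape (5, 11)

# One interpolant per variable, built once and reused for evaluation.
voltage_interp = interp1d(t, voltage, kind="linear")
conc_interp = RegularGridInterpolator((x, t), concentration, method="linear")


def evaluate_voltage(t_eval):
    """Evaluate the time-only variable at the requested time(s)."""
    return voltage_interp(t_eval)


def evaluate_concentration(x_eval, t_eval):
    """Evaluate the space-and-time variable at paired (x, t) points."""
    points = np.column_stack([np.atleast_1d(x_eval), np.atleast_1d(t_eval)])
    return conc_interp(points)


print(evaluate_voltage(2.5))
print(evaluate_concentration(0.5, 2.5))
```

Both interpolants depend only on NumPy and SciPy, so nothing in this path pulls in pandas.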

There is time until pandas decides on this and also until we release v24.5, so we can take into account some of the developments around this area as they arise (as discussed in the technical roadmap meeting on 18/03/2024).

Mar 20 '24 17:03 agriyakhetarpal

What is pyodide being used for if it is an issue?

I have used pyarrow and pandas in a lot of web-based apps without issue. Both pandas and pyarrow are pretty common in data science, so I know these get used in web/notebook applications on a regular basis.

Mar 20 '24 20:03 kratman

> What is pyodide being used for if it is an issue?

It's not being used by us currently, but as a part of my work assignment I am extending support for it across a lot of PyData projects and across the Scientific Python ecosystem (please see https://github.com/Quansight-Labs/czi-scientific-python-mgmt/issues/18 and https://github.com/Quansight-Labs/czi-scientific-python-mgmt/issues/19). PyBaMM isn't quite there yet because we have CasADi as a dependency, which is tricky to compile to WASM; if it becomes optional, we could move things forward on that (see #3826). The best and most stable example of Pyodide in use right now is any of the usage examples in the scikit-learn documentation, which offer interactive docs via client-side JupyterLite notebooks.

> I have used pyarrow and pandas in a lot of web based apps without issue. Both pandas and pyarrow are pretty common in data science, so I know these get used in web/notebook applications on a regular basis

There's no issue as such if you do so locally for data science workflows, because the pyarrow backend is extremely fast, but 1. those with unstable connections can have issues running such notebooks online, and 2. a heavy (required) dependency graph is generally not good for any library (packaging and distribution being one of the affected areas). But this is a smaller part of the picture; some of the responses on https://github.com/pandas-dev/pandas/issues/54466 are quite insightful in this regard.

Mar 20 '24 20:03 agriyakhetarpal

Yeah, if we are going to drop xarray then using scipy or numpy native features would be good. However, it looks like we use pandas directly in a bunch of files, so it is not just due to xarray. I think if you want to make pandas optional, then you would need to remove pandas from a bunch of places (notebooks, tests, etc.) and not just remove xarray.

Pandas can be useful for analysis and plotting, so we should probably think about whether it is useful on the whole to include it, and make sure its size is actually a concern for our users. Realistically, optional dependencies just make things more complicated. Unless we have fully optional modules, we should try to just remove problematic libraries altogether.

Mar 20 '24 21:03 kratman

We did have pandas as an optional dependency before #3892, didn't we? I imagine it should not be a lot of work to make it fully optional again with the import_optional_dependency wrapper. Or are we using it in a notebook where we haven't installed it in the introductory code cell?
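For context, a lazy-import wrapper of this kind can be sketched roughly as follows. This is a hypothetical stand-in named `import_optional`, not PyBaMM's actual `import_optional_dependency`, whose signature and error message may differ; the `pybamm[...]` install hint is likewise an assumption based on the `[pandas]` extra mentioned earlier in the thread.

```python
import importlib


def import_optional(module_name, extra=None):
    """Import a module on demand, with a clearer error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as error:
        # Assumed hint format; adjust to however the extras are really named.
        hint = f" Install it with: pip install 'pybamm[{extra}]'" if extra else ""
        raise ModuleNotFoundError(
            f"Optional dependency '{module_name}' is not installed.{hint}"
        ) from error


def load_csv(path):
    """Hypothetical caller: pandas is only imported when a CSV is read."""
    pd = import_optional("pandas", extra="pandas")
    return pd.read_csv(path)
```

Because the import happens inside `load_csv`, importing the package itself never requires pandas; only users who read CSV files need it installed.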

A lot of the plotting dependencies (for example matplotlib) were set as optional so that you were not forced to use them, and could instead use libraries like holoviz, pyvista, altair, seaborn, or any others of your choice offering a plotting backend and a graphics module. matplotlib is still optional at this time, but in PyBaMM's history before v23.5 it was one of the "truly" optional dependencies (though we didn't have a list of optional dependencies back then).

Mar 20 '24 23:03 agriyakhetarpal
