PyBaMM
Investigate alternatives to `xarray` to handle `ProcessedVariable` computations
Recently, #3892 highlighted that `pandas` was being installed as an implicit required dependency for PyBaMM, because it is a required dependency of one of our required dependencies (`xarray`). `pandas` was otherwise listed as an optional dependency with the `[pandas]` extra and is currently used only for handling CSV files.
This dependence on `xarray` is particularly concerning, because:
- If `pandas` decides to act upon PDEP-10 with v3, it would drastically increase the download size for PyBaMM (`pyarrow` wheels across platforms are 120+ megabytes in size at a minimum).
- This would cause complications if things like Pyodide support are considered, where running PyBaMM in the browser would require excess bandwidth and slow down usage workflows. It would also somewhat affect regular users in Google Colab, where Python virtual environments and dependencies are not saved or cached.
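One way to check which packages a dependency pulls in unconditionally is to read its declared requirements from the installed metadata. The sketch below uses the standard-library `importlib.metadata.requires`; the helper names are made up for illustration, and treating only `extra` markers as "optional" is a simplifying assumption that ignores other environment markers.

```python
import re
from importlib import metadata


def unconditional_requirements(requirement_strings):
    """Return names of requirements that are not gated behind an extra.

    Simplification: a requirement counts as optional only if its marker
    mentions `extra`; other markers (python_version, platform) are ignored.
    """
    names = []
    for req in requirement_strings:
        spec, _, marker = req.partition(";")
        if "extra" in marker:
            continue  # only installed when the user asks for that extra
        # The distribution name is the leading run of name characters.
        match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", spec)
        if match:
            names.append(match.group(1).lower())
    return names


def installed_unconditional_requirements(dist_name):
    """Look up an installed distribution; return [] if it is not installed."""
    try:
        reqs = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return []
    return unconditional_requirements(reqs)
```

In an environment where `xarray` is installed, `"pandas" in installed_unconditional_requirements("xarray")` would indicate whether that installed version declares `pandas` as a hard requirement.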
Prior to the use of `xarray` (see #2366) as a backend for the `ProcessedVariable` and the `ProcessedVariableComputed` classes, the `scipy.interpolate` module was being used, which could be an option to return to.
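To make the `scipy.interpolate` option concrete, here is a minimal sketch of interpolating a variable stored on a (time, space) grid without `xarray`. The grid, the variable values, and the helper name `make_variable_interpolant` are illustrative assumptions, not PyBaMM's actual `ProcessedVariable` implementation.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator


def make_variable_interpolant(t, x, values):
    """Build a callable var(t, x) from values of shape (len(t), len(x)).

    Hypothetical helper; uses linear interpolation and returns NaN
    outside the grid instead of raising.
    """
    interp = RegularGridInterpolator(
        (t, x), values, method="linear", bounds_error=False, fill_value=np.nan
    )

    def evaluate(t_query, x_query):
        # Broadcast the queries to a common shape, then stack into (n, 2) points.
        tq, xq = np.broadcast_arrays(
            np.asarray(t_query, float), np.asarray(x_query, float)
        )
        points = np.stack([tq.ravel(), xq.ravel()], axis=-1)
        return interp(points).reshape(tq.shape)

    return evaluate


# Illustrative data: a variable that is linear in both t and x.
t = np.linspace(0.0, 10.0, 11)
x = np.linspace(0.0, 1.0, 5)
values = 2.0 * t[:, None] + 3.0 * x[None, :]
var = make_variable_interpolant(t, x, values)
```

Calling `var(2.5, 0.4)` evaluates the variable between grid points, and array inputs such as `var([0.0, 10.0], 0.5)` are broadcast against each other, which covers the basic lookup behaviour without depending on `xarray` or `pandas`.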
There is time until `pandas` decides on this and also until we release v24.5, so we can take into account some of the developments around this area as they arise (as discussed in the technical roadmap meeting on 18/03/2024).
What is pyodide being used for if it is an issue?
I have used pyarrow and pandas in a lot of web based apps without issue. Both pandas and pyarrow are pretty common in data science, so I know these get used in web/notebook applications on a regular basis
> What is pyodide being used for if it is an issue?
It's not being used by us currently, but as a part of my work assignment I am extending support for it across a lot of PyData projects and across the Scientific Python ecosystem (please see https://github.com/Quansight-Labs/czi-scientific-python-mgmt/issues/18 and https://github.com/Quansight-Labs/czi-scientific-python-mgmt/issues/19). PyBaMM isn't quite there yet, because we have CasADi as a dependency and it is tricky to compile to WASM. If CasADi becomes optional, we could move things forward on that (see #3826). The best and most stable example of where you can see Pyodide currently is any of the usage examples in the `scikit-learn` documentation, where you can bring up interactive docs via client-side JupyterLite notebooks.
> I have used pyarrow and pandas in a lot of web based apps without issue. Both pandas and pyarrow are pretty common in data science, so I know these get used in web/notebook applications on a regular basis
There's no issue as such if you do so locally for any data science workflows, because the `pyarrow` backend is extremely fast, but (1) those with unstable connections can have issues running such notebooks online, and (2) having a heavy (required) dependency graph in general isn't good for any library (packaging and distribution, for example, are among the affected areas). But this is a smaller part of the picture; some of the responses on https://github.com/pandas-dev/pandas/issues/54466 are quite insightful in this regard.
Yeah, if we are going to drop xarray then using scipy or numpy native features would be good. However, it looks like we use pandas directly in a bunch of files, so it is not just due to xarray. I think if you want to make pandas optional, then you would need to remove pandas from a bunch of places (notebooks, tests, etc.) and not just remove xarray.
Pandas can be useful for analysis and plotting, so we should probably think about whether it is useful on the whole to include it, and check that the extra dependency is actually a concern for our users. Realistically, optional dependencies just make things more complicated. Unless we have fully optional modules, we should try to just remove problematic libraries altogether.
We did have pandas as an optional dependency before #3892, didn't we? I imagine it should not be a lot of work to make it fully optional again with the `import_optional_dependency` wrapper. Or are we using it in a notebook where we haven't installed it in the introductory code cell?
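For readers unfamiliar with the pattern, an optional-import wrapper can be sketched roughly as follows. This is a generic illustration, not PyBaMM's `import_optional_dependency`: the names `optional_import` and `read_csv_rows` and the wording of the error message are assumptions.

```python
import importlib


def optional_import(module_name, extra=None):
    """Import `module_name`, or raise a helpful error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as error:
        hint = f" Install it with the [{extra}] extra." if extra else ""
        raise ModuleNotFoundError(
            f"Optional dependency '{module_name}' is not installed.{hint}"
        ) from error


def read_csv_rows(path):
    """Hypothetical call site: pandas is imported only when actually needed."""
    pd = optional_import("pandas", extra="pandas")
    return pd.read_csv(path)
```

Because the import happens inside the function that needs it, importing the package itself never requires `pandas`; only users who call the CSV helper see the error, along with a hint about which extra to install.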
A lot of the plotting features (for example `matplotlib`) were set as optional so that you were not forced to use them, and could instead use libraries like `holoviz`, `pyvista`, `altair`, `seaborn`, or any others of your choice offering a plotting backend and a graphics module. It is still optional at this time, but in PyBaMM's history before v23.5 it was one of the "truly" optional dependencies (though we didn't have a list of optional dependencies back then).