PyBaMM icon indicating copy to clipboard operation
PyBaMM copied to clipboard

Decide how to bundle `.pickle` files from COMSOL solution results data

Open agriyakhetarpal opened this issue 1 year ago • 5 comments

.pickle files are binary files and not inherently secure. They therefore should be best included elsewhere in PyBaMM's packaging infrastructure when publishing to common distribution channels like PyPI owing to security reasons in the case where these binaries are modified with malicious intent and make it to the develop branch or a PyPI release where PyBaMM can be downloaded from (binary or non-binary). This means they should neither be included in the source distribution (controlled by MANIFEST.in), nor in the wheels (in the case where the wheel is built from the sdist). Binary platform-specific wheels for PyBaMM cannot be built from sdists right now and have to be built in-tree (#3651, #3564), and require restructuring the build system configuration at least, if not rewriting what files are copied into the package structure – since pybamm/input/comsol_results/ is a namespace package (via setuptools.command.install_data).

We have the following options:

  1. Store in JSON, CSV, or TXT file formats (non-binary) instead of .pickle
  2. Use pooch to reproducibly download these files when needed and adjust PyBaMM's public API plus example notebooks for this, while moving these files to a separate repository in the pybamm-team organisation, and marking the repository as an archive to prevent tampering.
  3. Use non-binary storage solutions from xarray or zarr because they can handle multi-dimensional data

Out of these, point 2 requires an internet connection and those wishing to load COMSOL results from these files would need access to the internet to download them, and therefore should be the last option to consider.

agriyakhetarpal avatar Apr 20 '24 18:04 agriyakhetarpal

NumPy arrays in the values of the dict can't be serialised to JSON directly. It might be worth converting these arrays to lists to be able to save them or just use pandas, which is already an optional dependency for PyBaMM, to convert to JSON and read it afterwards where needed

agriyakhetarpal avatar Apr 21 '24 08:04 agriyakhetarpal

@agriyakhetarpal Why is this a release blocker?

kratman avatar Apr 26 '24 14:04 kratman

Ah, I added this in error – most likely it was because I was selecting multiple issues, my apologies. It's something to look into soon enough, but by no means a release blocker. I removed the label.

agriyakhetarpal avatar Apr 26 '24 14:04 agriyakhetarpal

I think the first option would be better, JSON and CSV files would provide a better structure than txt files in my opinion. I could try this out if it's not being carried out by someone else.

santacodes avatar Apr 26 '24 17:04 santacodes

Thanks, assigned you. You'll need to take care of the fact that if you convert NumPy arrays to some other format without a compatibility-provider library and convert them back again, you could potentially lose data. You'll also need to take care of things like floating-point precision (most of the elements in that have decimal values).

agriyakhetarpal avatar Apr 26 '24 17:04 agriyakhetarpal

This is now a legible release blocker, added the label. The reason is that the JSON files are cumulatively summing up to 26.47 MB (https://github.com/pybamm-team/pybamm-data/releases/tag/v1.0.0), which is going to make the sdist and wheels bloated with very large files that are cumulatively greater in size than that of the Python extension module we ship inside the wheel. We will need to move to pooch + add relevant classes/functions to download them quite soon, and we will need to do this before we do the v24.5rc0 pre-release (cc: @santacodes).

The size of the sdist should not be larger than a few MBs at most, and while there is no guideline as such for sdist sizes, it is generally acceptable to keep just minimal files in them so that they are smaller than the wheels, and also to provide a reasonably sized download for users (both wheels and sdist). For context, our Windows wheels are currently ~8 MB, macOS ones are ~12 MB, and those for Linux are ~22 MB in size. This would not have been a blocker if the wheels did not include the files but the sdist did – both of them will, at this time.

agriyakhetarpal avatar May 13 '24 12:05 agriyakhetarpal