pisa
pickling very large maps fails
Not sure if this is fixable. Pickle seems like a bad way to store really large maps (e.g. HDF5 would make more sense). But it might be a bug...
Alternatively, could we integrate with the npy binary-file format somehow? https://docs.scipy.org/doc/numpy/neps/npy-format.html Do we need to abandon pickle altogether?
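For reference, a minimal sketch of what the npy route could look like for a single large array. `np.save` and `np.load` are the real NumPy calls; the helper names `save_map` / `load_map`, the file name, and the array shape are made up for illustration and are not part of PISA.

```python
import os
import tempfile

import numpy as np


def save_map(path, hist):
    """Write one array to a .npy file (hypothetical helper, not PISA API)."""
    # allow_pickle=False makes NumPy refuse object arrays instead of
    # silently falling back to pickle.
    np.save(path, hist, allow_pickle=False)


def load_map(path, mmap=True):
    """Read the array back; with mmap=True the data is memory-mapped."""
    return np.load(path, mmap_mode="r" if mmap else None, allow_pickle=False)


# Small stand-in for a really large map; the shape is arbitrary.
hist = np.arange(40 * 40 * 20, dtype=np.float64).reshape(40, 40, 20)
path = os.path.join(tempfile.mkdtemp(), "map_hist.npy")
save_map(path, hist)
loaded = load_map(path)
```

This only covers the array itself. Anything else attached to a map (names, binning, etc.) would still need to go somewhere, which is what the alternatives below are about.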
Other alternatives for this:
- If we use `.npy` files, we will need to create a directory and each key as a filename and the contents be the value, either a `.npy` file or `.json` or somesuch. This gets ugly fast trying to translate a dict into a dir with files.
  - Can use `.npz` for multiple arrays in one file, but this doesn't help for arbitrary Python objects
- Google Flatbuffers... but the Python interface looks rather clunky and not well maintained. More stuff to install that requires compilation. Doesn't seem to be an active community of users in Python.
- Apache Arrow... seems to work well with large arrays, can be memory mapped (not necessary here but nice) and is zero-copy (& fast) like Flatbuffers, though still a nascent project. `pip` installable, which is nice. Can use `feather` file format, or the native format, or Apache Parquet(?)
  - Since we already have `serializable_state` in many core objects which produces a dict of simple Python datatypes (plus numpy types), it seems Arrow might be able to handle this as-is: http://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization
  - Spec is not promised to be stable across versions, so this should not be used for long-term storage; can use for caching, though, and with storage of version, data can be read/interpreted correctly (though this gets hairy to have to have different versions of the same lib to read different files)
  - EDIT: Apache Arrow uses Google Flatbuffers under the hood for some pieces of its internal representation of data
- HDF5: this is good and only a little bad. We've used it before, it stores large arrays and can store arbitrary things. It's just a big, bloated library that carries far more complexity than necessary. But it works, is cross-platform, not terribly slow or terribly large files, etc.
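To make the `.npz` option above a bit more concrete, here is a rough sketch under the assumption that the state is a flat dict whose values are either numpy arrays or plain JSON-compatible data. It keeps everything in one file by storing the non-array part as a JSON string inside the archive. `to_npz`, `from_npz`, the `__meta__` key, and the example field names are all hypothetical; it does not solve the arbitrary-Python-object case.

```python
import json
import os
import tempfile

import numpy as np

META_KEY = "__meta__"  # hypothetical reserved key for the JSON part


def to_npz(path, state):
    """Save a flat dict: ndarray values as arrays, everything else as JSON."""
    if META_KEY in state:
        raise ValueError("key %r is reserved" % META_KEY)
    arrays = {k: v for k, v in state.items() if isinstance(v, np.ndarray)}
    meta = {k: v for k, v in state.items() if not isinstance(v, np.ndarray)}
    # json.dumps raises TypeError for anything that is not plain JSON data,
    # so unsupported objects fail loudly at save time.
    arrays[META_KEY] = np.array(json.dumps(meta))
    np.savez(path, **arrays)


def from_npz(path):
    """Inverse of to_npz; no pickle is involved when loading."""
    with np.load(path, allow_pickle=False) as npz:
        state = json.loads(npz[META_KEY].item())
        for key in npz.files:
            if key != META_KEY:
                state[key] = npz[key]
    return state


# Illustrative state dict; the field names are assumptions, not PISA's schema.
state = {
    "name": "example_map",
    "binning": {"dims": ["energy", "coszen"], "num_bins": [40, 20]},
    "hist": np.ones((40, 20)),
    "error_hist": np.full((40, 20), 0.1),
}
path = os.path.join(tempfile.mkdtemp(), "state.npz")
to_npz(path, state)
restored = from_npz(path)
```

Nested dicts that contain arrays would need flattening first (or one archive per level), which is where this starts to get as ugly as the directory-of-files idea.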
See also https://github.com/icecubeopensource/pisa/issues/26