
pickling very large maps fails

Open · jllanfranchi opened this issue on Mar 29, 2017 · 2 comments

Not sure if this is fixable. Pickle seems like a bad way to store really large maps (e.g., HDF5 would make more sense), but it might be a bug...

Alternatively, could we integrate with the .npy binary file format somehow? https://docs.scipy.org/doc/numpy/neps/npy-format.html Do we need to abandon pickle altogether?
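For the histogram arrays themselves, the .npy route is straightforward. A minimal sketch, assuming a map exposes its counts as a plain numpy array (the array here is a hypothetical stand-in):

```python
import numpy as np

# Hypothetical example: a very large histogram array from a map
hist = np.random.rand(500, 500, 100)

# np.save writes a single array to the .npy binary format;
# for plain (non-object) dtypes, no pickling of the array data is involved
np.save("map_hist.npy", hist)

# mmap_mode avoids reading the whole array into memory at once
hist_back = np.load("map_hist.npy", mmap_mode="r")
```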

jllanfranchi · Mar 29 '17 17:03

Other alternatives for this:

  • If we use .npy files, we would need to create a directory per map, with each key becoming a filename and each value the file's contents (a .npy file, .json, or some such). This gets ugly fast when translating a dict into a directory of files.
    • We can use .npz to pack multiple arrays into one file, but that doesn't help with arbitrary Python objects (see the .npz sketch after this list).
  • Google Flatbuffers... but the Python interface looks rather clunky and not well maintained. It's more stuff to install that requires compilation, and there doesn't seem to be an active community of Python users.
  • Apache Arrow... seems to work well with large arrays, can be memory-mapped (not necessary here, but nice), and is zero-copy (and fast) like Flatbuffers, though it is still a nascent project. It is pip-installable, which is nice. Could use the Feather file format, the native format, or Apache Parquet(?)
    • Since we already have serializable_state in many core objects, which produces a dict of simple Python datatypes (plus numpy types), Arrow might be able to handle this as-is (see the Arrow sketch after this list): http://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization
    • The spec is not promised to be stable across versions, so this should not be used for long-term storage. It is fine for caching, though, and if we store the library version alongside the data, files can still be read and interpreted correctly later (though needing different versions of the same lib to read different files gets hairy).
    • EDIT: Apache Arrow uses Google Flatbuffers under the hood for some pieces of its internal representation of data
  • HDF5: this is good and only a little bad. We've used it before; it stores large arrays and can store arbitrary things. It's just a big, bloated library that carries far more complexity than necessary, but it works, is cross-platform, isn't terribly slow, and doesn't produce terribly large files (see the h5py sketch below).
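For the .npz point above, a minimal sketch of packing a flat dict of arrays into one file (the keys and arrays here are hypothetical; anything that isn't an array would still need separate handling, e.g. a JSON sidecar):

```python
import numpy as np

# Hypothetical flat state: array-valued entries only
state = {
    "hist": np.random.rand(40, 20),
    "error_hist": np.random.rand(40, 20),
}

# Each keyword argument becomes one .npy entry inside the .npz archive
np.savez("map_state.npz", **state)

# NpzFile is dict-like and loads entries lazily
with np.load("map_state.npz") as data:
    hist = data["hist"]
```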
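For the Arrow option, the arbitrary-object-serialization API linked above was pyarrow.serialize/deserialize. A minimal sketch, with the caveat that this API was later deprecated and removed from newer pyarrow releases, so it only matches pyarrow versions contemporary with this issue:

```python
import numpy as np
import pyarrow as pa

# Hypothetical serializable_state-style dict of simple types + numpy arrays
state = {"name": "nue_cc", "hist": np.random.rand(40, 20)}

# Serialize to an Arrow buffer (zero-copy friendly for numpy arrays);
# pa.serialize is deprecated/removed in recent pyarrow releases
buf = pa.serialize(state).to_buffer()

# Round-trip back to Python objects
restored = pa.deserialize(buf)
```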
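And for HDF5, a minimal sketch using h5py that recursively mirrors a nested dict (the state layout here is hypothetical) as groups and datasets:

```python
import h5py
import numpy as np

# Hypothetical nested state dict
state = {
    "binning": {"energy": np.logspace(0, 2, 41)},
    "hist": np.random.rand(40, 20),
}

def write_dict(group, d):
    """Recursively mirror a nested dict as HDF5 groups/datasets."""
    for key, val in d.items():
        if isinstance(val, dict):
            write_dict(group.create_group(key), val)
        else:
            group.create_dataset(key, data=val)

with h5py.File("map_state.hdf5", "w") as f:
    write_dict(f, state)
```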

jllanfranchi · Dec 13 '17 16:12

See also https://github.com/icecubeopensource/pisa/issues/26

jllanfranchi · Dec 13 '17 22:12