
serialization of file-like objects

Open moradology opened this issue 1 year ago • 5 comments

Problem description

I'd be curious to get opinions on whether serialization/deserialization should be supported for the file-like objects at the core of this library. This would be useful for distributed processing workflows that pass around either the file-like objects themselves or objects constructed from them - the latter is the case for xarray, which is the use case I'm specifically interested in. Obviously, if xarray datasets hold on to file-like objects that are not serializable, the datasets themselves are not serializable either.

Steps/code to reproduce the problem

  1. Serializing the file-like object itself
import pickle
import smart_open

http_file = smart_open.open('http://example.com/index.html')
pickle.dumps(http_file)

The above throws NotImplementedError: object proxy must define __reduce_ex__()

  2. The file-like object breaking serialization of a downstream object
import pickle
import smart_open
import xarray as xr

netcdf_path = "https://some/netcdf/path.nc"
sf = smart_open.open(netcdf_path, 'rb')
ds = xr.open_dataset(sf)

pickle.dumps(ds)

This one throws TypeError: cannot pickle '_io.BufferedReader' object

Versions

macOS-14.4.1-arm64-arm-64bit Python 3.11.9 (main, May 22 2024, 12:34:58) [Clang 15.0.0 (clang-1500.3.9.4)] smart_open 7.0.4

Checklist

Before you create the issue, please make sure you have:

  • [x] Described the problem clearly
  • [x] Provided a minimal reproducible example, including any required data
  • [x] Provided the version numbers of the relevant software

moradology avatar Jul 30 '24 21:07 moradology

I don't think serializing streams is even theoretically possible in general. Or rather, where it is possible, it is the business of the file-like object itself to support Python's pickle protocol, serializing its internal stream state somehow.

But open to ideas, CC @mpenkov :)

piskvorky avatar Jul 31 '24 06:07 piskvorky

BufferedReader (only used in the smart_open.compression module) is thread-safe (ref), but thread-safe != fork-safe, so I don't think the io classes are made for multiprocessing.

I would suggest reading into a tempfile (or shared_memory if filesize allows), and sharing the filename/mem-pointer across processes.
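A minimal sketch of that tempfile suggestion (the `stage_to_tempfile` helper is illustrative, not part of smart_open): copy the stream's bytes to disk once, then share only the filename, which is a plain string and pickles trivially.

```python
import io
import pickle
import tempfile

def stage_to_tempfile(fileobj):
    # Hypothetical helper: drain a (possibly unpicklable) stream to disk
    # and return its path, which other processes can re-open themselves.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
        tmp.write(fileobj.read())
        return tmp.name

# Stand-in for a smart_open stream; any file-like object works here.
src = io.BytesIO(b"remote bytes")
path = stage_to_tempfile(src)

# The path survives a pickle round-trip, unlike the stream itself.
restored = pickle.loads(pickle.dumps(path))
with open(restored, "rb") as f:
    assert f.read() == b"remote bytes"
```

For data small enough to fit in RAM, `multiprocessing.shared_memory.SharedMemory` plays the same role, with the segment name taking the place of the filename.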

ddelange avatar Jul 31 '24 06:07 ddelange

Good points, to be sure. I'm not proposing storing the bytes so much as passing around the file-like objects as references (perhaps keeping seek information, but not even necessarily). This would enable the things opened and then potentially passed to xarray to move between machines inside Dask/Spark/etc. clusters nicely. Obviously this wouldn't work for disk-local file access, but for cloud providers, things online, etc., serializing the appropriate configs should be sufficient to reconstruct the file-like objects on the other side, which can then seek into and read byte ranges or what have you.

moradology avatar Jul 31 '24 13:07 moradology

You could try serialising with dill. AFAIK Dask uses/used it. Maybe you can adopt it in xarray?

ddelange avatar Jul 31 '24 17:07 ddelange

For sure, dill can solve the issue in some instances, but it doesn't seem to work in this case either. I was thinking it might be possible to manually specify ser/de behavior via this pair of magic methods (example properties here, but they would likely be specific to each backend):

    def __getstate__(self):
        # Called when pickling
        return {'url': self.url, 'position': self._position}

    def __setstate__(self, state):
        # Called when unpickling
        self.__init__(state['url'])
        self.seek(state['position'])

moradology avatar Jul 31 '24 18:07 moradology

Closing as out-of-scope.

mpenkov avatar Dec 17 '24 13:12 mpenkov