serialization of file-like objects
Problem description
I'd be curious to get opinions on whether serialization/deserialization should be supported for the file-like objects at the core of this library. This would be useful for distributed processing workflows that pass around either the file-like objects themselves or objects constructed from them as arguments - the latter being the case for xarray, which is the use case I'm specifically interested in. Obviously, if xarray datasets are holding onto file-like objects that are not serializable, then the datasets are not serializable either.
Steps/code to reproduce the problem
- the file-like object itself
import pickle
import smart_open
http_file = smart_open.open('http://example.com/index.html')
pickle.dumps(http_file)
The above throws NotImplementedError: object proxy must define __reduce_ex__()
- the file-like object blowing up downstream object serialization
import pickle
import smart_open
import xarray as xr
netcdf_path = "https://some/netcdf/path.nc"
sf = smart_open.open(netcdf_path, 'rb')
ds = xr.open_dataset(sf)
pickle.dumps(ds)
This one throws TypeError: cannot pickle '_io.BufferedReader' object
Versions
macOS-14.4.1-arm64-arm-64bit Python 3.11.9 (main, May 22 2024, 12:34:58) [Clang 15.0.0 (clang-1500.3.9.4)] smart_open 7.0.4
Checklist
Before you create the issue, please make sure you have:
- [x] Described the problem clearly
- [x] Provided a minimal reproducible example, including any required data
- [x] Provided the version numbers of the relevant software
I don't think serializing streams is even theoretically possible in general. Or rather, where it is possible, it is the business of the file-like object itself to support Python's pickle protocol, serializing its internal stream state somehow.
But open to ideas, CC @mpenkov :)
BufferedReader (only used in the smart_open.compression module) is thread-safe (ref),
but thread-safe != fork-safe, so I don't think the io classes are made for multiprocessing.
I would suggest reading into a tempfile (or shared_memory if filesize allows), and sharing the filename/mem-pointer across processes.
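A rough sketch of that suggestion (the materialize/worker helpers here are made-up names, not smart_open API): download the stream once into a temp file and pass only the picklable path to the workers.

import multiprocessing
import tempfile

import smart_open

def materialize(url):
    # Copy the remote stream into a local temp file; the resulting path
    # is plain picklable data that can be shared across processes.
    tmp = tempfile.NamedTemporaryFile(delete=False)
    with smart_open.open(url, 'rb') as src:
        tmp.write(src.read())
    tmp.close()
    return tmp.name

def worker(path):
    # Each process reopens the local copy independently.
    with open(path, 'rb') as f:
        return len(f.read())

if __name__ == '__main__':
    path = materialize('http://example.com/index.html')
    with multiprocessing.Pool(2) as pool:
        print(pool.map(worker, [path, path]))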
Good points, to be sure. I'm not proposing storing the bytes so much as passing around the file-like objects as references (perhaps keeping seek information, but not even necessarily). This would enable the things opened and then potentially passed to xarray to be moved between machines inside Dask/Spark/etc. clusters nicely. Obviously this wouldn't work for disk-local file access, but for cloud providers, things online, etc., serializing the appropriate configs should be sufficient to reconstruct the file-like objects on the other side and then seek into and read byte ranges or what have you.
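To make that concrete, a minimal sketch (the dict layout here is purely illustrative, not anything smart_open provides today): all that actually needs to cross the wire is the URI, any transport parameters, and optionally the current offset, which are plain picklable data.

import pickle

import smart_open

# A plain, picklable description of how to reopen the stream elsewhere.
spec = {
    'uri': 'http://example.com/index.html',
    'mode': 'rb',
    'transport_params': {},  # e.g. credentials/session config for cloud backends
    'position': 0,
}
payload = pickle.dumps(spec)

# On the receiving machine: rebuild the file-like object and restore the offset.
spec = pickle.loads(payload)
fileobj = smart_open.open(spec['uri'], spec['mode'], transport_params=spec['transport_params'])
fileobj.seek(spec['position'])  # assumes the backend supports seeking (e.g. HTTP range requests)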
you could try serialising with dill. afaik dask uses/used it. maybe you can adopt it in xarray?
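For concreteness, the attempt would be something like this (reusing the URL from the repro above):

import dill
import smart_open

http_file = smart_open.open('http://example.com/index.html')
dill.dumps(http_file)  # whether this works depends on the underlying object; see the reply below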
For sure, dill can solve the issue in some instances, but it doesn't seem to work in this case either. I was thinking it might be possible to define the ser/de behavior manually via this pair of magic methods (the properties here are just examples; they would likely be specific to each backend):
def __getstate__(self):
    # Called when pickling: keep only what is needed to reopen the stream
    return {'url': self.url, 'position': self._position}

def __setstate__(self, state):
    # Called when unpickling: reopen the stream and restore the offset
    self.__init__(state['url'])
    self.seek(state['position'])
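Pulled together into a self-contained wrapper (purely hypothetical, and assuming the backend supports seeking), the round trip would look something like:

import pickle

import smart_open

class PicklableRemoteFile:
    # Hypothetical sketch; not part of smart_open.

    def __init__(self, url, mode='rb'):
        self.url = url
        self.mode = mode
        self._fp = smart_open.open(url, mode)

    def read(self, size=-1):
        return self._fp.read(size)

    def seek(self, pos, whence=0):
        return self._fp.seek(pos, whence)

    def tell(self):
        return self._fp.tell()

    def __getstate__(self):
        # Serialize the recipe for reopening the stream, not the stream itself.
        return {'url': self.url, 'mode': self.mode, 'position': self._fp.tell()}

    def __setstate__(self, state):
        self.__init__(state['url'], state['mode'])
        self.seek(state['position'])

# Round trip: the clone reopens the URL and seeks back to the saved offset.
f = PicklableRemoteFile('http://example.com/index.html')
f.read(100)
clone = pickle.loads(pickle.dumps(f))
assert clone.tell() == f.tell()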
Closing as out-of-scope.