Inconsistent behaviour of `get_file` using compression with different filesystems
I am having trouble getting consistent behaviour from `get_file` across filesystems when passing the `compression` parameter. My understanding from the `AbstractFileSystem` implementation of that method is that kwargs are forwarded to the `open` method, but on some filesystems the parameter is silently ignored.
My goal was to fetch files and decompress them on the fly: maybe there is a better suited function for this?
Minimal example:
import fsspec
import bz2
from zipfile import ZipFile
# create data file
filename = "/tmp/important_data.txt.bz2"
data = b"very important data."
with open(filename, "wb") as fd:
    fd.write(bz2.compress(data))
# open with compression
print(fsspec.open(filename, compression="infer").open().read())
# prints "b'very important data.'"
# fetch from local filesystem
fsspec.filesystem("file").get_file(filename, "/tmp/new", compression="infer")
print(open("/tmp/new", "rb").read())
# prints "b'BZh91AY&SY\x85\xf4|P\x00\x00\t\x11\x80@\x01&#\xd5 \x00"\x9e\x93i\x06\xca\x10\x00\x02\xdc\xc6\x0c\xb1\xc2\xbc\xad\x16\xc7\xc5\xdc\x91N\x14$!}\x1f\x14\x00'"
# fetch from ssh filesystem
fsspec.filesystem("ssh", host="localhost").get_file(filename, "/tmp/new", compression="infer")
print(open("/tmp/new", "rb").read())
# prints "b'BZh91AY&SY\x85\xf4|P\x00\x00\t\x11\x80@\x01&#\xd5 \x00"\x9e\x93i\x06\xca\x10\x00\x02\xdc\xc6\x0c\xb1\xc2\xbc\xad\x16\xc7\xc5\xdc\x91N\x14$!}\x1f\x14\x00'"
# fetch from zip filesystem
zfile = filename + ".zip"
with ZipFile(zfile, 'w') as zipf:
    zipf.write(filename)
of = fsspec.open("zip://" + filename + "::file://" + zfile)
of.fs.get_file(filename, "/tmp/new", compression="infer")
print(open("/tmp/new", "rb").read())
# prints "b'very important data.'"
The fallback implementation of `get_file` goes via `open()`, so extra kwargs like `compression` get passed down. However, many filesystem backends have more specialised `get_file` methods to allow better operation, such as parallel downloading. In such cases we are not necessarily streaming the bytes, so on-the-fly decompression would not be possible anyway. I think we should say that only `open()` is guaranteed to layer file-like objects for decompression or text mode.
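For the use case in the question, a minimal sketch of fetching and decompressing in one pass is to go through `fsspec.open()` directly and copy the resulting stream to a local file. The helper name `get_decompressed` is hypothetical and not part of fsspec, and the example only exercises the local filesystem with a temporary directory; for remote URLs the same pattern should apply, with any backend options passed as keyword arguments.

```python
import bz2
import os
import shutil
import tempfile

import fsspec


def get_decompressed(url, local_path, compression="infer", **storage_options):
    """Hypothetical helper (not an fsspec API): stream `url` through
    fsspec.open() and write the decompressed bytes to `local_path`."""
    with fsspec.open(url, "rb", compression=compression, **storage_options) as src:
        with open(local_path, "wb") as dst:
            # Copy chunk by chunk so the whole file is never held in memory.
            shutil.copyfileobj(src, dst)


# Create a small bz2-compressed file in a temporary directory.
tmpdir = tempfile.mkdtemp()
source = os.path.join(tmpdir, "important_data.txt.bz2")
target = os.path.join(tmpdir, "new")
with open(source, "wb") as fd:
    fd.write(bz2.compress(b"very important data."))

# compression="infer" picks bz2 from the ".bz2" suffix of the source name.
get_decompressed(source, target)
with open(target, "rb") as fd:
    result = fd.read()
print(result)
```

This trades whatever optimisations a backend's own `get_file` provides for a plain streamed copy, which is what makes decompression on the fly possible.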
@martindurant thanks for the clarification! I understand I will have to implement a custom solution for my use case. But I think my point about the kwargs being silently ignored still stands. Wouldn’t it be better to raise an error in such a case?
A general problem throughout the fsspec code is that there are many places that kwargs can get passed to, including general-purpose arguments to the third-party backend libraries. Therefore, most methods only extract the arguments they need and pass everything else along, and whether you get an exception or not depends on how the third-party package is called and what it expects.
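The pass-through pattern described above can be illustrated with a toy sketch. None of these functions are fsspec code: `lenient_backend`, `strict_backend` and this `get_file` are hypothetical stand-ins that only show why the same unknown keyword argument is ignored in one case and rejected in another.

```python
def lenient_backend(path, **kwargs):
    """Stand-in for a backend call that accepts and ignores unknown kwargs."""
    return f"fetched {path}"


def strict_backend(path, block_size=1024):
    """Stand-in for a backend call with a fixed signature."""
    return f"fetched {path} in blocks of {block_size}"


def get_file(backend, path, callback=None, **kwargs):
    """Take the one argument this layer needs (callback), forward the rest."""
    if callback is not None:
        callback(path)
    return backend(path, **kwargs)


# The lenient backend swallows `compression` without complaint.
lenient_result = get_file(lenient_backend, "data.bz2", compression="infer")
print(lenient_result)

# The strict backend rejects it, because its signature has no such parameter.
try:
    get_file(strict_backend, "data.bz2", compression="infer")
    strict_error = None
except TypeError as exc:
    strict_error = exc
print(type(strict_error).__name__)
```

The forwarding layer cannot tell which of the two outcomes will occur without knowing the signature of whatever sits at the end of the chain.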
I understand, thanks for the explanation. Feel free to close this then.