filesystem_spec
filesystem_spec copied to clipboard
`TarFileSystem`: Multiple `open()` fails with `compression`
cifar-10-python.tar:
tfs = fsspec.filesystem('tar', fo='cifar-10-python.tar')
for p in tfs.glob('*/data_batch_*'):
batch = tfs.open(p, 'rb')
batch.close()
cifar-10-python.tar.gz:
tfs = fsspec.filesystem('tar', fo='cifar-10-python.tar.gz', compression='gzip')
for p in tfs.glob('*/data_batch_*'):
batch = tfs.open(p, 'rb')
batch.close()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [20], in <cell line: 19>()
19 for p in tfs.glob('*/data_batch_*'):
---> 20 batch = tfs.open(p, 'rb')
21 batch.close()
File ~/.local/lib/python3.9/site-packages/fsspec/spec.py:1034, in AbstractFileSystem.open(self, path, mode, block_size, cache_options, compression, **kwargs)
1032 else:
1033 ac = kwargs.pop("autocommit", not self._intrans)
-> 1034 f = self._open(
1035 path,
1036 mode=mode,
1037 block_size=block_size,
1038 autocommit=ac,
1039 cache_options=cache_options,
1040 **kwargs,
1041 )
1042 if compression is not None:
1043 from fsspec.compression import compr
File ~/.local/lib/python3.9/site-packages/fsspec/implementations/tar.py:132, in TarFileSystem._open(self, path, mode, **kwargs)
130 else:
131 newfo = copy.copy(self.fo)
--> 132 newfo.seek(offset)
134 return TarContainedFile(newfo, self.info(path))
File /opt/conda/lib/python3.9/gzip.py:392, in GzipFile.seek(self, offset, whence)
390 elif self.mode == READ:
391 self._check_not_closed()
--> 392 return self._buffer.seek(offset, whence)
394 return self.offset
ValueError: seek of closed file
Edit: possible duplicate of https://github.com/fsspec/filesystem_spec/pull/1025#issuecomment-1263651166
Okay did some digging. It turns out the copying of the gzip.GzipFile object (self.fo in this case) was the caveat:
https://github.com/fsspec/filesystem_spec/blob/3969aaf7378198b92df8dee99cd464d272a0d0f4/fsspec/implementations/tar.py#L126-L131
After removing the offending copy.copy function call I was able to run the code without problems. Are there any specific reasons we need a copy of self.fo before calling seek on it? If the goal was to leave the original file object as if it's untouched, the copying should be done to the io.BufferedReader instances as well because we need the code to be functionally consistent across all cases. Meanwhile I am opening a pull request as this is obviously an issue that needs to be fixed immediately.