filesystem_spec icon indicating copy to clipboard operation
filesystem_spec copied to clipboard

`TarFileSystem`: Multiple `open()` fails with `compression`

Open oakaigh opened this issue 3 years ago • 1 comments

cifar-10-python.tar:

tfs = fsspec.filesystem('tar', fo='cifar-10-python.tar')

for p in tfs.glob('*/data_batch_*'): 
    batch = tfs.open(p, 'rb')
    batch.close()

cifar-10-python.tar.gz:

tfs = fsspec.filesystem('tar', fo='cifar-10-python.tar.gz', compression='gzip')

for p in tfs.glob('*/data_batch_*'): 
    batch = tfs.open(p, 'rb')
    batch.close()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [20], in <cell line: 19>()
     19 for p in tfs.glob('*/data_batch_*'): 
---> 20     batch = tfs.open(p, 'rb')
     21     batch.close()

File ~/.local/lib/python3.9/site-packages/fsspec/spec.py:1034, in AbstractFileSystem.open(self, path, mode, block_size, cache_options, compression, **kwargs)
   1032 else:
   1033     ac = kwargs.pop("autocommit", not self._intrans)
-> 1034     f = self._open(
   1035         path,
   1036         mode=mode,
   1037         block_size=block_size,
   1038         autocommit=ac,
   1039         cache_options=cache_options,
   1040         **kwargs,
   1041     )
   1042     if compression is not None:
   1043         from fsspec.compression import compr

File ~/.local/lib/python3.9/site-packages/fsspec/implementations/tar.py:132, in TarFileSystem._open(self, path, mode, **kwargs)
    130 else:
    131     newfo = copy.copy(self.fo)
--> 132 newfo.seek(offset)
    134 return TarContainedFile(newfo, self.info(path))

File /opt/conda/lib/python3.9/gzip.py:392, in GzipFile.seek(self, offset, whence)
    390 elif self.mode == READ:
    391     self._check_not_closed()
--> 392     return self._buffer.seek(offset, whence)
    394 return self.offset

ValueError: seek of closed file

Edit: possible duplicate of https://github.com/fsspec/filesystem_spec/pull/1025#issuecomment-1263651166

oakaigh avatar Oct 05 '22 20:10 oakaigh

Okay did some digging. It turns out the copying of the gzip.GzipFile object (self.fo in this case) was the caveat: https://github.com/fsspec/filesystem_spec/blob/3969aaf7378198b92df8dee99cd464d272a0d0f4/fsspec/implementations/tar.py#L126-L131

After removing the offending copy.copy function call I was able to run the code without problems. Are there any specific reasons we need a copy of self.fo before calling seek on it? If the goal was to leave the original file object as if it's untouched, the copying should be done to the io.BufferedReader instances as well because we need the code to be functionally consistent across all cases. Meanwhile I am opening a pull request as this is obviously an issue that needs to be fixed immediately.

oakaigh avatar Oct 05 '22 22:10 oakaigh