aiofiles

Read lines asynchronously from a text file (compressed or not: txt, txt.gz, txt.bz2), allowing lines to be processed while the next lines are loaded in the background, with support for buffering, encoding, and detailed logging settings.

Open · fabriciorsf opened this issue 8 months ago · 1 comment

See an example (plain txt only) at: https://gist.github.com/91b58e3ab8e10025cfa4a5935bcfaaa4.

To also read compressed files, it could look like this:

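import asyncio
import bz2
import gzip
import logging
import os
from contextlib import asynccontextmanager

# Assumed module-level names: the post uses them without showing their definitions.
LOGGER = logging.getLogger(__name__)
BUFFER_HINT = 64 * 1024  # assumed minimum readlines() hint; the original value is not shown


@asynccontextmanager
async def async_read_txt_file(filename: str,
                              buffer_hint: int = -1,
                              encoding='utf-8',
                              errors=None,
                              verbose=False):
    previous_level = LOGGER.level
    if verbose:
        LOGGER.setLevel(logging.DEBUG)
    # Pick the opener from the file extension; plain open() is the fallback.
    if filename.endswith('.gz'):
        open_file = gzip.open
    elif filename.endswith('.bz2'):
        open_file = bz2.open
    else:
        open_file = open
    # Never ask for less than BUFFER_HINT, nor for more than the file can hold
    # (scaled by 3 for .bz2, since the size on disk is the compressed size).
    multiply_buffer = 3 if filename.endswith('.bz2') else 1
    buffer_hint = max(buffer_hint, BUFFER_HINT)
    buffer_hint = min(buffer_hint, os.path.getsize(filename) * multiply_buffer)

    kwargs = {'mode': 'rt'}
    if encoding is not None:
        kwargs['encoding'] = encoding
    if errors is not None:
        kwargs['errors'] = errors
    LOGGER.info(f"Opening file {filename} with buffer hint {buffer_hint} and keyword arguments {kwargs}...")

    try:
        with open_file(filename, **kwargs) as opened_file:
            def _readlines_():
                LOGGER.debug(f"Reading lines from file {filename}")
                # may be slow as it has disk access
                lines = opened_file.readlines(buffer_hint)
                if lines:
                    LOGGER.debug(f"{len(lines)} lines read from file {filename}")
                else:
                    LOGGER.debug(f"End of file reading: {filename}")
                return lines

            async def _gen_():
                # Read the first chunk in a worker thread too, so the event loop is not blocked.
                lines = await asyncio.to_thread(_readlines_)
                while lines:
                    # Prefetch the next chunk in a worker thread while this one is consumed.
                    task = asyncio.create_task(asyncio.to_thread(_readlines_))
                    for line in lines:
                        yield line
                    lines = await task

            yield _gen_()
    finally:
        # Restore the logger level even if the caller's block raised.
        LOGGER.setLevel(previous_level)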
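
The prefetch pattern used in the function above (hand out one chunk of lines while a worker thread loads the next) can be tried in isolation. The following is only a minimal sketch under stated assumptions, not the code from the gist: `prefetch_lines`, `make_sample`, `count_lines`, and `demo` are hypothetical names, the sample file is a temporary gzip file created just for the demonstration, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import gzip
import os
import tempfile


async def prefetch_lines(path, hint=1024):
    """Yield lines of a gzip text file, loading the next chunk in a thread.

    Hypothetical helper for illustration; `hint` is passed to readlines().
    """
    with gzip.open(path, mode='rt', encoding='utf-8') as handle:
        lines = await asyncio.to_thread(handle.readlines, hint)
        while lines:
            # Start loading the next chunk before handing out the current one.
            task = asyncio.create_task(asyncio.to_thread(handle.readlines, hint))
            try:
                for line in lines:
                    yield line
            finally:
                # Wait for the in-flight read, so the file is never closed
                # while a worker thread is still reading from it.
                lines = await task


def make_sample(n_lines=1000):
    """Write a temporary .txt.gz file with n_lines short lines."""
    fd, path = tempfile.mkstemp(suffix='.txt.gz')
    os.close(fd)
    with gzip.open(path, mode='wt', encoding='utf-8') as handle:
        handle.writelines(f"line {i}\n" for i in range(n_lines))
    return path


async def count_lines(path):
    total = 0
    async for _ in prefetch_lines(path):
        total += 1
    return total


def demo():
    path = make_sample()
    try:
        return asyncio.run(count_lines(path))
    finally:
        os.remove(path)
```

In this variant the file is opened inside the generator itself, so the file stays open for as long as the iteration runs; that is a design choice of the sketch, not something the original post does.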

fabriciorsf, Apr 20 '25 00:04

I think what you're asking for here is out of scope for aiofiles. However, I would take async versions of GzipFile/BZ2File if someone were to contribute quality implementations.
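
For readers wondering what an async version of GzipFile might look like, here is a rough sketch that delegates each blocking call to a worker thread. It is an assumption-laden illustration, not an aiofiles API and not a proposed implementation: `AsyncGzipReader`, `read_all_lines`, and `demo_reader` are hypothetical names, only reading is covered, and a quality implementation would also need writing, seeking, and error handling.

```python
import asyncio
import gzip
import os
import tempfile


class AsyncGzipReader:
    """Hypothetical async wrapper around a gzip file; not part of aiofiles."""

    def __init__(self, path):
        self._path = path
        self._file = None

    async def __aenter__(self):
        # Opening touches the disk, so it runs in a worker thread as well.
        self._file = await asyncio.to_thread(gzip.open, self._path, 'rb')
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await asyncio.to_thread(self._file.close)

    async def read(self, size=-1):
        return await asyncio.to_thread(self._file.read, size)

    async def readline(self):
        return await asyncio.to_thread(self._file.readline)

    def __aiter__(self):
        return self

    async def __anext__(self):
        line = await self.readline()
        if not line:
            # An empty bytes object signals end of file.
            raise StopAsyncIteration
        return line


async def read_all_lines(path):
    async with AsyncGzipReader(path) as reader:
        return [line async for line in reader]


def demo_reader():
    fd, path = tempfile.mkstemp(suffix='.gz')
    os.close(fd)
    try:
        with gzip.open(path, 'wb') as handle:
            handle.write(b"alpha\nbeta\n")
        return asyncio.run(read_all_lines(path))
    finally:
        os.remove(path)
```

One thread hop per `readline()` call is simple but costly for files with many short lines; reading in larger chunks, as in the issue's example, would amortize that overhead.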

Tinche, Apr 20 '25 10:04