Scan zipped files

Open gab23r opened this issue 1 year ago • 1 comments

Problem description

I wish I could use polars to scan zipped csv (and more ?) files.

This example works with read_csv but fails with scan_csv:

import os
import shutil
import zipfile

import pandas as pd
import polars as pl


df = pd.DataFrame({'col': [126.3263, 45.23874]})

# create zip
os.mkdir('tmp')
df.to_csv('./tmp/tmp.csv')
shutil.make_archive('myzip', 'zip', 'tmp')

# try to read the zipped file lazily
# (this is the call that fails: ZipFile.read returns bytes, not a path)
with zipfile.ZipFile('myzip.zip') as zipFile:
    df = pl.scan_csv(zipFile.read('tmp.csv'))

Jun 28 '23 15:06 gab23r

scan_csv needs to receive a path, whereas reading from a zip archive means handing Polars a file handle to the member inside the archive, because a zip can contain more than one file. Even for files that do have an unambiguous path, such as csv.gz and csv.xz, scan_csv will refuse to read them and ask you to use read_csv instead (see https://github.com/pola-rs/polars/issues/7287).

Feb 21 '24 20:02 sm-Fifteen
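
Since the scan functions are described above as needing a path, one possible workaround for the zip case is to extract the wanted member to disk first and scan the resulting file. The sketch below uses only the standard library; `extract_member_to_path` is a hypothetical helper name, not something Polars or `zipfile` provides, and the final `pl.scan_csv` call is left as a comment because it assumes Polars is installed.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_member_to_path(zip_path, member, dest_dir=None):
    """Extract one member of a zip archive and return its on-disk path.

    Hypothetical helper, not part of Polars or the zipfile module.
    """
    # Use a fresh temporary directory unless the caller supplies one.
    if dest_dir is None:
        dest_dir = tempfile.mkdtemp(prefix="unzipped_")
    with zipfile.ZipFile(zip_path) as archive:
        # ZipFile.extract returns the path of the extracted file.
        extracted = archive.extract(member, path=dest_dir)
    return Path(extracted)


if __name__ == "__main__":
    # Build a tiny archive so the example is self-contained.
    workdir = Path(tempfile.mkdtemp())
    archive_path = workdir / "myzip.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("tmp.csv", "col\n126.3263\n45.23874\n")

    csv_path = extract_member_to_path(archive_path, "tmp.csv")
    print(csv_path.read_text())
    # With Polars installed, the path could then be scanned lazily:
    # import polars as pl
    # lf = pl.scan_csv(csv_path)
```

The trade-off is disk usage: the member is fully decompressed before anything is scanned, and the caller is responsible for cleaning up the temporary directory.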

read_csv can read single compressed files just fine. But when globbing, scan_csv gets called instead, and the read fails. Not sure why this doesn't work in the current implementation.

Jun 17 '24 18:06 neverlink
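
For the globbing case, a possible workaround (a sketch under assumptions, not how Polars handles this internally) is to decompress the matching files into a scratch directory and then glob over the plain CSVs. `decompress_glob` is a hypothetical helper, and it assumes gzip-compressed inputs with distinct file names.

```python
import glob
import gzip
import shutil
import tempfile
from pathlib import Path


def decompress_glob(pattern, dest_dir=None):
    """Decompress every .gz file matching `pattern` into `dest_dir`.

    Hypothetical helper; returns the paths of the decompressed files.
    Files with the same name in different directories would overwrite
    each other, so distinct names are assumed.
    """
    dest = Path(dest_dir or tempfile.mkdtemp(prefix="decompressed_"))
    dest.mkdir(parents=True, exist_ok=True)
    outputs = []
    for source in sorted(glob.glob(pattern)):
        # "data.csv.gz" becomes "data.csv" in the destination directory.
        target = dest / Path(source).with_suffix("").name
        with gzip.open(source, "rb") as src, open(target, "wb") as dst:
            # Stream the copy so large files are not held in memory.
            shutil.copyfileobj(src, dst)
        outputs.append(target)
    return outputs
```

The returned paths all live in one directory, so a glob over that directory (for example `*.csv`) could then be handed to scan_csv.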