polars
Cannot properly read `csv.gz` file in Google Storage bucket
Polars version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Issue description
A `csv.gz` file in a Google Storage (GS) bucket cannot be read properly using `pl.read_csv`. The results appear to be garbled:
shape: (0, 1)
┌──────────────────────────────┐
│ �6�d�test_a1.csv J�I�I�J�1ԩ… │
│ --- │
│ str │
╞══════════════════════════════╡
└──────────────────────────────┘
Reproducible example
>>> import polars as pl
>>> df_ng = pl.read_csv("gs://my_bucket/path/to/test_a1.csv.gz")
>>> df_ng
shape: (0, 1)
┌──────────────────────────────┐
│ �6�d�test_a1.csv J�I�I�J�1ԩ… │
│ --- │
│ str │
╞══════════════════════════════╡
└──────────────────────────────┘
Expected behavior
The expected result is what you get when reading the same file from the local machine (the file data is linked below):
# reading csv.gz in local PC is no problem
>>> df_local = pl.read_csv("/my_pc/path/to/test_a1.csv.gz")
>>> df_local
shape: (2, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ a ┆ 1 ┆ x │
│ a ┆ 1 ┆ y │
└─────┴─────┴─────┘
Installed versions
--------Version info---------
Polars: 0.18.3
Index type: UInt32
Platform: macOS-13.4-arm64-arm-64bit
Python: 3.10.11 (main, May 17 2023, 14:30:36) [Clang 14.0.6 ]
----Optional dependencies----
numpy: 1.25.0
pandas: 2.0.2
pyarrow: 12.0.1
connectorx: <not installed>
deltalake: <not installed>
fsspec: 2023.6.0
matplotlib: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
A similar result occurs with AWS S3.
>>> import polars as pl
>>> df_ng = pl.read_csv("s3://my_bucket/path/to/test_a1.csv.gz")
>>> df_ng
shape: (0, 1)
┌─────────────────────────────┐
│ <�d�test_a1.csv J�I�I�J�1ԩ… │
│ --- │
│ str │
╞═════════════════════════════╡
└─────────────────────────────┘
First decompress the file.
`pl.read_csv` does not support compressed files?
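The "decompress first" suggestion could be sketched roughly as below. This is a workaround sketch, not a fix for the underlying issue. It assumes `fsspec` (listed in the installed versions above) has a matching backend available for the URL scheme, such as `gcsfs` for `gs://` or `s3fs` for `s3://`. The helper names `maybe_gunzip` and `read_remote_csv_gz` are hypothetical and not part of Polars.

```python
import gzip
import io

# Every gzip stream starts with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(raw: bytes) -> bytes:
    """Return the decompressed payload if `raw` looks like gzip, else `raw` unchanged."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


def read_remote_csv_gz(url: str):
    """Hypothetical helper: download the object, decompress it, then parse it.

    Assumes an fsspec backend for the URL scheme is installed
    (e.g. gcsfs for gs://, s3fs for s3://).
    """
    # Imported lazily so maybe_gunzip stays usable without these packages.
    import fsspec
    import polars as pl

    # Read the whole object into memory as raw (still compressed) bytes.
    with fsspec.open(url, "rb") as f:
        raw = f.read()

    # Hand Polars plain CSV bytes through an in-memory buffer.
    return pl.read_csv(io.BytesIO(maybe_gunzip(raw)))
```

With the path from the report, usage would be `df = read_remote_csv_gz("gs://my_bucket/path/to/test_a1.csv.gz")`. The whole object is held in memory, so this only suits files that fit comfortably in RAM.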
Looks like `pl.read_csv` doesn't support gz files from cloud storage. Probably some issue with the metadata.
I'm happy to look into it and work on this item.
I'm actually reading `csv.gz` from local and AWS. Is this issue solved?