Cannot properly read `csv.gz` file in Google Storage bucket

Open sndpgm opened this issue 1 year ago • 5 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Issue description

A csv.gz file in a Google Storage (GS) bucket cannot be read properly with pl.read_csv. The result is garbled:

shape: (0, 1)
┌──────────────────────────────┐
│ �6�d�test_a1.csv J�I�I�J�1ԩ… │
│ ---                          │
│ str                          │
╞══════════════════════════════╡
└──────────────────────────────┘

Reproducible example

>>> import polars as pl
>>> df_ng = pl.read_csv("gs://my_bucket/path/to/test_a1.csv.gz")
>>> df_ng
shape: (0, 1)
┌──────────────────────────────┐
│ �6�d�test_a1.csv J�I�I�J�1ԩ… │
│ ---                          │
│ str                          │
╞══════════════════════════════╡
└──────────────────────────────┘

Expected behavior

The expected result is the same as when reading the same file from the local PC (the file is linked below):

# reading the csv.gz file from the local PC works without problems
>>> df_local = pl.read_csv("/my_pc/path/to/test_a1.csv.gz")
>>> df_local
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ a   ┆ 1   ┆ x   │
│ a   ┆ 1   ┆ y   │
└─────┴─────┴─────┘

test_a1.csv.gz

Installed versions

--------Version info---------
Polars:      0.18.3
Index type:  UInt32
Platform:    macOS-13.4-arm64-arm-64bit
Python:      3.10.11 (main, May 17 2023, 14:30:36) [Clang 14.0.6 ]
----Optional dependencies----
numpy:       1.25.0
pandas:      2.0.2
pyarrow:     12.0.1
connectorx:  <not installed>
deltalake:   <not installed>
fsspec:      2023.6.0
matplotlib:  <not installed>
xlsx2csv:    <not installed>
xlsxwriter:  <not installed>

sndpgm avatar Jun 21 '23 01:06 sndpgm

A similar result occurs with AWS S3.

>>> import polars as pl
>>> df_ng = pl.read_csv("s3://my_bucket/path/to/test_a1.csv.gz")
>>> df_ng
shape: (0, 1)
┌─────────────────────────────┐
│ <�d�test_a1.csv J�I�I�J�1ԩ… │
│ ---                         │
│ str                         │
╞═════════════════════════════╡
└─────────────────────────────┘

sndpgm avatar Jun 21 '23 12:06 sndpgm

First decompress the file.
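
One way this suggestion could look in practice is sketched below. It is a sketch under assumptions, not the Polars maintainers' implementation: it assumes fsspec (listed under the optional dependencies in the report) together with a matching backend such as gcsfs or s3fs is installed and configured, and the helper names gunzip_bytes and read_csv_gz_from_bucket are hypothetical.

```python
import gzip
import io


def gunzip_bytes(raw: bytes) -> bytes:
    """Return the decompressed payload if `raw` is gzip data, else `raw` unchanged."""
    # Every gzip stream starts with the two magic bytes 0x1f 0x8b.
    if raw[:2] == b"\x1f\x8b":
        return gzip.decompress(raw)
    return raw


def read_csv_gz_from_bucket(url: str):
    """Hypothetical helper: download, decompress, then parse with Polars."""
    # Imported here so that gunzip_bytes stays usable without these packages.
    import fsspec  # assumption: a backend such as gcsfs or s3fs is installed
    import polars as pl

    # Fetch the raw (still compressed) bytes from the bucket.
    with fsspec.open(url, mode="rb") as f:
        raw = f.read()

    # Decompress in memory and hand Polars a plain CSV buffer.
    return pl.read_csv(io.BytesIO(gunzip_bytes(raw)))
```

Usage would be along the lines of read_csv_gz_from_bucket("gs://my_bucket/path/to/test_a1.csv.gz"). The whole file is held in memory, so this only suits files that fit comfortably in RAM.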

ritchie46 avatar Jun 21 '23 13:06 ritchie46

Does pl.read_csv not support compressed files?

sndpgm avatar Jun 21 '23 15:06 sndpgm

It looks like pl.read_csv doesn't support gz files from cloud storage. It is probably some issue with the metadata.

I'm happy to look into it and work on this item.

SridharCR avatar Jul 04 '23 04:07 SridharCR

I'm actually reading csv.gz files from both local storage and AWS. Is this issue solved?

29antonioac avatar Apr 28 '24 19:04 29antonioac