
read_csv on gzipped csv much slower if n_rows specified

Open dpinol opened this issue 1 year ago • 2 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Generate a 21 MB compressed CSV with 10k columns and 1k rows:

#!/usr/bin/env bash

rm -f row.csv
rm -f big.csv
rm -f big.csv.gz

COLS=10000
for ((i=1; i <= COLS; i++)); do
    echo -n "$i" >> row.csv
    if (( i < COLS )); then
        echo -n "," >> row.csv
    else
        echo "" >> row.csv
    fi
done

for _ in {1..1000}; do
    cat row.csv >> big.csv
done
gzip big.csv

Read it back with Polars (nrows_bug.py):

import os

import polars as pl

path = "big.csv.gz"
if "N_ROWS" in os.environ:
    n_rows = int(os.environ["N_ROWS"])
else:
    n_rows = None
df = pl.read_csv(path, n_rows=n_rows)
print(len(df))

Log output

Reading all rows is fast:


time python nrows_bug.py 
999
0.39s user 0.09s system 101% cpu 0.474 total

With a small n_rows it is faster. This is fine:

time N_ROWS=10 python nrows_bug.py
10
0.16s user 0.04s system 104% cpu 0.195 total

But as n_rows increases, it gradually becomes much slower than reading the full file:

time N_ROWS=100 python nrows_bug.py
100
0.70s user 0.04s system 100% cpu 0.736 total

time N_ROWS=1000 python nrows_bug.py
999
5.58s user 0.09s system 100% cpu 5.658 total

Issue description

read_csv on a gzipped CSV with many columns (> 10k) is much slower (about 10x in the example above) when n_rows is specified. Possibly related to https://github.com/pola-rs/polars/issues/10579?

Expected behavior

Specifying fewer rows than the CSV file contains should never be slower than reading the full file.
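
For context: gzip decompression is inherently streaming, so reading only a prefix of a compressed file should require less work, not more. A minimal stdlib sketch (independent of Polars, using an in-memory gzipped CSV rather than the big.csv.gz from the repro) illustrating an early-stop read:

```python
import gzip
import io
import itertools

# Build a small gzipped CSV in memory: 1000 rows x 10 columns.
row = ",".join(str(c) for c in range(10))
data = ("\n".join(row for _ in range(1000)) + "\n").encode()
buf = io.BytesIO(gzip.compress(data))

# gzip decompression is streaming: islice stops the reader after
# n_rows lines, so only a prefix of the file is inflated.
n_rows = 5
with gzip.open(buf, mode="rt") as f:
    head = [line.rstrip("\n").split(",") for line in itertools.islice(f, n_rows)]

print(len(head), len(head[0]))  # 5 10
```

This is only an illustration of the expected cost model, not of Polars internals; whatever the n_rows code path does on top of decompression appears to be where the extra time goes.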

Installed versions

--------Version info---------
Polars:               1.5.0
Index type:           UInt32
Platform:             Linux-6.8.0-31-generic-x86_64-with-glibc2.39
Python:               3.12.3 (main, Apr 11 2024, 10:16:04) [GCC 13.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.6.1
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.9.2
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                2.4.0+cu121
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

dpinol avatar Aug 23 '24 06:08 dpinol

@dpinol See: https://github.com/pola-rs/polars/issues/18724#issuecomment-2657400855

ghuls avatar Feb 13 '25 18:02 ghuls

@dpinol can you please verify that this is unique to compressed CSVs, i.e. that the slowdown does not occur with uncompressed CSVs?

Voultapher avatar Dec 11 '25 16:12 Voultapher