read_csv on gzipped csv much slower if n_rows specified
### Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
### Reproducible example

Generate a ~21 MB compressed CSV with 10k columns and 1k rows:
```bash
#!/usr/bin/env bash
rm -f row.csv big.csv big.csv.gz

COLS=10000
for ((i = 1; i <= COLS; i++)); do
    echo -n "$i" >> row.csv
    if [[ $i -lt $COLS ]]; then
        echo -n "," >> row.csv
    else
        echo "" >> row.csv
    fi
done

for _ in {1..1000}; do
    cat row.csv >> big.csv
done
gzip big.csv
```
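For reference, a Python sketch that should generate an equivalent fixture (assuming it is interchangeable with the bash script above):

```python
import gzip

# Sketch of an equivalent fixture generator: 10k columns, 1k identical rows,
# gzip-compressed, matching the bash script above.
COLS = 10_000
ROWS = 1_000

row = ",".join(str(i) for i in range(1, COLS + 1)) + "\n"
with gzip.open("big.csv.gz", "wt") as f:
    f.writelines(row for _ in range(ROWS))
```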
```python
import os

import polars as pl

path = "big.csv.gz"
if "N_ROWS" in os.environ:
    n_rows = int(os.environ["N_ROWS"])
else:
    n_rows = None

df = pl.read_csv(path, n_rows=n_rows)
print(len(df))
```
### Log output

Reading all rows is fast:
```
time python nrows_bug.py
999
0.39s user 0.09s system 101% cpu 0.474 total
```
With a small n_rows it is even faster, which is expected:
```
time N_ROWS=10 python nrows_bug.py
10
0.16s user 0.04s system 104% cpu 0.195 total
```
But as n_rows increases, it gradually becomes much slower than reading the full file:
```
time N_ROWS=100 python nrows_bug.py
100
0.70s user 0.04s system 100% cpu 0.736 total

time N_ROWS=1000 python nrows_bug.py
999
5.58s user 0.09s system 100% cpu 5.658 total
```
### Issue description

read_csv on a gzipped CSV with many columns (>10k) is much slower (~10x in the example above) when n_rows is specified. Possibly related to https://github.com/pola-rs/polars/issues/10579?
### Expected behavior

Specifying fewer rows than the CSV file contains should not be slower than reading the full file.
### Installed versions

```
--------Version info---------
Polars:              1.5.0
Index type:          UInt32
Platform:            Linux-6.8.0-31-generic-x86_64-with-glibc2.39
Python:              3.12.3 (main, Apr 11 2024, 10:16:04) [GCC 13.2.0]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fastexcel:           <not installed>
fsspec:              2024.6.1
gevent:              <not installed>
great_tables:        <not installed>
hvplot:              <not installed>
matplotlib:          3.9.2
nest_asyncio:        1.6.0
numpy:               1.26.4
openpyxl:            <not installed>
pandas:              2.2.2
pyarrow:             17.0.0
pydantic:            <not installed>
pyiceberg:           <not installed>
sqlalchemy:          <not installed>
torch:               2.4.0+cu121
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>
```
@dpinol See: https://github.com/pola-rs/polars/issues/18724#issuecomment-2657400855
@dpinol Can you please verify that this is unique to compressed CSVs, and that the slowdown doesn't happen with uncompressed CSVs?