
pl.read_csv() does not recognize U/Int8, U/Int16 dtypes

Open mcrumiller opened this issue 2 years ago • 3 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Issue description

When providing 1- and 2-byte integer dtypes to pl.read_csv(), a ComputeError is raised.

Reproducible example

import polars as pl
from io import BytesIO

for dtype in [pl.UInt8, pl.Int8, pl.UInt16, pl.Int16, pl.UInt32, pl.Int32,  pl.UInt64, pl.Int64]:
    buffer = BytesIO()
    pl.DataFrame({'a': pl.Series([1, 2, 3, 4, 5], dtype=dtype)}).write_csv(buffer)

    try:
        pl.read_csv(buffer, dtypes=[dtype])
    except Exception as e:
        print(e)

    buffer.close()
Unsupported data type UInt8 when reading a csv
Unsupported data type Int8 when reading a csv
Unsupported data type UInt16 when reading a csv
Unsupported data type Int16 when reading a csv

Expected behavior

No output

Installed versions

---Version info---
Polars: 0.14.18
Index type: UInt32
Platform: Windows-10-10.0.19044-SP0
Python: 3.9.5 (tags/v3.9.5:0a7dcbd, May  3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)]
---Optional dependencies---
pyarrow: 8.0.0
pandas: 1.4.3
numpy: 1.23.1
fsspec: <not installed>
connectorx: 0.3.0
xlsx2csv: 0.8
matplotlib: 3.6.1

mcrumiller avatar Oct 14 '22 17:10 mcrumiller

Was there a hack to fix the issue in the Python API? It's odd to me that the Python API read_csv does support U/Int8 and U/Int16 types now, while the Rust API CsvReader::with_dtypes does NOT support them.

dclong avatar Jan 02 '23 01:01 dclong

Did you activate the dtype features?

ritchie46 avatar Jan 02 '23 11:01 ritchie46

@ritchie46,

You were right! Including the dtype-full feature made it work.

dclong avatar Jan 02 '23 18:01 dclong

While read_csv now supports reading 1- and 2-byte integers, both read_csv_batched and scan_csv report that they are not supported: "Unsupported data type Int8 when reading a csv". Scanning as 4-byte integers and downcasting before a collect() uses significantly more memory (e.g. for large arrays of genotypes). The workaround for now is to use read_csv_batched in smaller chunks.

tikkanz avatar Feb 16 '23 02:02 tikkanz

While read_csv now supports reading 1- and 2-byte integers, both read_csv_batched and scan_csv report that they are not supported: "Unsupported data type Int8 when reading a csv". Scanning as 4-byte integers and downcasting before a collect() uses significantly more memory (e.g. for large arrays of genotypes). The workaround for now is to use read_csv_batched in smaller chunks.

The problem is that LazyCsvReader::with_dtype_overwrite does not support these datatypes, while CsvReader::with_dtypes does.

ptiza avatar Mar 01 '23 22:03 ptiza

Closed via #7290.

mcrumiller avatar Jun 08 '23 19:06 mcrumiller