polars
pl.read_csv() does not recognize U/Int8, U/Int16 dtypes
Polars version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Issue description
When providing 1- and 2-byte integer dtypes to pl.read_csv(), a ComputeError is raised.
Reproducible example
import polars as pl
from io import BytesIO

for dtype in [pl.UInt8, pl.Int8, pl.UInt16, pl.Int16, pl.UInt32, pl.Int32, pl.UInt64, pl.Int64]:
    buffer = BytesIO()
    pl.DataFrame({'a': pl.Series([1, 2, 3, 4, 5], dtype=dtype)}).write_csv(buffer)
    buffer.seek(0)  # rewind before reading the buffer back
    try:
        pl.read_csv(buffer, dtypes=[dtype])
    except Exception as e:
        print(e)
Unsupported data type UInt8 when reading a csv
Unsupported data type Int8 when reading a csv
Unsupported data type UInt16 when reading a csv
Unsupported data type Int16 when reading a csv
Expected behavior
No output
Installed versions
---Version info---
Polars: 0.14.18
Index type: UInt32
Platform: Windows-10-10.0.19044-SP0
Python: 3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)]
---Optional dependencies---
pyarrow: 8.0.0
pandas: 1.4.3
numpy: 1.23.1
fsspec: <not installed>
connectorx: 0.3.0
xlsx2csv: 0.8
matplotlib: 3.6.1
Was there a hack to fix this issue in the Python API? It's odd to me that the Python API's read_csv does support U/Int8 and U/Int16 types now, while the Rust API's CsvReader::with_dtypes does NOT support U/Int8 and U/Int16 types.
Did you activate the dtype features?
@ritchie46, you were right! Including the dtype-full feature made it work.
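For anyone hitting the same error from Rust, the fix is enabling the dtype-full feature on the polars crate in Cargo.toml (a minimal sketch; the version number is illustrative for the era of this thread):

```toml
[dependencies]
polars = { version = "0.14", features = ["dtype-full"] }
```

Without a dtype-* feature, the 1- and 2-byte integer types are compiled out and the CSV reader reports them as unsupported.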
While read_csv now supports reading 1- and 2-byte integers, both read_csv_batched and scan_csv report that it is not supported: Unsupported data type Int8 when reading a csv. Scanning as 4-byte integers and downcasting before a collect() uses significantly more memory (e.g. for large arrays of genotypes). The workaround for now is to use read_csv_batched in smaller chunks.
The problem is that LazyCsvReader::with_dtype_overwrite does not support these datatypes, while CsvReader::with_dtypes does.
Closed via #7290.