
BUG: pd.read_csv does not work with nullable_dtype coercion

Open MCRE-BE opened this issue 2 years ago • 4 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from io import StringIO
from textwrap import dedent
csv = dedent("""
    Site;Weight MD;CodeWeight;MAX SM;Pallet Equ.;Crate Equ.
    BW08;0,24;2;999;0,03125;0,14286
    BW08;0,24;2;999;0,03125;0,14286
    BW08;0,24;2;999;0,03125;0,14286
    BW01;0;0;999;0,00625;1
""")[1:]
csv_param = {
    'decimal' : ',',
    'sep' : ';',
    'encoding' : "utf-8"
}

# Examples that fail
layout = {
    'Site': 'string',
    'Weight MD': "Float64",
    'CodeWeight': "UInt8",
    'MAX SM': "Float64",
}
pd.read_csv(StringIO(csv), usecols=layout.keys(), dtype=layout, **csv_param)

layout ={
    'Site': 'string',
    'Weight MD': pd.Float64Dtype,
    'CodeWeight':pd.UInt8Dtype,
    'MAX SM': pd.Int64Dtype,
}
pd.read_csv(StringIO(csv), usecols=layout.keys(), dtype=layout, **csv_param)

layout ={
    'Site': 'string[pyarrow]',
    'Weight MD': 'double[pyarrow]',
    'CodeWeight':'int64[pyarrow]',
    'MAX SM': 'int64[pyarrow]',
}
pd.read_csv(StringIO(csv), usecols=layout.keys(), dtype=layout, **csv_param)

Error thrown :

ValueError: Unable to parse string "0,24" at position 0

But it can read :

data = pd.read_csv(StringIO(csv), usecols=layout.keys(), dtype_backend="numpy_nullable", **csv_param)
data.dtypes

Site          string[pyarrow]
Weight MD             Float64
CodeWeight              Int64
MAX SM                  Int64
dtype: object
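Since reading without an explicit dtype succeeds, one possible interim approach is to let read_csv infer the dtypes first and cast afterwards. The snippet below is only a sketch under that assumption: it reuses a shortened copy of the sample data and the first layout from the example, and the two-step read-then-astype pattern is a suggestion rather than anything confirmed by the maintainers.

```python
from io import StringIO
from textwrap import dedent

import pandas as pd

# Shortened copy of the sample data from the report.
csv = dedent("""
    Site;Weight MD;CodeWeight;MAX SM;Pallet Equ.;Crate Equ.
    BW08;0,24;2;999;0,03125;0,14286
    BW01;0;0;999;0,00625;1
""")[1:]

layout = {
    "Site": "string",
    "Weight MD": "Float64",
    "CodeWeight": "UInt8",
    "MAX SM": "Float64",
}

# Step 1: read without dtype, so the decimal comma is handled during inference.
data = pd.read_csv(StringIO(csv), usecols=list(layout), sep=";", decimal=",")

# Step 2: cast the already-parsed columns to the requested nullable dtypes.
data = data.astype(layout)
print(data.dtypes)
```

The cost is an extra pass over the data, and any column that cannot be inferred as numeric would still need manual cleaning before the cast.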

Issue Description

I think it's related to #49146: dtype coercion does not work with nullable dtypes.

Expected Behavior

It should be able to read them with the requested dtype

Installed Versions

INSTALLED VERSIONS

commit : 478d340667831908b5b4bf09a2787a11a14560c9
python : 3.10.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : AMD64 Family 23 Model 24 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_Netherlands.1252

pandas : 2.0.0
numpy : 1.23.5
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.6.1
pip : 23.0.1
Cython : None
pytest : 7.2.2
hypothesis : None
sphinx : 6.1.3
blosc : None
feather : None
xlsxwriter : 3.0.9
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: 0.10.0
bs4 : 4.12.1
bottleneck : None
brotli :
fastparquet : None
fsspec : 2023.3.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.0
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : 1.0.10
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : 0.19.0
tzdata : 2023.3
qtpy : 2.3.1
pyqt5 : None

MCRE-BE • Apr 11 '23 14:04

Hi, thanks for your report. I think we have an open issue about this (this looks familiar to me). Could you double check the issue tracker?

phofl • Apr 11 '23 21:04

Hey

Searching for "pd.read_csv", I went back to October 2022 without finding any mention of bugs related to dtype coercion.

What seems similar:

  • #52086: this one is the closest (but it has a horrible title). It seems that @hxy450 is working on it? But the bug has been open for more than 3 weeks.

  • #49146: this mentions a solved bug, but it seems the fix does not work (?)

What I can find that could be related:

  • #52301: likely a bug in how we pass some kwargs to an engine

  • #52266: seems closely related, as I don't understand why the comma is not being recognized with nullable dtypes

If #52086 is the same, we can close either this one or the other (in that case, sorry for the duplicate; I didn't open every single issue to read its content 😄)

MCRE-BE • Apr 12 '23 04:04

The bug is still present in 2.0.3.

MCRE-BE • Jul 27 '23 09:07

The bug is still present in 2.2.2.

INSTALLED VERSIONS

commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.12.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : AMD64 Family 23 Model 24 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Dutch_Belgium.1252

pandas : 2.2.2
numpy : 2.0.1
pytz : 2024.1
dateutil : 2.9.0
setuptools : 72.2.0
pip : 24.2
Cython : None
pytest : 8.3.2
hypothesis : None
sphinx : 7.4.7
blosc : None
feather : None
xlsxwriter : 3.2.0
lxml.etree : 5.3.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.21.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.4.0
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.1
gcsfs : None
matplotlib : 3.9.1
numba : 0.60.0
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : 1.0.10
s3fs : None
scipy : 1.14.1
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : 0.23.0
tzdata : 2024.1
qtpy : 2.4.1
pyqt5 : None

MCRE-BE • Aug 28 '24 05:08

I investigated this issue further on main (commit 16b7288ecc).

If you explicitly specify the engine, you can see that the c engine fails to parse the data when decimal is anything other than "." and dtype=pd.Float64Dtype is passed at the same time (dtype="Float64" gives the same result).

Regarding the test suite, there is a single test function that tests the c engine with the 'delimiter' parameter, and it can be run in a development environment with

> pytest pandas/tests/io/parser/test_c_parser_only.py::test_1000_sep_with_decimal

By adding the argument dtype=pd.Float64Dtype to the read_csv() call within this test, one obtains exactly the same error. In my opinion this issue and #52086 are the same, while the other issues @MCRE-BE mentioned are related to different problems in the code. #52086 provides a better minimal reproducible example.

If relevant, I can open a PR with a test for this issue (marked as failing).

Anyway, one workaround is to use any engine other than c.
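As a rough illustration of that workaround, the sketch below passes engine="python" together with the nullable dtypes. It assumes the behaviour described in this comment (observed on main at the time) also holds for your pandas version. The data is a shortened copy of the sample from the report.

```python
from io import StringIO
from textwrap import dedent

import pandas as pd

# Shortened copy of the sample data from the report.
csv = dedent("""
    Site;Weight MD;CodeWeight;MAX SM;Pallet Equ.;Crate Equ.
    BW08;0,24;2;999;0,03125;0,14286
    BW01;0;0;999;0,00625;1
""")[1:]

layout = {
    "Site": "string",
    "Weight MD": "Float64",
    "CodeWeight": "UInt8",
    "MAX SM": "Float64",
}

# Same call as the failing example, but with the python engine instead of
# the default c engine.
data = pd.read_csv(
    StringIO(csv),
    usecols=list(layout),
    dtype=layout,
    sep=";",
    decimal=",",
    engine="python",
)
print(data.dtypes)
```

The python engine is generally slower than the c engine, so this is more of a stopgap for small or medium files than a fix.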

aureliobarbosa • Sep 12 '24 13:09