BUG: pd.read_csv does not work with nullable dtype coercion
Pandas version checks

- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
from io import StringIO
from textwrap import dedent

import pandas as pd
csv = dedent("""
Site;Weight MD;CodeWeight;MAX SM;Pallet Equ.;Crate Equ.
BW08;0,24;2;999;0,03125;0,14286
BW08;0,24;2;999;0,03125;0,14286
BW08;0,24;2;999;0,03125;0,14286
BW01;0;0;999;0,00625;1
""")[1:]
csv_param = {
    'decimal': ',',
    'sep': ';',
    'encoding': "utf-8",
}
# Examples that fail
layout = {
    'Site': 'string',
    'Weight MD': "Float64",
    'CodeWeight': "UInt8",
    'MAX SM': "Float64",
}
pd.read_csv(StringIO(csv), usecols=layout.keys(), dtype=layout, **csv_param)
layout = {
    'Site': 'string',
    'Weight MD': pd.Float64Dtype(),
    'CodeWeight': pd.UInt8Dtype(),
    'MAX SM': pd.Int64Dtype(),
}
pd.read_csv(StringIO(csv), usecols=layout.keys(), dtype=layout, **csv_param)
layout = {
    'Site': 'string[pyarrow]',
    'Weight MD': 'double[pyarrow]',
    'CodeWeight': 'int64[pyarrow]',
    'MAX SM': 'int64[pyarrow]',
}
pd.read_csv(StringIO(csv), usecols=layout.keys(), dtype=layout, **csv_param)
Error thrown:
ValueError: Unable to parse string "0,24" at position 0
But it can read:
data = pd.read_csv(StringIO(csv), usecols=layout.keys(), dtype_backend="numpy_nullable", **csv_param)
data.dtypes
Site string[pyarrow]
Weight MD Float64
CodeWeight Int64
MAX SM Int64
dtype: object
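The working call above suggests a two-step workaround sketch: let read_csv infer nullable dtypes via dtype_backend first, then cast to the desired dtypes afterwards. A minimal version (the inline CSV string here is a shortened stand-in for the example above):

```python
from io import StringIO

import pandas as pd

# Shortened version of the CSV from the report above.
csv = "Site;Weight MD;CodeWeight\nBW08;0,24;2\nBW01;0;0\n"

# Step 1: parse with decimal handling, letting pandas infer nullable dtypes.
df = pd.read_csv(StringIO(csv), sep=";", decimal=",",
                 dtype_backend="numpy_nullable")

# Step 2: cast to the dtypes we actually wanted; the values are already
# parsed as floats, so the cast no longer involves the decimal comma.
df = df.astype({"Site": "string", "Weight MD": "Float64",
                "CodeWeight": "UInt8"})
```

This sidesteps the failing in-parser conversion at the cost of a second pass over the columns.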
Issue Description
I think it's related to #49146: dtype coercion does not work with nullable dtypes.
Expected Behavior
read_csv should be able to parse the data with the requested nullable dtypes.
Installed Versions
INSTALLED VERSIONS
commit : 478d340667831908b5b4bf09a2787a11a14560c9 python : 3.10.10.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19042 machine : AMD64 processor : AMD64 Family 23 Model 24 Stepping 1, AuthenticAMD byteorder : little LC_ALL : None LANG : None LOCALE : English_Netherlands.1252
pandas : 2.0.0 numpy : 1.23.5 pytz : 2023.3 dateutil : 2.8.2 setuptools : 67.6.1 pip : 23.0.1 Cython : None pytest : 7.2.2 hypothesis : None sphinx : 6.1.3 blosc : None feather : None xlsxwriter : 3.0.9 lxml.etree : 4.9.2 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.12.0 pandas_datareader: 0.10.0 bs4 : 4.12.1 bottleneck : None brotli : fastparquet : None fsspec : 2023.3.0 gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : 3.1.0 pandas_gbq : None pyarrow : 11.0.0 pyreadstat : None pyxlsb : 1.0.10 s3fs : None scipy : 1.10.1 snappy : None sqlalchemy : None tables : None tabulate : 0.9.0 xarray : None xlrd : 2.0.1 zstandard : 0.19.0 tzdata : 2023.3 qtpy : 2.3.1 pyqt5 : None
Hi, thanks for your report. I think we have an open issue about this (this looks familiar to me). Could you double-check the issue tracker?
Hey
Searching for "pd.read_csv" I went back to October 2022 without finding any mention of bugs related to dtype coercion.

What seems similar:
- #52086: this one is the closest (though the bug title is unclear). It seems @hxy450 is working on it, but it has been open for more than 3 weeks.
- #49146: this mentions a solved bug, but it seems the fix does not work (?)

What I found that could be related:
- #52301: likely a bug in how some kwargs are passed to an engine.
- #52266: seems closely related, as I don't understand why the comma is not recognized with nullable dtypes.

If #52086 is the same, we can close either this one or the other (sorry for the duplicate, but I didn't open every single issue to read its content 😄).
The bug is still present in 2.0.3.
The bug is still present in 2.2.2.
INSTALLED VERSIONS
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140 python : 3.12.5.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19042 machine : AMD64 processor : AMD64 Family 23 Model 24 Stepping 1, AuthenticAMD byteorder : little LC_ALL : None LANG : None LOCALE : Dutch_Belgium.1252
pandas : 2.2.2 numpy : 2.0.1 pytz : 2024.1 dateutil : 2.9.0 setuptools : 72.2.0 pip : 24.2 Cython : None pytest : 8.3.2 hypothesis : None sphinx : 7.4.7 blosc : None feather : None xlsxwriter : 3.2.0 lxml.etree : 5.3.0 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 3.1.4 IPython : 8.21.0 pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : 4.12.3 bottleneck : 1.4.0 dataframe-api-compat : None fastparquet : None fsspec : 2024.6.1 gcsfs : None matplotlib : 3.9.1 numba : 0.60.0 numexpr : 2.10.0 odfpy : None openpyxl : 3.1.5 pandas_gbq : None pyarrow : 17.0.0 pyreadstat : None python-calamine : None pyxlsb : 1.0.10 s3fs : None scipy : 1.14.1 sqlalchemy : None tables : None tabulate : 0.9.0 xarray : None xlrd : 2.0.1 zstandard : 0.23.0 tzdata : 2024.1 qtpy : 2.4.1 pyqt5 : None
I did further investigation of this issue on main (commit 16b7288ecc).
If you explicitly specify the engine, you can see that the c engine is unable to parse the data whenever decimal is anything other than . and, at the same time, dtype is pd.Float64Dtype() (or dtype="Float64", which gives the same result).
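This failure mode reduces to a few lines; a hedged sketch (on the versions reported in this thread it raises ValueError, though behavior may differ on other versions, so the code tolerates both outcomes):

```python
from io import StringIO

import pandas as pd

csv = "a;b\nBW08;0,24\n"

try:
    # On the affected versions, the c engine cannot combine
    # decimal="," with a nullable dtype for the same column.
    df = pd.read_csv(StringIO(csv), sep=";", decimal=",",
                     dtype={"b": "Float64"}, engine="c")
    outcome = "parsed"
except ValueError as exc:
    # e.g. ValueError: Unable to parse string "0,24" at position 0
    outcome = f"failed: {exc}"

print(outcome)
```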
Regarding the test suite, there is a single test function that exercises the c engine with the delimiter parameter; it can be run in a development environment with
> pytest pandas/tests/io/parser/test_c_parser_only.py::test_1000_sep_with_decimal
Within this test, adding the argument dtype=pd.Float64Dtype() to the read_csv() call produces exactly the same error. In my opinion this issue and #52086 are the same, while the other issues @MCRE-BE mentioned relate to different problems in the code. #52086 provides a better minimal reproducible example.
If relevant, I can create a PR with a test for this issue (marked as failing).
Anyway, one workaround is to use any engine other than c.
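A minimal sketch of that workaround using the python engine (the pyarrow engine may also work, but is untested here):

```python
from io import StringIO

import pandas as pd

csv = "Site;Weight MD\nBW08;0,24\nBW01;0\n"

# The python engine handles decimal="," together with nullable dtypes,
# whereas the c engine fails on the affected versions.
df = pd.read_csv(StringIO(csv), sep=";", decimal=",",
                 dtype={"Site": "string", "Weight MD": "Float64"},
                 engine="python")
```

The python engine is noticeably slower than the c engine, so this is a stopgap rather than a fix.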