BUG: read_csv failure to convert dtype is not considered a 'bad line'

Open thijssnelleman opened this issue 1 month ago • 5 comments

Pandas version checks

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import io

csv_text = "column1,column2,column3\n1,2,3\nIAMAWRONGLINE\na,4,5"
buffer = io.StringIO(csv_text)

df = pd.read_csv(buffer, header=0, on_bad_lines="skip", dtype={"column1": int, "column2": int, "column3": int})

"""Output:
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1161, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/snelleman/.venv/sparkle/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/snelleman/.venv/sparkle/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 626, in _read
    return parser.read(nrows)
  File "/home/snelleman/.venv/sparkle/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/home/snelleman/.venv/sparkle/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 838, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 921, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1066, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1167, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: invalid literal for int() with base 10: 'IAMAWRONGLINE'
"""

Issue Description

I would expect in this case that the line would be skipped as it does not comply with the formatting. In a similar situation I got the error message:

    raise ValueError("Trying to coerce float values to integers")
ValueError: Trying to coerce float values to integers

or

    raise IntCastingNaNError(
pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

Am I misunderstanding how this argument works? In my case it would be very useful to skip these bad lines as well! :)

Expected Behavior

I would expect the on_bad_lines callable to be triggered by these issues, as not complying with the dtypes is in my opinion a bad line. Perhaps the Pandas team has a different view?
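For context, here is a small sketch of what the argument appears to cover today, assuming the documented behavior that a "bad line" is one with too many fields and that a callable is only supported with engine="python". The sample data and the `record` helper are made up for illustration and are not part of the original report.

```python
import io

import pandas as pd

# Hypothetical sample: one row with too many fields, one row with too few.
sample = "a,b,c\n1,2,3\n4,5,6,7\nx\n"

seen = []


def record(bad_line):
    """Remember the offending row and return None so that it is dropped."""
    seen.append(bad_line)
    return None


# A callable for on_bad_lines is only accepted by the Python engine.
frame = pd.read_csv(io.StringIO(sample), engine="python", on_bad_lines=record)

print(seen)   # rows that were handed to the callable
print(frame)  # rows that made it into the DataFrame
```

In this sketch only the row with four fields reaches the callable. The short row `x` is padded with missing values and kept, which is why a later dtype conversion can still fail.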

Installed Versions

INSTALLED VERSIONS

commit : 9c8bc3e55188c8aff37207a74f1dd144980b8874
python : 3.10.8
python-bits : 64
OS : Linux
OS-release : 5.14.0-427.16.1.el9_4.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Wed May 8 17:48:14 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.3.3
numpy : 1.26.4
pytz : 2025.2
dateutil : 2.9.0.post0
pip : 22.2.2
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2025.9.0
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.6
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.15.3
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2025.2
qtpy : None
pyqt5 : None

thijssnelleman avatar Nov 21 '25 12:11 thijssnelleman

Looks like this issue exists for all engines (C, Python and PyArrow). The main problem is that it reads the value as a string, then it tries to convert to integer with the astype method, without handling the error in case of bad lines.

This issue feels similar to one reported in Arrow: https://github.com/apache/arrow/issues/32163.
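Until this is handled inside the parser, one possible workaround is to do the conversion in a separate step after reading. The sketch below is only an illustration and not a proposed fix: the helper name `read_csv_drop_unconvertible` is hypothetical, and it assumes that every row which fails integer conversion in any of the listed columns should be discarded.

```python
import io

import pandas as pd

csv_text = "column1,column2,column3\n1,2,3\nIAMAWRONGLINE\na,4,5"


def read_csv_drop_unconvertible(buffer, int_columns):
    """Read a CSV as raw text, then keep only rows whose values are integers.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Read everything as objects so that no dtype conversion happens in the parser.
    raw = pd.read_csv(buffer, header=0, dtype=object, on_bad_lines="skip")

    # Values that cannot be parsed as numbers become NaN instead of raising.
    numeric = raw[int_columns].apply(pd.to_numeric, errors="coerce")

    # Keep rows with no missing values and no fractional or non-finite values.
    valid = numeric.notna().all(axis=1) & (numeric % 1 == 0).all(axis=1)

    return numeric[valid].astype("int64").reset_index(drop=True)


clean = read_csv_drop_unconvertible(
    io.StringIO(csv_text), ["column1", "column2", "column3"]
)
print(clean)
```

With the data from the report, only the row `1,2,3` survives, since both other rows contain at least one value that cannot be read as an integer. The cost of this approach is a second pass over the data and the loss of the parser's native integer parsing.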

Alvaro-Kothe avatar Nov 22 '25 16:11 Alvaro-Kothe

@Alvaro-Kothe is it alright if I work on this issue?

nejail avatar Nov 22 '25 19:11 nejail

@nejail Sure. Go ahead.

Alvaro-Kothe avatar Nov 22 '25 20:11 Alvaro-Kothe

take

anishkarki avatar Nov 23 '25 01:11 anishkarki

Hi, I’d like to help fix this issue. Is it okay if I work on it? @Alvaro-Kothe

Aokizy2 avatar Dec 06 '25 08:12 Aokizy2