BUG: on_bad_lines=callable does not invoke callable for all bad lines

Open indigoviolet opened this issue 3 years ago • 0 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [29]:
import pandas as pd
pd.__version__
Out [29]:
'1.4.3'

In [30]:
len(open("bad.csv").readlines())
Out [30]:
3

In [31]:
df1 = pd.read_csv("bad.csv", on_bad_lines='warn', engine='python')
Skipping line 3: ',' expected after '"'


In [32]:
df2 = pd.read_csv("bad.csv", on_bad_lines=print, engine='python')

In [33]:
len(df1), len(df2)
Out [33]:
(1, 1)

Issue Description

The data file (shown below) has a header plus two rows. Row 2 is valid, row 3 is bad.

For df1, I'm setting on_bad_lines='warn', and I see a warning for line 3.

For df2, I'm passing on_bad_lines=print, and I don't see any prints: the bad line is silently skipped.

❯ cat bad.csv
country,founded,id,industry,linkedin_url,locality,name,region,size,website
united states,"",heritage-equine-equipment-llc,farming,linkedin.com/company/heritage-equine-equipment-llc,"",heritage equine equipment llc,"",1-10,heritageequineequip.com
chile,"",contacto-corporación-colina,hospital & health care,linkedin.com/company/contacto-corporación-colina,colina,"contacto \" corporación colina",santiago metropolitan,11-50,corporacioncolina.cl

Expected Behavior

I would expect the bad line to be printed in the second case.

Installed Versions

pd.show_versions()

INSTALLED VERSIONS

commit : e8093ba372f9adfe79439d90fe74b0b5b6dea9d6
python : 3.9.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-49-generic
Version : #55-Ubuntu SMP Wed Jan 12 17:36:34 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.3
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 60.6.0
pip : 22.0.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

/home/venky/dev/instant-science/explore/.venv/lib/python3.9/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

indigoviolet, Aug 10 '22 05:08

Hi, thanks for your report. I can reproduce this too.

Could you try simplifying the CSV file? It's hard to see what's going on in there right now.

phofl, Aug 10 '22 08:08

This may be working as expected if I am looking at your csv file correctly.

As the docs state:

Specifies what to do upon encountering a bad line (a line with too many fields).

And I think each line has the same number of elements?
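For contrast, here is a minimal sketch of the case the docs describe, a row with too many fields, where the callable is expected to be invoked. The inline data and the `collect` helper are made up for illustration, and the sketch assumes pandas 1.4 or later with engine='python' (callables are only accepted by that engine).

```python
import io

import pandas as pd

# Hypothetical data: the second data row has three fields under a two-column header.
data = "a,b\n1,2\n3,4,5\n6,7\n"

bad_rows = []


def collect(fields):
    """Record the offending row (a list of strings) and return None to skip it."""
    bad_rows.append(fields)
    return None


df = pd.read_csv(io.StringIO(data), on_bad_lines=collect, engine="python")

print(bad_rows)
print(len(df))
```

Returning None from the callable drops the row; returning a list of strings would keep a repaired version of it instead.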

mroeschke, Aug 10 '22 17:08

  1. Which lines are considered bad should not be different between 'warn' and print.

  2. I would expect all skipped lines to be denoted bad, and for the callable to be able to handle all of them.

indigoviolet, Aug 10 '22 18:08

> Hi, thanks for your report. I can reproduce this too.
>
> Could you try simplifying the CSV file? It's hard to see what's going on in there right now.

Here's a simplified version:

❯ cat bad2.csv
country,name
united states,heritage equine equipment llc
chile,"contacto \" corporación colina"

Setting escapechar='\\' allows the second data row to be read, but the bug as reported (different behavior between 'warn' and print) is still valid.
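To see why that row is rejected and why escapechar helps, the row can be fed to Python's standard-library csv reader directly. This is only a sketch: it assumes the python engine tokenizes with the csv module in strict mode, which the matching warning text suggests but which is not confirmed in this thread. The `parse` helper is hypothetical.

```python
import csv

# The problematic row from bad2.csv: a backslash-escaped quote inside a quoted field.
bad_row = 'chile,"contacto \\" corporación colina"'


def parse(line, **dialect_options):
    """Parse a single CSV line with the standard-library reader."""
    return next(csv.reader([line], **dialect_options))


# Without an escape character the backslash is ordinary text, so the quote after it
# closes the field and the following space is illegal in strict mode.
try:
    parse(bad_row, strict=True)
    strict_error = None
except csv.Error as exc:
    strict_error = str(exc)

# With escapechar set, the backslash escapes the quote and the row parses cleanly.
escaped = parse(bad_row, strict=True, escapechar="\\")

print(strict_error)
print(escaped)
```

This only explains why the row is malformed. It does not explain why 'warn' reports the skipped line while a callable is never called for it.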

indigoviolet, Aug 10 '22 18:08