pandas icon indicating copy to clipboard operation
pandas copied to clipboard

BUG: Raise on parse int overflow #47167

Open SandroCasagrande opened this issue 3 years ago • 8 comments

  • [x] closes #47167
  • [x] Tests added and passed if fixing a bug or adding a new feature
  • [x] All code checks passed.
  • [x] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

SandroCasagrande avatar May 30 '22 09:05 SandroCasagrande

@github-actions pre-commit

SandroCasagrande avatar May 30 '22 09:05 SandroCasagrande

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

github-actions[bot] avatar Jul 07 '22 00:07 github-actions[bot]

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

mroeschke avatar Jul 22 '22 17:07 mroeschke

I am still interested in this. I just added changes that adress requests raised by @jreback and @simonjayhawkins. Local tests run successfully.

I decided to add the check for engine="python" as I outlined above inside the base parser (with the only difference being that the check is also necessary for user dtype uint64). Concerning the question I raised about having an option to check for overflows in astype_nansafe, I came to the conclusion that this is probably unnecessary. The only relevant situation where I expect to consistently fail on insufficiently specified dtypes is parsing, i.e. what is handled in this PR.

Please reopen and review.

SandroCasagrande avatar Jul 23 '22 23:07 SandroCasagrande

Thanks @mroeschke for reopening.

SandroCasagrande avatar Jul 24 '22 22:07 SandroCasagrande

I just recognized that the changes in this PR has a good side effect.

Up to now, implicity lossy coercion of float to int could occur for engine="python" with non-extension integer dtype, i.e.

>>> import pandas as pd
>>> from io import StringIO
>>> pd.read_csv(StringIO("0.1"), header=None, dtype="int", engine="python").squeeze().item()
0

while these similar variants already raise errors

>>> pd.read_csv(StringIO("0.1"), header=None, dtype="Int64", engine="python")
Traceback (most recent call last):
  ...
TypeError: cannot safely cast non-equivalent float64 to int64
>>> pd.read_csv(StringIO("0.1"), header=None, dtype="int", engine="c")
Traceback (most recent call last):
  ...
ValueError: cannot safely convert passed user dtype of int64 for float64 dtyped data in column 0

The changes in this PR let all of the above variants raise an error, such that the "c" and "python" engine will behave consistently.

Two remarks concerning this:

  • I had to change a test on read_fwf that assumed lossy float to int coercion. The exact behavior of float to int coercion due to given dtype parameter is not explicitly documented for read_csv, read_fwf, and read_table as far as I can see. So the present possibility of lossy coercion to int is not guaranteed. Imho a consistent, and safe behavior is better.
  • The behavior of engine="pyarrow" is to perform silent lossy float to int coercion and integer overflows.

SandroCasagrande avatar Jul 24 '22 22:07 SandroCasagrande

This PR might close https://github.com/pandas-dev/pandas/issues/38013

SandroCasagrande avatar Jul 25 '22 22:07 SandroCasagrande

I think it is a good idea to perform the same check for engine="pyarrow"after this this cast. This is a bit more cumbersome, since the complete frame is casted with astype and e.g. using the safer maybe_cast_to_integer_array would need a bit of boilerplate.

Additionally, pyarrow returns values outside int64 range as float. Having a check here can guard the user against the pitfall that engine="pyarrow", dtype="uint64" seems to return precise values in the range (max_int64, max_uint64] but actually has errors due to the intermediate float representation, e.g.

>>> val = np.iinfo('int64').max + 1025
>>> val - pd.read_csv(StringIO(f"A\n{val}"), engine="pyarrow", dtype="uint64").squeeze().item()
1024

I will move this topic into a new issue in order to discuss and document it separately, as soon as I have some time available. I guess the scope of this PR is large enough, as it is.

SandroCasagrande avatar Jul 25 '22 22:07 SandroCasagrande

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

mroeschke avatar Jan 12 '23 23:01 mroeschke