BUG: Raise on parse int overflow #47167
- [x] closes #47167
- [x] Tests added and passed if fixing a bug or adding a new feature
- [x] All code checks passed.
- [x] Added an entry in the latest
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.
@github-actions pre-commit
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.
I am still interested in this. I just added changes that address the requests raised by @jreback and @simonjayhawkins. Local tests run successfully.
I decided to add the check for engine="python" as outlined above inside the base parser (the only difference being that the check is also necessary for a user-specified uint64 dtype). Concerning the question I raised about adding an option to check for overflows in astype_nansafe: I came to the conclusion that this is probably unnecessary. The only relevant situation where I expect to consistently fail on insufficiently specified dtypes is parsing, i.e. exactly what this PR handles.
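To illustrate the kind of check I mean, here is a hypothetical standalone sketch (not the actual base-parser code): validate every parsed value against the bounds of the requested integer dtype before casting, so an out-of-range value raises instead of wrapping around.

```python
import numpy as np

def safe_int_cast(values, dtype):
    # Hypothetical helper, not pandas internals: reject parsed values
    # that do not fit the requested integer dtype instead of letting
    # them wrap around silently.
    info = np.iinfo(dtype)
    out = []
    for v in values:
        iv = int(v)
        if not (info.min <= iv <= info.max):
            raise OverflowError(
                f"Parsed value {v} does not fit in {np.dtype(dtype)}"
            )
        out.append(iv)
    return np.array(out, dtype=dtype)

print(safe_int_cast(["1", "2", "127"], "int8"))  # all values fit int8
try:
    safe_int_cast(["128"], "int8")               # 128 > np.iinfo("int8").max
except OverflowError as exc:
    print(exc)
```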
Please reopen and review.
Thanks @mroeschke for reopening.
I just realized that the changes in this PR have a good side effect.
Until now, implicit lossy coercion of float to int could occur for engine="python" with a non-extension integer dtype, i.e.
>>> import pandas as pd
>>> from io import StringIO
>>> pd.read_csv(StringIO("0.1"), header=None, dtype="int", engine="python").squeeze().item()
0
while these similar variants already raise errors
>>> pd.read_csv(StringIO("0.1"), header=None, dtype="Int64", engine="python")
Traceback (most recent call last):
...
TypeError: cannot safely cast non-equivalent float64 to int64
>>> pd.read_csv(StringIO("0.1"), header=None, dtype="int", engine="c")
Traceback (most recent call last):
...
ValueError: cannot safely convert passed user dtype of int64 for float64 dtyped data in column 0
The changes in this PR make all of the above variants raise an error, so that the "c" and "python" engines behave consistently.
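As a hypothetical sketch of the safety check involved (not the actual parser code), the idea is a round-trip test: cast the floats to the target integer dtype, cast back, and raise if any value changed.

```python
import numpy as np

def checked_float_to_int(values, dtype):
    # Hypothetical sketch, not pandas internals: only allow a
    # float-to-int cast when it is lossless, mirroring the error
    # both engines now raise for data like 0.1 with dtype="int".
    arr = np.asarray(values, dtype="float64")
    casted = arr.astype(dtype)
    if not np.array_equal(casted.astype("float64"), arr):
        raise ValueError(
            f"cannot safely convert float64 data to {np.dtype(dtype)}"
        )
    return casted

print(checked_float_to_int([1.0, 2.0], "int64"))  # exact values: fine
try:
    checked_float_to_int([0.1], "int64")          # lossy: raises
except ValueError as exc:
    print(exc)
```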
Two remarks concerning this:
- I had to change a test on `read_fwf` that assumed lossy float-to-int coercion. As far as I can see, the exact behavior of float-to-int coercion via the `dtype` parameter is not explicitly documented for `read_csv`, `read_fwf`, or `read_table`, so the present possibility of lossy coercion to int is not guaranteed. Imho, consistent and safe behavior is better.
- `engine="pyarrow"` silently performs lossy float-to-int coercion and allows integer overflows.
This PR might close https://github.com/pandas-dev/pandas/issues/38013
I think it is a good idea to perform the same check for `engine="pyarrow"` after this cast. This is a bit more cumbersome, since the complete frame is cast with `astype`, and e.g. using the safer `maybe_cast_to_integer_array` would need a bit of boilerplate.
Additionally, pyarrow returns values outside the int64 range as float. Having a check here would guard the user against the pitfall that `engine="pyarrow", dtype="uint64"` seems to return precise values in the range (max_int64, max_uint64] but actually contains errors due to the intermediate float representation, e.g.
>>> import numpy as np
>>> val = np.iinfo('int64').max + 1025
>>> val - pd.read_csv(StringIO(f"A\n{val}"), engine="pyarrow", dtype="uint64").squeeze().item()
1024
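For completeness, the 1024 error above is plain float64 rounding and can be reproduced without pandas or pyarrow: near 2**63 the spacing between representable float64 values is 2**(63 - 52) = 2048, so 2**63 + 1024 lies exactly halfway between two representable values and rounds (half-to-even) down to 2**63.

```python
# Reproducing the pyarrow round-trip error with plain Python floats:
# float64 has a 53-bit significand, so adjacent representable values
# near 2**63 are 2048 apart.
val = (2**63 - 1) + 1025   # max int64 + 1025, i.e. 2**63 + 1024
as_float = float(val)      # rounds to the nearest float64: exactly 2**63
recovered = int(as_float)
print(val - recovered)     # 1024, matching the read_csv example above
```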
I will move this topic into a new issue to discuss and document it separately, as soon as I have some time available. I think the scope of this PR is large enough as it is.