ydata-profiling IndexError: index 3098 is out of bounds for axis 0 with size 3098 on Jupyter

Open snknitin opened this issue 3 years ago • 1 comments

Current Behaviour

I have a large data frame (about 900k rows). Since I kept getting IndexErrors on version 3.1.0, I tried three things based on the similar issues I found:

1. Imputed all NaNs with the median, so there are no nulls or NaNs left
2. Created a small random sample of the data frame, taking only 5% of the data: df.sample(frac=0.05, random_state=3407)
3. Upgraded to 3.2.0
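
A minimal sketch of the first two steps, using a small hypothetical stand-in frame rather than the real data (the column names, sizes, and the way NaNs are injected are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real 900k-row frame.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "a": rng.normal(size=1000),
        "b": rng.normal(loc=5.0, size=1000),
    }
)
df.loc[::7, "a"] = np.nan  # inject some missing values

# Step 1: replace NaNs in each numeric column with that column's median.
imputed = df.fillna(df.median(numeric_only=True))

# Step 2: keep a reproducible 5% random sample.
sampleData = imputed.sample(frac=0.05, random_state=3407)

print(imputed.isna().sum().sum(), len(sampleData))
```

The sample is what then gets passed to ProfileReport(sampleData).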

My pandas version is 1.3.4

This is the error I get when I run profile = ProfileReport(sampleData) (which executes quickly) and then try to view the report with profile.to_notebook_iframe() (which runs through the stages and then errors out):

lib/python3.7/site-packages/numpy/lib/histograms.py in histogram(a, bins, range, normed, weights, density)
    854             # The index computation is not guaranteed to give exactly
    855             # consistent results within ~1 ULP of the bin edges.
--> 856             decrement = tmp_a < bin_edges[indices]
    857             indices[decrement] -= 1
    858             # The last bin includes the right edge. The other bins do not.

IndexError: index 3098 is out of bounds for axis 0 with size 3098

Expected Behaviour

profile.to_notebook_iframe() should display the profile report for the data frame.

Data Description

The dataset has 984289 rows × 118 columns, with dtypes float16(20), int16(44), int8(51), object(3), and memory usage: 190.6+ MB

Code that reproduces the bug

No response

pandas-profiling version

3.2.0

Dependencies

pandas==1.3.4
numpy==1.20.3

OS

Mac OS BigSur 11.6.7

Checklist

[X] There is not yet another bug report for this issue in the issue tracker
[X] The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
[X] The issue has not been resolved by the entries listed under Common Issues.

Aug 29 '22 11:08 snknitin

Update:

It seems to fail only on the columns that had their NaNs imputed. After removing those columns, the profile report got further and no longer raised the IndexError.
On further inspection, I noticed that those columns had been cast to float16 by a reduce-memory utility function.
There is probably an overflow issue at work here, which would explain why the numbers in the IndexError sometimes go crazy (as seen in some other bug reports and issues) and why the error persists even after reducing the size of the data frame.
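
A possible workaround, sketched under the assumption that the float16 cast really is the cause: upcast the half-precision columns before profiling. The helper name upcast_half_floats and the toy frame are hypothetical. The first lines only illustrate how coarse float16 is near the index from the traceback; they do not prove what happens inside numpy's histogram code.

```python
import numpy as np
import pandas as pd

# float16 cannot represent every integer above 2048, so a value near
# 3097 is rounded to a neighbouring representable number.
half_value = np.float16(3097)
gap = np.spacing(np.float16(3096))   # distance to the next float16
half_max = np.finfo(np.float16).max  # largest finite float16
print(half_value, gap, half_max)


def upcast_half_floats(df: pd.DataFrame, target: str = "float32") -> pd.DataFrame:
    """Return a copy of df with every float16 column cast to `target`."""
    half_cols = df.select_dtypes(include=["float16"]).columns
    return df.astype({col: target for col in half_cols})


# Toy frame mimicking the output of a reduce-memory utility.
rng = np.random.default_rng(3407)
small = pd.DataFrame(
    {
        "x": rng.normal(size=500).astype("float16"),
        "n": rng.integers(0, 100, size=500).astype("int8"),
    }
)

fixed = upcast_half_floats(small)

# Histogram on the upcast column, the step that failed in the traceback.
counts, edges = np.histogram(fixed["x"], bins=50)
print(fixed.dtypes.to_dict(), counts.sum())
```

Only the float16 columns are touched, so the int8 and int16 columns keep their memory savings. Whether this removes the IndexError on the real data has not been verified here.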

Aug 29 '22 15:08 snknitin
