ydata-profiling
IndexError: index 3098 is out of bounds for axis 0 with size 3098 on Jupyter
Current Behaviour
I have a large data frame (900k rows). Since I received IndexErrors multiple times on version 3.1.0, I tried three things based on the similar issues:
- Imputed all NaNs with the median, so there are no nulls or NaNs now
- Created a small random sample of the data frame, taking only 5% of the data: `df.sample(frac=0.05, random_state=3407)`
- Upgraded to 3.2.0
My pandas version is 1.3.4
This is the error I get when I run `profile = ProfileReport(sampleData)` (which executes quickly) and then try to view it with `profile.to_notebook_iframe()` (which runs the stages and then errors out):
lib/python3.7/site-packages/numpy/lib/histograms.py in histogram(a, bins, range, normed, weights, density)
854 # The index computation is not guaranteed to give exactly
855 # consistent results within ~1 ULP of the bin edges.
--> 856 decrement = tmp_a < bin_edges[indices]
857 indices[decrement] -= 1
858 # The last bin includes the right edge. The other bins do not.
IndexError: index 3098 is out of bounds for axis 0 with size 3098
Expected Behaviour
`profile.to_notebook_iframe()` should display the profile report of the dataframe.
Data Description
The dataset has 984289 rows × 118 columns with dtypes float16 (20), int16 (44), int8 (51), and object (3). Memory usage: 190.6+ MB.
Code that reproduces the bug
No response
pandas-profiling version
3.2.0
Dependencies
pandas==1.3.4
numpy==1.20.3
OS
macOS Big Sur 11.6.7
Checklist
- [X] There is not yet another bug report for this issue in the issue tracker
- [X] The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
- [X] The issue has not been resolved by the entries listed under Common Issues.
Update:
- It seemed to fail only on the columns that had their NaNs imputed. Removing those columns let the profile report go further without the IndexError.
- Upon further inspection, I noticed those columns had been cast to `float16` by a reduce-memory utility function.
- There is probably an overflow issue at work here, which would explain why the numbers in the IndexError sometimes go crazy (as seen in some other bug reports and issues) and why the error persists even after reducing the size of the dataframe.
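
If the `float16` columns are indeed the trigger, one possible workaround is to upcast them to `float32` before building the report. The snippet below is a minimal sketch of that idea, not a confirmed fix: `upcast_float16` is a hypothetical helper name, `demo` is a made-up stand-in for the real data frame, and I have not verified that this removes the IndexError in every case.

```python
import numpy as np
import pandas as pd


def upcast_float16(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every float16 column cast to float32.

    Hypothetical helper, written for illustration only.
    """
    half_cols = df.select_dtypes(include=["float16"]).columns
    return df.astype({col: np.float32 for col in half_cols})


# float16 has a very small range: its largest finite value is 65504.
print(np.finfo(np.float16).max)

# Made-up stand-in for the real data frame.
demo = pd.DataFrame({
    "a": np.array([1.5, 2.5, 1000.0], dtype=np.float16),
    "b": np.array([1, 2, 3], dtype=np.int8),
})

safe = upcast_float16(demo)
print(safe.dtypes)  # "a" is now float32, "b" is left as int8
```

The upcast frame can then be passed on as before, for example `ProfileReport(upcast_float16(sampleData))`. This costs some of the memory saved by the reduce-memory utility, but only for the columns that were stored as `float16`.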