ydata-profiling
Profile Report Warning
I have this dataset - SBAnational.csv from https://www.kaggle.com/datasets/mirbektoktogaraev/should-this-loan-be-approved-or-denied?select=SBAnational.csv
Describe the bug
Calling the report without assigning it causes no problem:
sba.profile_report(title='SBA Pandas Profiling Report', progress_bar=True)
However, if I save the report to a variable:
profile = sba.profile_report(title='SBA Pandas Profiling Report', progress_bar=True)
it shows a RuntimeWarning:
C:\ProgramData\Anaconda3\lib\site-packages\scipy\stats\stats.py:4812: RuntimeWarning: overflow encountered in longlong_scalars
(2 * xtie * ytie) / m + x0 * y0 / (9 * m * (size - 2)))
I traced this to the correlation computation using Kendall's. If I exclude Kendall's, the warning does not appear.
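For a sense of scale, here is a rough sketch of why the product in the warning line could exceed a 64-bit integer. It assumes that xtie and ytie count the pairs of rows sharing a value in each column and that they are held as 64-bit integers (which the "longlong_scalars" wording suggests). The column shapes and the tied_pairs helper are hypothetical and not taken from the SBA data.

```python
from collections import Counter

INT64_MAX = 2**63 - 1  # largest value a signed 64-bit integer can hold


def tied_pairs(values):
    """Count pairs of rows that share the same value: sum of c*(c-1)/2 per group."""
    return sum(c * (c - 1) // 2 for c in Counter(values).values())


# Hypothetical columns: 900,000 rows with only a few distinct values each
n = 900_000
x = [i % 2 for i in range(n)]  # two distinct values
y = [i % 3 for i in range(n)]  # three distinct values

xtie = tied_pairs(x)
ytie = tied_pairs(y)

# Python integers never overflow, so the exact product can be compared to the limit
product = 2 * xtie * ytie
overflows = product > INT64_MAX

print(f"xtie={xtie:,} ytie={ytie:,}")
print(f"2 * xtie * ytie = {product:.3e}, int64 max = {INT64_MAX:.3e}")
print("would overflow int64:", overflows)
```

Under these assumptions, any large column with few distinct values (as categorical-looking columns tend to have) produces tie counts whose product is far beyond the 64-bit range.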
To Reproduce
Just download dataset from the link above and run the commands shown.
import pandas as pd
from pandas_profiling import ProfileReport

if __name__ == "__main__":
    df = pd.read_csv("SBAnational.csv", engine="python", on_bad_lines='skip')
    profile = ProfileReport(df, title="Pandas Profiling Report")
    profile.to_notebook_iframe()
@josephramon Thanks for reporting this issue. With the latest version of the package and the code below, I could not reproduce the issue (Kendall's is disabled, and no warning). The line that runs without problems does no work: the package is lazy and computes nothing until an action is taken, such as printing the report or writing it to a file.
With the default type inference, all variables are recognized as categoricals. Other correlation matrices are still computed (e.g. PhiK), so the necessary insights are available to the user.
It would be worthwhile to dive deeper into this dataset to see whether the root cause of your issue can be found. For now, other features and bugs take priority. Help is welcome, however.
import pandas as pd
import matplotlib

# Needed on Colab for this dataset
matplotlib.use('Agg')

from pandas_profiling import ProfileReport

if __name__ == "__main__":
    df = pd.read_csv("SBAnational.csv", engine="python", on_bad_lines='skip')
    profile = ProfileReport(df, title="Pandas Profiling Report", plot={'image_format':'png'}, explorative=True)
    profile.to_file('test.html')
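To rule Kendall's out explicitly rather than relying on defaults, a minimal sketch is to pass a correlations override when building the report. The constant and function names here are hypothetical. The override follows the pattern in the pandas-profiling 3.x settings documentation, so check it against the installed version.

```python
# Correlation override that switches Kendall's tau off
CORRELATIONS_WITHOUT_KENDALL = {"kendall": {"calculate": False}}


def build_report(df, title="Pandas Profiling Report"):
    """Build a ProfileReport for df with Kendall's correlation disabled."""
    # Imported inside the function so the settings above can be inspected
    # even where pandas_profiling is not installed.
    from pandas_profiling import ProfileReport

    return ProfileReport(df, title=title, correlations=CORRELATIONS_WITHOUT_KENDALL)
```

Usage would look like build_report(df).to_file('test.html'). The report is only computed when to_file is called.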
Today I updated to pandas-profiling v3.3.0 on Windows with Python 3.9.12 because another issue had been fixed in the library, but I now get the same RuntimeWarning that Joseph Ramon reported. I am using the dataset from https://unfallatlas.statistikportal.de/_opendata2022.html (download "Unfallorte 2021 - CSV-Format (zip)"), read in with delimiter ';'.
Preprocessing code to change the German number formats to English ones:
convert_dict = {
    'XGCSWGS84': float,
    'YGCSWGS84': float,
    'LINREFX': float,
    'LINREFY': float,
}
crash_df['XGCSWGS84'] = crash_df['XGCSWGS84'].apply(lambda x: x.replace('.','').replace(',','.'))
crash_df['YGCSWGS84'] = crash_df['YGCSWGS84'].apply(lambda x: x.replace('.','').replace(',','.'))
crash_df['LINREFX'] = crash_df['LINREFX'].apply(lambda x: x.replace('.','').replace(',','.'))
crash_df['LINREFY'] = crash_df['LINREFY'].apply(lambda x: x.replace('.','').replace(',','.'))
crash_df[['XGCSWGS84', 'YGCSWGS84', 'LINREFX', 'LINREFY']] = crash_df[
    ['XGCSWGS84', 'YGCSWGS84', 'LINREFX', 'LINREFY']].astype(convert_dict)
Report:
crash_2021_profile = ProfileReport(crash_df, title='Destatis Unfallstatistik 2021 Profiling Report')
crash_2021_profile.to_file('./reports/unfallstatistik2021_profiling_report.html')
But after looking at the terminal, I see another message:
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
C:\anaconda\Anaconda3\envs<...>PoC\lib\site-packages\scipy\__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.1
Could that be the root cause?
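To make the range in that message concrete, here is a small sketch of the comparison it describes. The helper names are hypothetical and this is not SciPy's actual check. It only handles plain numeric versions such as "1.23.1".

```python
def parse_version(text):
    """Turn '1.23.1' into (1, 23, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def in_supported_range(detected, minimum="1.16.5", upper="1.23.0"):
    """Return True when minimum <= detected < upper."""
    return parse_version(minimum) <= parse_version(detected) < parse_version(upper)


print(in_supported_range("1.23.1"))  # the detected version from the message
print(in_supported_range("1.22.4"))  # a release just below the upper bound
```

The detected version 1.23.1 falls outside the range, while a release such as 1.22.4 falls inside it.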
Hi @IloBe, this looks like an error in the pandas-profiling dependencies. Can you install numpy==1.22.4 and check whether it runs correctly?
@alexbarros I tried it with numpy==1.22.4, but got the same warning:
RuntimeWarning: overflow encountered in longlong_scalars
(2 * xtie * ytie) / m + x0 * y0 / (9 * m * (size - 2)))
So the NumPy version is not the root cause.