Profile Report Warning

Open josephramon opened this issue 2 years ago • 4 comments

I have this dataset - SBAnational.csv from https://www.kaggle.com/datasets/mirbektoktogaraev/should-this-loan-be-approved-or-denied?select=SBAnational.csv

Describe the bug

No problem with this line:

sba.profile_report(title='SBA Pandas Profiling Report', progress_bar=True)

However, if I save the result to a variable:

profile = sba.profile_report(title='SBA Pandas Profiling Report', progress_bar=True)

it shows a RuntimeWarning:

C:\ProgramData\Anaconda3\lib\site-packages\scipy\stats\stats.py:4812: RuntimeWarning: overflow encountered in longlong_scalars
  (2 * xtie * ytie) / m + x0 * y0 / (9 * m * (size - 2)))

I traced this to the Kendall's tau correlation. If I don't include Kendall's, there is no problem.
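One way to leave out Kendall's tau while keeping the other correlations is the `correlations` setting. The following is a minimal sketch, assuming the installed pandas-profiling version accepts a `correlations` dictionary with a `calculate` flag per method (as documented for the 3.x releases). The helper name `build_report_without_kendall` is made up for illustration.

```python
# Correlation settings: switch Kendall's tau off and leave the rest at their defaults.
# Assumption: the installed version accepts this dictionary layout.
CORRELATIONS_WITHOUT_KENDALL = {"kendall": {"calculate": False}}


def build_report_without_kendall(df, title="SBA Pandas Profiling Report"):
    """Hypothetical helper: build a profile report that skips Kendall's tau."""
    # Imported here so the settings above can be inspected without the package.
    from pandas_profiling import ProfileReport

    return ProfileReport(df, title=title, correlations=CORRELATIONS_WITHOUT_KENDALL)
```

With a DataFrame loaded as `sba`, this would be called as `profile = build_report_without_kendall(sba)` followed by `profile.to_file("report.html")`.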

To Reproduce

Just download the dataset from the link above and run the commands shown.

import pandas as pd


from pandas_profiling import ProfileReport


if __name__ == "__main__":
    df = pd.read_csv("SBAnational.csv",  engine="python", on_bad_lines='skip')
    profile = ProfileReport(df, title="Pandas Profiling Report")
    profile.to_notebook_iframe()

josephramon avatar Mar 29 '22 08:03 josephramon

@josephramon Thanks for reporting this issue. With the latest version of the package and the code below, I could not reproduce it (Kendall's is disabled, and there is no warning). The line that runs without problems does not actually compute anything: the package is lazy and does no work until an action is taken, such as printing the report or writing it to a file.

With the default type inference, all variables are recognized as categorical. Other correlation matrices (e.g. PhiK) are still computed, so the necessary insights are available to the user.

It would be worthwhile to dive deeper into this dataset to see whether the root cause of your issue can be found. For now, other features and bugs take priority. Help is welcome, however.

import pandas as pd

import matplotlib
# Needed on Colab for this dataset
matplotlib.use('Agg')

from pandas_profiling import ProfileReport


if __name__ == "__main__":
    df = pd.read_csv("SBAnational.csv",  engine="python", on_bad_lines='skip')
    profile = ProfileReport(df, title="Pandas Profiling Report", plot={'image_format':'png'}, explorative=True)
    profile.to_file('test.html')

sbrugman avatar May 07 '22 19:05 sbrugman
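The lazy behaviour described in the reply above can be illustrated with a small stand-in class. This is only a sketch of the general pattern, not the library's implementation: `LazyReport` and its methods are hypothetical names, and the "summary" is a trivial placeholder for the real computations.

```python
class LazyReport:
    """Hypothetical stand-in for a lazily evaluated report (not the library's class)."""

    def __init__(self, data):
        self.data = data
        self.compute_calls = 0
        self._summary = None  # nothing is computed at construction time

    def _compute(self):
        # Placeholder for the expensive work (statistics, correlations, plots).
        self.compute_calls += 1
        return {"n": len(self.data), "mean": sum(self.data) / len(self.data)}

    @property
    def summary(self):
        # Compute on first access only, then reuse the cached result.
        if self._summary is None:
            self._summary = self._compute()
        return self._summary

    def to_text(self):
        # Rendering is the "action" that triggers the computation.
        s = self.summary
        return f"n={s['n']} mean={s['mean']:.2f}"


report = LazyReport([1, 2, 3, 4])
calls_after_construction = report.compute_calls  # still zero here
text = report.to_text()
calls_after_render = report.compute_calls        # the work happened during rendering
```

Under this pattern, a warning raised inside the computation only appears once something renders or exports the report, not when the object is created.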

Today I updated to pandas-profiling v3.3.0 on Windows with Python 3.9.12 because another issue was fixed in that release, but I now get the same RuntimeWarning that Joseph Ramon reported. I am using the dataset from this page: https://unfallatlas.statistikportal.de/_opendata2022.html (Download "Unfallorte 2021 - CSV-Format (zip)"), read in with delimiter ';'.

The code: preprocessing to change the German number formats to English ones:

convert_dict = {'XGCSWGS84': float, 'YGCSWGS84': float, 'LINREFX': float, 'LINREFY': float}

crash_df['XGCSWGS84'] = crash_df['XGCSWGS84'].apply(lambda x: x.replace('.','').replace(',','.'))
crash_df['YGCSWGS84'] = crash_df['YGCSWGS84'].apply(lambda x: x.replace('.','').replace(',','.'))
crash_df['LINREFX'] = crash_df['LINREFX'].apply(lambda x: x.replace('.','').replace(',','.'))
crash_df['LINREFY'] = crash_df['LINREFY'].apply(lambda x: x.replace('.','').replace(',','.'))

crash_df[['XGCSWGS84', 'YGCSWGS84', 'LINREFX', 'LINREFY']] = crash_df[
    ['XGCSWGS84', 'YGCSWGS84', 'LINREFX', 'LINREFY']].astype(convert_dict)

The report:

crash_2021_profile = ProfileReport(crash_df, title='Destatis Unfallstatistik 2021 Profiling Report')
crash_2021_profile.to_file('./reports/unfallstatistik2021_profiling_report.html')

But after looking at the terminal, I see there is another message:

warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
C:\anaconda\Anaconda3\envs<...>PoC\lib\site-packages\scipy\__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.1

Could that be the root cause?

IloBe avatar Sep 19 '22 09:09 IloBe

Hi @IloBe, this looks like an issue with the pandas-profiling dependencies. Can you install numpy==1.22.4 and check whether it runs correctly?

alexbarros avatar Sep 19 '22 10:09 alexbarros

@alexbarros I tried it with numpy==1.22.4, but I still get the same warning:

RuntimeWarning: overflow encountered in longlong_scalars
  (2 * xtie * ytie) / m + x0 * y0 / (9 * m * (size - 2)))

So the NumPy version is not the root cause.

IloBe avatar Sep 19 '22 13:09 IloBe
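The expression in the warning multiplies tie terms that grow quickly with the number of rows, which would be consistent with the NumPy version making no difference. Below is a rough sketch in plain Python, assuming the tie terms are integer sums over groups of equal values (t(t-1)/2 and t(t-1)(t-2) for a group of size t), as the SciPy line in the warning suggests. The helper names and the 100,000-row toy columns are invented for illustration.

```python
from collections import Counter

INT64_MAX = 2**63 - 1  # largest value a C long long (NumPy int64) can hold


def tie_terms(values):
    """Return (xtie, x0) for one column: sums over groups of tied values."""
    group_sizes = Counter(values).values()
    xtie = sum(t * (t - 1) // 2 for t in group_sizes)
    x0 = sum(t * (t - 1) * (t - 2) for t in group_sizes)
    return xtie, x0


def wrap_int64(n):
    """Simulate what fixed-width int64 arithmetic keeps of an exact result."""
    return (n + 2**63) % 2**64 - 2**63


# Two toy columns with 100,000 rows and only two distinct values each.
x = [0] * 60_000 + [1] * 40_000
y = [0] * 50_000 + [1] * 50_000

xtie, x0 = tie_terms(x)
ytie, y0 = tie_terms(y)

exact = x0 * y0              # Python ints never overflow
wrapped = wrap_int64(exact)  # what an int64 product would be left with

print("x0 * y0 exceeds int64:", exact > INT64_MAX)
print("2 * xtie * ytie exceeds int64:", 2 * xtie * ytie > INT64_MAX)
print("wrapped value differs from exact:", wrapped != exact)
```

In exact arithmetic both products are beyond the int64 range even for these toy columns, so larger datasets with many tied values would hit the same limit whenever the multiplication is done on 64-bit integer scalars.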