ydata-profiling
Comparing datetime and str columns crashes with TypeError
Current Behaviour
When comparing two simple datasets in which a column has type datetime in one and type string in the other, compare() crashes. Interestingly, the outcome depends on which report calls compare(): string_data_report.compare(datetime_data_report) works, but datetime_data_report.compare(string_data_report) crashes.
Code to reproduce below, here is the output.
/Users/jk/progs/profile-test/venv/bin/python /Users/jk/progs/profile-test/ydata_bugs.py
Python 3.11.4 (main, Jun 29 2023, 21:37:20) [Clang 12.0.0 (clang-1200.0.32.29)]
Pandas 2.0.3
ydata_profiling v4.3.1
Summarize dataset: 100%|██████████| 9/9 [00:00<00:00, 42.25it/s, Completed]
Summarize dataset: 100%|██████████| 9/9 [00:00<00:00, 67.19it/s, Completed]
-----
Summarize dataset: 100%|██████████| 9/9 [00:00<00:00, 67.49it/s, Completed]
Summarize dataset: 0%| | 0/6 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/multimethod/__init__.py", line 315, in __call__
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/summary_algorithms.py", line 68, in inner
return fn(config, series, summary)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/summary_algorithms.py", line 85, in inner
return fn(config, series, summary)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/pandas/describe_date_pandas.py", line 34, in pandas_describe_date_1d
"min": pd.Timestamp.to_pydatetime(series.min()),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: descriptor 'to_pydatetime' for 'pandas._libs.tslibs.timestamps._Timestamp' objects doesn't apply to a 'str' object
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/multimethod/__init__.py", line 315, in __call__
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/pandas/summary_pandas.py", line 57, in pandas_describe_1d
return summarizer.summarize(config, series, dtype=vtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/summarizer.py", line 42, in summarize
_, _, summary = self.handle(str(dtype), config, series, {"type": str(dtype)})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/handler.py", line 62, in handle
return op(*args)
^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/handler.py", line 21, in func2
return f(*res)
^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/handler.py", line 21, in func2
return f(*res)
^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/handler.py", line 21, in func2
return f(*res)
^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/handler.py", line 17, in func2
res = g(*x)
^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/multimethod/__init__.py", line 317, in __call__
raise DispatchError(f"Function {func.__code__}") from ex
multimethod.DispatchError: Function <code object inner at 0x134307960, file "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/summary_algorithms.py", line 62>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/multimethod/__init__.py", line 315, in __call__
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/pandas/summary_pandas.py", line 99, in pandas_get_series_descriptions
for i, (column, description) in enumerate(
File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 873, in next
raise value
File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/pandas/summary_pandas.py", line 79, in multiprocess_1d
return column, describe_1d(config, series, summarizer, typeset)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/multimethod/__init__.py", line 317, in __call__
raise DispatchError(f"Function {func.__code__}") from ex
multimethod.DispatchError: Function <code object pandas_describe_1d at 0x7fa1d47960d0, file "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/pandas/summary_pandas.py", line 19>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/jk/progs/profile-test/ydata_bugs.py", line 26, in <module>
compare_ydata(df2, df1)
File "/Users/jk/progs/profile-test/ydata_bugs.py", line 16, in compare_ydata
report1.compare(report2)
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/profile_report.py", line 543, in compare
return compare([self, other], config if config is not None else self.config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/compare_reports.py", line 339, in compare
labels, descriptions = _compare_profile_report_preprocess(reports, _config) # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/compare_reports.py", line 146, in _compare_profile_report_preprocess
descriptions = [report.get_description() for report in reports]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/compare_reports.py", line 146, in <listcomp>
descriptions = [report.get_description() for report in reports]
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/profile_report.py", line 320, in get_description
return self.description_set
^^^^^^^^^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/profile_report.py", line 251, in description_set
self._description_set = describe_df(
^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/describe.py", line 72, in describe
series_description = get_series_descriptions(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/multimethod/__init__.py", line 317, in __call__
raise DispatchError(f"Function {func.__code__}") from ex
multimethod.DispatchError: Function <code object pandas_get_series_descriptions at 0x7fa1d47977d0, file "/Users/jk/progs/profile-test/venv/lib/python3.11/site-packages/ydata_profiling/model/pandas/summary_pandas.py", line 60>
Process finished with exit code 1
Expected Behaviour
Should show a comparison and not crash.
Data Description
See below.
Code that reproduces the bug
import sys

import pandas as pd
import ydata_profiling
from ydata_profiling import ProfileReport

print('Python', sys.version)
print('Pandas', pd.__version__)
print('ydata_profiling', ydata_profiling.__version__, '\n')


def compare_ydata(df1, df2):
    # Profile both frames, then compare the first report against the second.
    report1 = ProfileReport(df1, title='df1', interactions=None, correlations=None)
    report2 = ProfileReport(df2, title='df2', interactions=None, correlations=None)
    report1.compare(report2)


# Column 'a' holds strings in df1 and datetimes in df2.
df1 = pd.DataFrame({'a': ['2023-01-17 05:12:40', '2023-01-17 05:02:38']})
df2 = pd.DataFrame({'a': pd.to_datetime(['2023-06-08 07:02:00', '2023-01-05 08:07:00'])})

# Works
compare_ydata(df1, df2)
print('-----')

# Crashes
compare_ydata(df2, df1)
pandas-profiling version
v4.3.1
Dependencies
pandas==2.0.3
OS
MacOS 13.4
Checklist
- [X] There is not yet another bug report for this issue in the issue tracker
- [X] The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
- [X] The issue has not been resolved by the entries listed under Common Issues.
Hi @jkleint,
Can you share a bit more context please? Based on our understanding, you want to compare 2 datasets with the same variable names, but with different data types identified. Is that correct?
@jkleint I was able to resolve this by updating Conda, updating the Pillow library, and then manually deleting the DroidSansMono.ttf file in this folder: C:\ProgramData\Anaconda3\Lib\site-packages\wordcloud\
> Hi @jkleint,
> Can you share a bit more context please? Based on our understanding, you want to compare 2 datasets with the same variable names, but with different data types identified. Is that correct?
Yes.
It's not that I want to, but dirty data happens, and a good tool should do something reasonable besides crash. The base level would be reporting that they have different types and not trying to compare. Ideal would be trying to coerce compatible types and do the comparison (or report when that's not possible).
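The "coerce compatible types" idea can be sketched with plain pandas, applied before the frames are handed to ProfileReport. This is only an illustration under stated assumptions: `align_datetime_columns` is a hypothetical helper, not part of ydata-profiling, and it handles just the string-versus-datetime case from this report.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_string_dtype


def align_datetime_columns(left: pd.DataFrame, right: pd.DataFrame):
    """Parse string columns to datetime where the other frame's column is datetime.

    Hypothetical helper for illustration; not part of ydata-profiling.
    """
    left, right = left.copy(), right.copy()
    for col in left.columns.intersection(right.columns):
        for target, other in ((left, right), (right, left)):
            if is_datetime64_any_dtype(other[col]) and is_string_dtype(target[col]):
                parsed = pd.to_datetime(target[col], errors="coerce")
                # Accept the conversion only if no non-null value failed to parse.
                if parsed.notna().sum() == target[col].notna().sum():
                    target[col] = parsed
    return left, right


df1 = pd.DataFrame({"a": ["2023-01-17 05:12:40", "2023-01-17 05:02:38"]})
df2 = pd.DataFrame({"a": pd.to_datetime(["2023-06-08 07:02:00", "2023-01-05 08:07:00"])})
df1_aligned, df2_aligned = align_datetime_columns(df1, df2)
print(df1_aligned.dtypes, df2_aligned.dtypes, sep="\n")
```

Columns that do not parse cleanly are left untouched, so they would still need to be reported or skipped separately.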
Thanks!
@jkleint I asked in order to better understand the use case that you are trying to run / solve.
If this is a feature request, then it will be considered as such. If this is something that I can help you with through a workaround, I am more than happy to provide one.
Regarding your request: yes, dirty data is to be expected, and we took that into consideration. Nevertheless, there is a very valid reason why this is not supported yet: metrics of 2 different data types are not comparable, so the resulting comparison would not make sense.
To overcome the error, you can always define the schema of the data prior to running the report. This allows you to avoid the errors that are raised.
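One possible reading of "define the schema" is to cast both frames to an agreed set of dtypes with pandas before building the reports. The sketch below assumes that reading; `SCHEMA` and `apply_schema` are hypothetical names, and the thread does not say which mechanism the maintainers had in mind.

```python
import pandas as pd

# Hypothetical schema: column name -> pandas dtype string.
SCHEMA = {"a": "datetime64[ns]"}


def apply_schema(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Cast the columns named in the schema; raises if a value cannot be converted."""
    present = {col: dtype for col, dtype in schema.items() if col in df.columns}
    return df.astype(present)


df1 = pd.DataFrame({"a": ["2023-01-17 05:12:40", "2023-01-17 05:02:38"]})
df2 = pd.DataFrame({"a": pd.to_datetime(["2023-06-08 07:02:00", "2023-01-05 08:07:00"])})
df1_typed = apply_schema(df1, SCHEMA)
df2_typed = apply_schema(df2, SCHEMA)
print(df1_typed.dtypes, df2_typed.dtypes, sep="\n")
```

Both typed frames then carry the same dtype for column `a` and can be profiled and compared in either order.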
I'm building a generic "compare datasets" tool for many data scientists to use. I want to say "point this tool at your data to see what's different." I want it to be super simple, one line of code, no knowledge required to use. I do not have control over their data; I don't even see it. Often it's very dirty; sometimes that means columns with the same names and different types. If the tool crashes in this case, it's not helpful, and people aren't going to use it.
I know you build your software to high standards, and you'd agree that ideally your software would not outright crash in any case, but handle errors gracefully. I've shared what looks to me like a bug; I say this because report1.compare(report2) works, and it seems there is some basic logic to handle differing types; but report2.compare(report1) crashes. It seems like the order of comparison shouldn't matter, and at the very least shouldn't crash in one case and work in the other. I feel the graceful fix is to recognize when columns have incomparable types, note that, and proceed with the comparison. Then just say in the report that the types were X and Y and so couldn't be compared. I hope you agree that's reasonable. Thanks!
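The "note the mismatch and proceed" behaviour described above could be approximated on the caller's side with a pre-check on the two DataFrames. This is a minimal sketch, not how ydata-profiling works internally; `split_comparable_columns` is a hypothetical helper, and it compares pandas dtypes rather than the types the profiler infers.

```python
import pandas as pd


def split_comparable_columns(left: pd.DataFrame, right: pd.DataFrame):
    """Return (comparable, mismatched) for the columns the two frames share.

    Hypothetical helper for illustration; not part of ydata-profiling.
    """
    comparable = []
    mismatched = {}
    for col in left.columns.intersection(right.columns):
        left_dtype, right_dtype = left[col].dtype, right[col].dtype
        if left_dtype == right_dtype:
            comparable.append(col)
        else:
            # Record both dtypes so they can be reported to the user.
            mismatched[col] = (str(left_dtype), str(right_dtype))
    return comparable, mismatched


df1 = pd.DataFrame({"a": ["2023-01-17 05:12:40", "2023-01-17 05:02:38"], "b": [1, 2]})
df2 = pd.DataFrame({
    "a": pd.to_datetime(["2023-06-08 07:02:00", "2023-01-05 08:07:00"]),
    "b": [3, 4],
})
comparable, mismatched = split_comparable_columns(df1, df2)
for col, (left_type, right_type) in mismatched.items():
    print(f"Skipping column {col!r}: types {left_type} and {right_type} differ")
```

The caller could then build the reports from `df1[comparable]` and `df2[comparable]` and list the skipped columns alongside the comparison.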
I second this. Recently upgraded to ydata-profiling from pandas-profiling and I can't seem to run reports on old data because of this.
It would be nice if it failed gracefully since I think many people use this library in the early stages of data analysis when you have very little idea what your data looks like. Defining a schema isn't a trivial task (in my case I can have hundreds of columns).
At the minimum it'd be nice if it threw a more helpful error.