ydata-profiling
KeyError 'tinyint' during profiling on Apache Spark DataFrame
Current Behaviour
I encountered an error while attempting to run profiling on an Apache Spark DataFrame loaded from Parquet files. The specific error message I received is as follows:
Traceback (most recent call last):
File "/tmp/profile.py", line 41, in <module>
profile.to_html()
File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 468, in to_html
return self.html
File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 275, in html
self._html = self._render_html()
File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 383, in _render_html
report = self.report
File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 269, in report
self._report = get_report_structure(self.config, self.description_set)
File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 256, in description_set
self._sample,
File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/describe.py", line 73, in describe
config, df, summarizer, typeset, pbar
File "/home/spark/.local/lib/python3.7/site-packages/multimethod/__init__.py", line 315, in __call__
return func(*args, **kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/spark/summary_spark.py", line 93, in spark_get_series_descriptions
executor.imap_unordered(multiprocess_1d, args)
File "/usr/lib64/python3.7/multiprocessing/pool.py", line 748, in next
raise value
File "/usr/lib64/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/spark/summary_spark.py", line 88, in multiprocess_1d
return column, describe_1d(config, df.select(column), summarizer, typeset)
File "/home/spark/.local/lib/python3.7/site-packages/multimethod/__init__.py", line 315, in __call__
return func(*args, **kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/spark/summary_spark.py", line 62, in spark_describe_1d
}[dtype]
KeyError: 'tinyint'
I believe the issue can be resolved by adding data types such as "tinyint" and "smallint" to the mapping in summary_spark.py. Does that seem like the right solution? If so, I could try submitting a PR.
https://github.com/ydataai/ydata-profiling/blob/cfb020d9ad0ce7ef3be53962763b7a57b88732f9/src/ydata_profiling/model/spark/summary_spark.py#L52-L62
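To illustrate the proposed fix, here is a minimal standalone sketch of the dtype lookup that the linked `spark_describe_1d` code performs. The mapping keys and the `lookup_vtype` helper are hypothetical stand-ins (the real code maps Spark dtype strings to ydata-profiling typeset types); the point is simply that adding `tinyint` and `smallint` entries makes the lookup succeed for the smaller integer types that Parquet schemas can produce.

```python
# Hypothetical sketch of the dtype-to-variable-type lookup in
# spark_describe_1d (summary_spark.py). The string values stand in for
# ydata-profiling's internal typeset classes (Numeric, Categorical, ...).
SPARK_DTYPE_MAP = {
    "float": "Numeric",
    "double": "Numeric",
    "int": "Numeric",
    "bigint": "Numeric",
    # Proposed additions: smaller integer types seen in Parquet schemas
    "tinyint": "Numeric",
    "smallint": "Numeric",
    "string": "Categorical",
    "boolean": "Boolean",
}

def lookup_vtype(dtype: str) -> str:
    """Resolve a Spark dtype string; raise a descriptive error otherwise."""
    try:
        return SPARK_DTYPE_MAP[dtype]
    except KeyError as exc:
        raise KeyError(f"Unsupported Spark dtype: {dtype!r}") from exc
```

With the two extra entries, `lookup_vtype("tinyint")` resolves to the numeric handler instead of raising the `KeyError` shown in the traceback above.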
Expected Behaviour
Profiling runs
Data Description
Private dataset
Code that reproduces the bug
from ydata_profiling import ProfileReport

df = ...
profile = ProfileReport(
    df,
    title='Title',
    infer_dtypes=False,
    interactions=None,
    missing_diagrams=None,
    correlations={
        'auto': {'calculate': False},
        'pearson': {'calculate': True},
        'spearman': {'calculate': True},
    },
)
pandas-profiling version
v4.3.1
Dependencies
...
OS
Spark cluster
Checklist
- [X] There is not yet another bug report for this issue in the issue tracker
- [X] The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
- [X] The issue has not been resolved by the entries listed under Common Issues.
@fabclmnt I ran into this bug while working with timestamp_ntz data types. If @talgatomarov is not interested, I can attempt a fix covering the various data types.
Since no workaround has been mentioned yet, here is what worked for me. In PySpark, print the schema of the Spark table and cast columns with the 'short' dtype to 'int'. If you are converting the PySpark DataFrame to pandas, print the dtypes and change smallint or tinyint columns to int.
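The pandas side of that workaround can be sketched as follows, assuming the Spark DataFrame has already been brought over with `toPandas()` (the sample frame here is a hypothetical stand-in: pandas `int8` corresponds to Spark's tinyint and `int16` to smallint):

```python
import pandas as pd

# Stand-in for spark_df.toPandas(): small integer dtypes as they would
# arrive from tinyint/smallint Parquet columns.
pdf = pd.DataFrame({
    "a": pd.array([1, 2, 3], dtype="int8"),    # tinyint-like
    "b": pd.array([10, 20, 30], dtype="int16"),  # smallint-like
})

# Upcast every int8/int16 column to int64 so the profiler's dtype
# lookup recognises them as plain integers.
small_ints = pdf.select_dtypes(include=["int8", "int16"]).columns
pdf[small_ints] = pdf[small_ints].astype("int64")
```

After the cast, profiling the pandas frame avoids the `KeyError` because no tinyint/smallint-backed dtypes remain.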