KeyError 'tinyint' during profiling on Apache Spark DataFrame

Open talgatomarov opened this issue 1 year ago • 2 comments

Current Behaviour

I encountered an error while attempting to run profiling on an Apache Spark DataFrame. The Spark DataFrame contains data retrieved from parquet files. The specific error message I received is as follows:

Traceback (most recent call last):
  File "/tmp/profile.py", line 41, in <module>
    profile.to_html()
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 468, in to_html
    return self.html
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 275, in html
    self._html = self._render_html()
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 383, in _render_html
    report = self.report
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 269, in report
    self._report = get_report_structure(self.config, self.description_set)
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 256, in description_set
    self._sample,
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/describe.py", line 73, in describe
    config, df, summarizer, typeset, pbar
  File "/home/spark/.local/lib/python3.7/site-packages/multimethod/__init__.py", line 315, in __call__
    return func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/spark/summary_spark.py", line 93, in spark_get_series_descriptions
    executor.imap_unordered(multiprocess_1d, args)
  File "/usr/lib64/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
  File "/usr/lib64/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/spark/summary_spark.py", line 88, in multiprocess_1d
    return column, describe_1d(config, df.select(column), summarizer, typeset)
  File "/home/spark/.local/lib/python3.7/site-packages/multimethod/__init__.py", line 315, in __call__
    return func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/spark/summary_spark.py", line 62, in spark_describe_1d
    }[dtype]
KeyError: 'tinyint'

I believe the issue can be resolved by adding data types such as "tinyint" and "smallint" to the dtype lookup in summary_spark.py. Do you think this is the right solution? If so, I could try submitting a PR.

https://github.com/ydataai/ydata-profiling/blob/cfb020d9ad0ce7ef3be53962763b7a57b88732f9/src/ydata_profiling/model/spark/summary_spark.py#L52-L62
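
For illustration, the proposed change might look something like the sketch below. It assumes the lookup at the linked lines is a plain dict keyed by Spark's simple type strings (the traceback shows a `}[dtype]` lookup failing on 'tinyint'). The names `SPARK_DTYPE_TO_PROFILE_TYPE` and `resolve_profile_dtype`, and the target type names on the right-hand side, are hypothetical and are not taken from the library's code.

```python
# Hypothetical sketch, not the library's actual code: map Spark simple type
# strings to the coarse type names a profiler might use internally.
SPARK_DTYPE_TO_PROFILE_TYPE = {
    "tinyint": "int",    # proposed addition
    "smallint": "int",   # proposed addition
    "int": "int",
    "bigint": "int",
    "float": "float",
    "double": "float",
    "string": "object",
    "boolean": "boolean",
}


def resolve_profile_dtype(spark_dtype: str) -> str:
    """Look up the profile type, failing with a descriptive error if unmapped."""
    try:
        return SPARK_DTYPE_TO_PROFILE_TYPE[spark_dtype]
    except KeyError:
        raise TypeError(
            f"Unsupported Spark dtype for profiling: {spark_dtype!r}"
        ) from None
```

Raising a descriptive error for unmapped types, rather than a bare KeyError, would make similar reports easier to diagnose.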

Expected Behaviour

Profiling completes without error on a Spark DataFrame that contains tinyint columns.

Data Description

Private dataset

Code that reproduces the bug

from ydata_profiling import ProfileReport

df = ...

profile = ProfileReport(
    df,
    title='Title',
    infer_dtypes=False,
    interactions=None,
    missing_diagrams=None,
    correlations={'auto': {'calculate': False},
                  'pearson': {'calculate': True},
                  'spearman': {'calculate': True}},
    )

pandas-profiling version

v4.3.1

Dependencies

...

OS

Spark cluster

Checklist

  • [X] There is not yet another bug report for this issue in the issue tracker
  • [X] The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • [X] The issue has not been resolved by the entries listed under Common Issues.

talgatomarov avatar Jun 30 '23 21:06 talgatomarov

@fabclmnt I have faced this bug while working on timestamp_ntz data types. If @talgatomarov is uninterested, I can attempt to resolve it for various data types.

oguzhangur96 avatar Nov 13 '23 12:11 oguzhangur96

Since no workaround has been mentioned yet, here is what worked for me. For PySpark, print the schema of the Spark table and cast any columns that have the 'short' dtype to 'int'. If you are converting the PySpark DataFrame to pandas, print the dtypes and change smallint or tinyint columns to int.
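
A minimal sketch of that workaround for the PySpark case, assuming `df` is a PySpark DataFrame (its `dtypes` attribute lists `(column name, type string)` pairs such as `("age", "tinyint")`). The helper names `narrow_int_columns` and `widen_narrow_ints` are hypothetical, and this has not been checked against every ydata-profiling version.

```python
# Spark type strings that the profiler's lookup reportedly rejects.
NARROW_INT_DTYPES = {"tinyint", "smallint"}


def narrow_int_columns(dtypes):
    """Return the names of columns whose Spark dtype is tinyint or smallint.

    `dtypes` is a list of (name, dtype) pairs, as given by DataFrame.dtypes.
    """
    return [name for name, dtype in dtypes if dtype in NARROW_INT_DTYPES]


def widen_narrow_ints(df):
    """Cast every tinyint/smallint column of a PySpark DataFrame to int."""
    for name in narrow_int_columns(df.dtypes):
        df = df.withColumn(name, df[name].cast("int"))
    return df
```

Calling `df = widen_narrow_ints(df)` before passing `df` to `ProfileReport` should avoid the 'tinyint' lookup. Other unsupported types, such as timestamp_ntz, would need their own cast.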

hb0313 avatar Feb 13 '24 16:02 hb0313