Common Values incorrectly reporting (missing)
Current Behaviour
OS: Mac, Python: 3.11, Interface: Jupyter Lab, pip: 22.3.1
| DEPARTURE_DELAY | ARRIVAL_DELAY | DISTANCE | SCHEDULED_DEPARTURE |
|---|---|---|---|
| -11.0 | -22.0 | 1448 | 0.08333333333333333 |
| -8.0 | -9.0 | 2330 | 0.16666666666666666 |
| -2.0 | 5.0 | 2296 | 0.3333333333333333 |
| -5.0 | -9.0 | 2342 | 0.3333333333333333 |
| -1.0 | -21.0 | 1448 | 0.4166666666666667 |
It appears that when a column's value_counts contains more than 200 entries, the Common values section misattributes the remainder:
- rows belonging to values beyond the top 200 are categorised as (Missing)

This contradicts the Missing and Missing (%) figures in the variable's main statistics. A sketch of the suspected accounting error follows.
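A minimal pandas sketch of the suspected accounting error (the 200-entry cap is inferred from the behaviour above; the data here is synthetic):

```python
import pandas as pd

# Synthetic column: 1000 distinct values, 5 rows each, and no nulls at all
col = pd.Series(range(1000)).repeat(5)

# Suspected truncation: only the top 200 value counts are kept
value_counts = col.value_counts().head(200)

# If "missing" is then derived as total rows minus the truncated counts,
# the 800 values outside the top 200 are mislabelled as missing
derived_missing = len(col) - value_counts.sum()
print(derived_missing)   # 4000 rows wrongly shown as (Missing)
print(col.isna().sum())  # 0 rows actually missing
```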
Expected Behaviour
The (Missing) row within Common values should either be removed, or the remainder should be attributed to "Other values".
Data Description
https://github.com/plotly/datasets/blob/master/2015_flights.parquet
Code that reproduces the bug
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from ydata_profiling import ProfileReport
import json

spark = SparkSession.builder.appName("ydata").getOrCreate()
spark_df = spark.read.parquet("ydata-test/2015_flights.parquet")

# Rows that are genuinely non-null in the affected column
n_notnull = spark_df.filter(F.col("SCHEDULED_DEPARTURE").isNotNull()).count()

profile = ProfileReport(spark_df, minimal=True)
# Sum of the value counts the report actually kept for the column
value_counts_values = sum(json.loads(profile.to_json())["variables"]["SCHEDULED_DEPARTURE"]["value_counts_without_nan"].values())

missing_common_values = 1650418  # (Missing) count as per the HTML report
# The (Missing) figure equals exactly the non-null rows dropped by the truncation
assert missing_common_values == (n_notnull - value_counts_values)
```
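Continuing from the snippet above, the inconsistency can also be shown from the JSON itself (the `n_missing` field name is assumed from typical ydata-profiling output and has not been verified against 4.5.1):

```python
# Sketch of a cross-check, reusing the objects from the reproduction above
report = json.loads(profile.to_json())
var_stats = report["variables"]["SCHEDULED_DEPARTURE"]
print(var_stats.get("n_missing"))       # the variable's own missing count
print(n_notnull - value_counts_values)  # rows mislabelled as (Missing)
```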
pandas-profiling version
4.5.1
Dependencies
ydata-profiling==4.5.1
pyspark==3.4.1
pandas==2.0.3
numpy==1.23.5
OS
macOS
Checklist
- [X] There is not yet another bug report for this issue in the issue tracker
- [X] The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
- [X] The issue has not been resolved by the entries listed under Common Issues.
@fabclmnt @aquemy - This issue can be resolved by removing the `limit(200)` from `describe_counts_spark.py`. I know there are comments in the code regarding performance, but I think usability overrides performance concerns at this point, since the limit is producing incorrect output as shown above. Happy to create a PR with unit tests if that's welcome.
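Alternatively, if the cap is worth keeping for performance, here is a minimal sketch of a fix (the function name and return shape are illustrative, not the actual internals of `describe_counts_spark.py`): keep the top-N truncation, but aggregate the tail separately so it can be reported as "Other values":

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def value_counts_with_other(df: DataFrame, column: str, top_n: int = 200):
    """Illustrative sketch: top-N value counts plus an explicit remainder,
    so the tail can be rendered as "Other values" instead of (Missing)."""
    n_total = df.count()
    n_missing = df.filter(F.col(column).isNull()).count()
    n_notnull = n_total - n_missing

    counts = (
        df.filter(F.col(column).isNotNull())
        .groupBy(column)
        .count()
        .orderBy(F.desc("count"))
    )
    # Truncation kept for performance, but the remainder is accounted for
    top_rows = counts.limit(top_n).collect()
    n_other = n_notnull - sum(row["count"] for row in top_rows)
    return top_rows, n_other, n_missing
```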
In terms of improving this, are we open to making a Spark version of `freq_table`?
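If so, a Spark-backed `freq_table` could be fed by something like the sketch below (a hypothetical helper; `freq_table`'s actual signature in ydata-profiling may differ). It materialises only the top-N counts as a pandas Series, while the tail from the previous sketch supplies the "Other values" row:

```python
import pandas as pd
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def spark_value_counts(df: DataFrame, column: str, top_n: int = 200) -> pd.Series:
    """Sketch: a pandas value-counts Series built from a Spark column,
    shaped like the input the pandas-based frequency table rendering consumes."""
    rows = (
        df.filter(F.col(column).isNotNull())
        .groupBy(column)
        .count()
        .orderBy(F.desc("count"))
        .limit(top_n)
        .collect()
    )
    return pd.Series({row[column]: row["count"] for row in rows}, name=column)
```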