Common Values incorrectly reporting (missing)
Current Behaviour
OS: Mac, Python: 3.11, Interface: Jupyter Lab, pip: 22.3.1
| DEPARTURE_DELAY | ARRIVAL_DELAY | DISTANCE | SCHEDULED_DEPARTURE |
|---|---|---|---|
| -11.0 | -22.0 | 1448 | 0.08333333333333333 |
| -8.0 | -9.0 | 2330 | 0.16666666666666666 |
| -2.0 | 5.0 | 2296 | 0.3333333333333333 |
| -5.0 | -9.0 | 2342 | 0.3333333333333333 |
| -1.0 | -21.0 | 1448 | 0.4166666666666667 |
It appears that when a column's value_counts contains more than 200 entries, the Common values section misattributes the remainder:
- rows belonging to values beyond the top 200 are categorised as (Missing)

This contradicts the Missing and Missing (%) figures in the variable's main statistics. A sketch of the suspected accounting error follows.
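A minimal pandas sketch of the suspected accounting error (the 200-entry cap is inferred from the behaviour above; the data here is synthetic):

```python
import pandas as pd

# Synthetic column: 1000 distinct values, 5 rows each, and no nulls at all
col = pd.Series(range(1000)).repeat(5)

# Suspected truncation: only the top 200 value counts are kept
value_counts = col.value_counts().head(200)

# If "missing" is then derived as total rows minus the truncated counts,
# the 800 values outside the top 200 are mislabelled as missing
derived_missing = len(col) - value_counts.sum()
print(derived_missing)   # 4000 rows wrongly shown as (Missing)
print(col.isna().sum())  # 0 rows actually missing
```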
Expected Behaviour
The (Missing) row within Common values should either be removed, or the remainder should be attributed to "Other values".
Data Description
https://github.com/plotly/datasets/blob/master/2015_flights.parquet
Code that reproduces the bug
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from ydata_profiling import ProfileReport
import json

spark = SparkSession.builder.appName("ydata").getOrCreate()
spark_df = spark.read.parquet("ydata-test/2015_flights.parquet")

# Rows that are genuinely non-null in the affected column
n_notnull = spark_df.filter(F.col("SCHEDULED_DEPARTURE").isNotNull()).count()

profile = ProfileReport(spark_df, minimal=True)
# Sum of the value counts the report actually kept for the column
value_counts_values = sum(json.loads(profile.to_json())["variables"]["SCHEDULED_DEPARTURE"]["value_counts_without_nan"].values())

missing_common_values = 1650418  # (Missing) count as per the HTML report
# The (Missing) figure equals exactly the non-null rows dropped by the truncation
assert missing_common_values == (n_notnull - value_counts_values)
```
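Continuing from the snippet above, the inconsistency can also be shown from the JSON itself (the `n_missing` field name is assumed from typical ydata-profiling output and has not been verified against 4.5.1):

```python
# Sketch of a cross-check, reusing the objects from the reproduction above
report = json.loads(profile.to_json())
var_stats = report["variables"]["SCHEDULED_DEPARTURE"]
print(var_stats.get("n_missing"))       # the variable's own missing count
print(n_notnull - value_counts_values)  # rows mislabelled as (Missing)
```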
pandas-profiling version
4.5.1
Dependencies
ydata-profiling==4.5.1
pyspark==3.4.1
pandas==2.0.3
numpy==1.23.5
OS
macOS
Checklist
- [X] There is not yet another bug report for this issue in the issue tracker
- [X] The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
- [X] The issue has not been resolved by the entries listed under Common Issues.
@fabclmnt @aquemy - This issue can be resolved by removing the `limit(200)` from `describe_counts_spark.py`. I know there are comments in the code regarding performance, but I think usability overrides performance concerns at this point, since the limit is producing incorrect output as shown above. Happy to create a PR with unit tests if that's welcome.
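Alternatively, if the cap is worth keeping for performance, here is a minimal sketch of a fix (the function name and return shape are illustrative, not the actual internals of `describe_counts_spark.py`): keep the top-N truncation, but aggregate the tail separately so it can be reported as "Other values":

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def value_counts_with_other(df: DataFrame, column: str, top_n: int = 200):
    """Illustrative sketch: top-N value counts plus an explicit remainder,
    so the tail can be rendered as "Other values" instead of (Missing)."""
    n_total = df.count()
    n_missing = df.filter(F.col(column).isNull()).count()
    n_notnull = n_total - n_missing

    counts = (
        df.filter(F.col(column).isNotNull())
        .groupBy(column)
        .count()
        .orderBy(F.desc("count"))
    )
    # Truncation kept for performance, but the remainder is accounted for
    top_rows = counts.limit(top_n).collect()
    n_other = n_notnull - sum(row["count"] for row in top_rows)
    return top_rows, n_other, n_missing
```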
In terms of improving this, are we open to making a Spark version of `freq_table`?
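If so, a Spark-backed `freq_table` could be fed by something like the sketch below (a hypothetical helper; `freq_table`'s actual signature in ydata-profiling may differ). It materialises only the top-N counts as a pandas Series, while the tail from the previous sketch supplies the "Other values" row:

```python
import pandas as pd
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def spark_value_counts(df: DataFrame, column: str, top_n: int = 200) -> pd.Series:
    """Sketch: a pandas value-counts Series built from a Spark column,
    shaped like the input the pandas-based frequency table rendering consumes."""
    rows = (
        df.filter(F.col(column).isNotNull())
        .groupBy(column)
        .count()
        .orderBy(F.desc("count"))
        .limit(top_n)
        .collect()
    )
    return pd.Series({row[column]: row["count"] for row in rows}, name=column)
```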