
Common Values incorrectly reporting (missing)

Open danhosanee opened this issue 2 years ago • 2 comments

Current Behaviour

OS: macOS · Python: 3.11 · Interface: Jupyter Lab · pip: 22.3.1

Dataset (first rows):

DEPARTURE_DELAY ARRIVAL_DELAY DISTANCE SCHEDULED_DEPARTURE
-11.0 -22.0 1448 0.08333333333333333
-8.0 -9.0 2330 0.16666666666666666
-2.0 5.0 2296 0.3333333333333333
-5.0 -9.0 2342 0.3333333333333333
-1.0 -21.0 1448 0.4166666666666667

It appears that when a column has more than 200 distinct values, the Common Values section miscounts:

  • the counts for everything beyond the first 200 values are lumped together and categorised as (Missing)

This contradicts the Missing and Missing (%) figures in the variable's main statistics.
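The arithmetic behind the discrepancy can be illustrated with a small pandas simulation. This is a hypothetical reconstruction of the suspected behaviour, not ydata-profiling's actual code: only the first 200 value counts are kept, and the shortfall against the non-null row count surfaces as a phantom "(Missing)" figure.

```python
import numpy as np
import pandas as pd

# Simulate a column like SCHEDULED_DEPARTURE: many distinct values, no NaNs.
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 1440, size=10_000))  # up to 1440 distinct values

full_counts = s.value_counts()
kept = full_counts.head(200)            # mimics the limit(200) applied by the Spark backend
phantom_missing = len(s) - kept.sum()   # rows silently dropped by the limit

# phantom_missing is positive even though the series has no missing values,
# which is exactly the contradiction visible in the report.
print(phantom_missing)
```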

[Screenshot: Common Values table showing a large "(Missing)" row]

Expected Behaviour

The (Missing) row within Common Values should either be removed, or the remainder should be attributed to "Other values" instead.

Data Description

https://github.com/plotly/datasets/blob/master/2015_flights.parquet

Code that reproduces the bug

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from ydata_profiling import ProfileReport
import json

spark = SparkSession.builder.appName("ydata").getOrCreate()

spark_df = spark.read.parquet("ydata-test/2015_flights.parquet")

# Non-null rows for the column, counted by Spark itself
n_notnull = spark_df.filter(F.col("SCHEDULED_DEPARTURE").isNotNull()).count()

profile = ProfileReport(spark_df, minimal=True)

# Sum of the value counts the profile actually kept for the column
value_counts_values = sum(
    json.loads(profile.to_json())["variables"]["SCHEDULED_DEPARTURE"]["value_counts_without_nan"].values()
)

missing_common_values = 1650418  # "(Missing)" count as shown in the HTML report

# The spurious "(Missing)" count equals the rows dropped by the 200-row limit
assert missing_common_values == (n_notnull - value_counts_values)

pandas-profiling version

4.5.1

Dependencies

ydata-profiling==4.5.1
pyspark==3.4.1
pandas==2.0.3
numpy==1.23.5

OS

macos

Checklist

  • [X] There is not yet another bug report for this issue in the issue tracker
  • [X] The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • [X] The issue has not been resolved by the entries listed under Common Issues.

danhosanee avatar Aug 24 '23 12:08 danhosanee

@fabclmnt @aquemy - This issue can be resolved by removing the limit(200) from describe_counts_spark.py. I know there are comments in the code about performance, but I think usability overrides performance concerns at this point, since the limit is producing incorrect output as shown above. Happy to create a PR with unit tests if that's acceptable?

As a further improvement, would you be open to a Spark version of freq_table?
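The intended accounting for such a frequency table can be sketched in a few lines: compute full value counts, keep the top n, and fold the tail into "Other values" rather than "(Missing)". This is a minimal pandas illustration of the idea only; the function name and signature are illustrative, not the actual ydata-profiling API, and a Spark implementation would do the same aggregation with groupBy/count before collecting the top rows.

```python
import pandas as pd

def freq_table_sketch(counts: pd.Series, n: int, total_nonnull: int) -> pd.DataFrame:
    """Illustrative top-n frequency table: the tail becomes 'Other values', never '(Missing)'."""
    top = counts.sort_values(ascending=False).head(n)
    other = total_nonnull - top.sum()  # everything beyond the top n non-null values
    rows = [(str(v), int(c)) for v, c in top.items()]
    if other > 0:
        rows.append(("Other values", int(other)))
    return pd.DataFrame(rows, columns=["value", "count"])

counts = pd.Series({"a": 5, "b": 3, "c": 1, "d": 1})
tbl = freq_table_sketch(counts, n=2, total_nonnull=10)
print(tbl)
```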

danhosanee avatar Jan 28 '24 05:01 danhosanee