missing_percent unexpected output after filtering all rows
Hi,
I’ve encountered an issue with the missing_percent check when a filter excludes all rows from the dataset. In this scenario, the check unexpectedly fails.
Here’s a minimal reproducible example:
from pyspark.sql import SparkSession
from soda.scan import Scan

spark = SparkSession.builder.appName("SodaScanTest").getOrCreate()

data = [
    (1, "Alice", 29),
    (2, "Bob", 25),
    (3, "Charlie", None),
]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)
df.createOrReplaceTempView("people")

scan = Scan()
scan.set_scan_definition_name("soda_scan_test")
scan.set_data_source_name("spark_df")
scan.add_spark_session(spark)
scan.set_verbose(True)
scan.add_sodacl_yaml_str("""
checks for people:
  - missing_percent(age):
      fail: when < 100
      filter: name = 'Diana'
""")

scan.execute()

if scan.has_check_fails():
    print(scan.get_logs_text())
    print("Scan failed!")
else:
    print("Scan succeeded!")

spark.stop()
Observed output:
INFO | 1/1 check FAILED:
INFO | people in spark_df
INFO | missing_percent(age) fail when < 100 [FAILED]
INFO | check_value: 0.0
INFO | row_count: 0
INFO | missing_count: 0
Scan failed!
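The failure seems to come from the check value defaulting to 0.0 when no rows survive the filter, which then trips the fail when < 100 condition. A tiny sketch of that arithmetic as I understand it (my assumption, not soda-core's actual code path):

# Assumed evaluation when the check-level filter matches no rows
# (hypothetical sketch, not the real soda-core implementation):
row_count = 0
missing_count = 0
missing_percent = (missing_count / row_count * 100) if row_count else 0.0
print(missing_percent)        # 0.0
print(missing_percent < 100)  # True -> "fail when < 100" triggers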
Expected behavior:
If the filter excludes all rows (i.e., row_count: 0), I would expect the check to pass, since there are no non-missing values for the filtered rows. Failing the check in this case seems unintuitive.
Is this the intended behavior? If not, could the check be adjusted to pass when all rows are filtered out?
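In the meantime, a possible workaround (just a sketch, assuming the check-level filter predicate can be duplicated in plain Spark SQL outside of SodaCL) is to pre-count the filtered rows and only execute the scan when at least one row matches:

# Hypothetical workaround: duplicate the check-level filter in Spark SQL
# and skip the soda-core check when it matches no rows.
filtered_row_count = spark.sql(
    "SELECT COUNT(*) FROM people WHERE name = 'Diana'"
).first()[0]

if filtered_row_count == 0:
    # Nothing to validate: treat the check as passed instead of running the scan.
    print("Filter matches no rows; treating missing_percent(age) as passed.")
else:
    scan.execute()
    print("Scan failed!" if scan.has_check_fails() else "Scan succeeded!")

This duplicates the filter expression in two places, though, so a native fix (or a way to configure the zero-row behavior) would be much nicer.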
Thank you!
CLOUD-9199