missing_percent unexpected output after filtering all rows
Hi,
I’ve encountered an issue with the missing_percent check when a filter excludes all rows from the dataset. In this scenario, the check unexpectedly fails.
Here’s a minimal reproducible example:
from pyspark.sql import SparkSession
from soda.scan import Scan

spark = SparkSession.builder.appName("SodaScanTest").getOrCreate()

data = [
    (1, "Alice", 29),
    (2, "Bob", 25),
    (3, "Charlie", None),
]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)
df.createOrReplaceTempView("people")

scan = Scan()
scan.set_scan_definition_name("soda_scan_test")
scan.set_data_source_name("spark_df")
scan.add_spark_session(spark)
scan.set_verbose(True)
scan.add_sodacl_yaml_str("""
checks for people:
  - missing_percent(age):
      fail: when < 100
      filter: name = 'Diana'
""")

scan.execute()

if scan.has_check_fails():
    print(scan.get_logs_text())
    print("Scan failed!")
else:
    print("Scan succeeded!")

spark.stop()
Observed output:
INFO | 1/1 check FAILED:
INFO | people in spark_df
INFO | missing_percent(age) fail when < 100 [FAILED]
INFO | check_value: 0.0
INFO | row_count: 0
INFO | missing_count: 0
Scan failed!
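The failure seems to come from the check value defaulting to 0.0 when no rows survive the filter, which then trips the fail when < 100 condition. A tiny sketch of that arithmetic as I understand it (my assumption, not soda-core's actual code path):

# Assumed evaluation when the check-level filter matches no rows
# (hypothetical sketch, not the real soda-core implementation):
row_count = 0
missing_count = 0
missing_percent = (missing_count / row_count * 100) if row_count else 0.0
print(missing_percent)        # 0.0
print(missing_percent < 100)  # True -> "fail when < 100" triggers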
Expected behavior:
If the filter excludes all rows (i.e., row_count: 0), I would expect the check to pass, since there are no non-missing values for the filtered rows. Failing the check in this case seems unintuitive.
Is this the intended behavior? If not, could the check be adjusted to pass when all rows are filtered out?
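In the meantime, a possible workaround (just a sketch, assuming the check-level filter predicate can be duplicated in plain Spark SQL outside of SodaCL) is to pre-count the filtered rows and only execute the scan when at least one row matches:

# Hypothetical workaround: duplicate the check-level filter in Spark SQL
# and skip the soda-core check when it matches no rows.
filtered_row_count = spark.sql(
    "SELECT COUNT(*) FROM people WHERE name = 'Diana'"
).first()[0]

if filtered_row_count == 0:
    # Nothing to validate: treat the check as passed instead of running the scan.
    print("Filter matches no rows; treating missing_percent(age) as passed.")
else:
    scan.execute()
    print("Scan failed!" if scan.has_check_fails() else "Scan succeeded!")

This duplicates the filter expression in two places, though, so a native fix (or a way to configure the zero-row behavior) would be much nicer.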
Thank you!
CLOUD-9199