Bug in reading pandas df columns
Soda checks for pandas df columns fail if the column names are not lowercase or have any special characters in them.
SODA-1392
Thank you for creating the issue @aayush16. Can you please share a reproducible example?
We use datafusion for the SQL operations in pandas/dask. Based on this issue in datafusion, adding double quotes (") might solve the uppercase and special characters issue.
Let me know if adding quotes helps.
The following example having uppercase column name works with quotes ("EMAIL");
import pandas as pd
from soda.scan import Scan
# Create soda scan object
scan = Scan()
scan.set_scan_definition_name("test")
scan.set_data_source_name("dask")
# Create an artificial pandas dataframe
df_employee = pd.DataFrame({"EMAIL": ["[email protected]", "[email protected]", "[email protected]"]})
# Add pandas dataframe to scan and assign a dataset name to refer from checks yaml
scan.add_pandas_dataframe(dataset_name="employee", pandas_df=df_employee)
checks = """
checks for employee:
- missing_count("EMAIL") = 0
"""
scan.add_sodacl_yaml_str(checks) scan.execute()
I couldn't replicate the problem with the capital letters either.
But if the name of the column has a middle dash I receive the following error:
ERROR:soda.scan:[23:29:32] Query execution error in dask.employee.aggregation[0]: SchemaError(FieldNotFound { field: Column { relation: None, name: "employee.first-name" }, valid_fields: Some([Column { relation: Some("employee"), name: "first-name" }]) })
SELECT
COUNT(CASE WHEN "first-name" IS NULL THEN 1 END)
FROM employee
ERROR:soda.scan: | SchemaError(FieldNotFound { field: Column { relation: None, name: "employee.first-name" }, valid_fields: Some([Column { relation: Some("employee"), name: "first-name" }]) })
ERROR:soda.scan:[23:29:32] Metrics 'missing_count' were not computed for check 'missing_count("first-name") = 0'
INFO:soda.scan:[23:29:32] Scan summary:
INFO:soda.scan:[23:29:32] 1/1 check NOT EVALUATED:
INFO:soda.scan:[23:29:32] employee in dask
INFO:soda.scan:[23:29:32] missing_count("first-name") = 0 [NOT EVALUATED]
INFO:soda.scan:[23:29:32] check_value: None
INFO:soda.scan:[23:29:32] 1 checks not evaluated.
INFO:soda.scan:[23:29:32] 2 errors.
INFO:soda.scan:[23:29:32] Oops! 2 errors. 0 failures. 0 warnings. 0 pass.
INFO:soda.scan:ERRORS:
ERROR:soda.scan:[23:29:32] Query execution error in dask.employee.aggregation[0]: SchemaError(FieldNotFound { field: Column { relation: None, name: "employee.first-name" }, valid_fields: Some([Column { relation: Some("employee"), name: "first-name" }]) })
SELECT
COUNT(CASE WHEN "first-name" IS NULL THEN 1 END)
FROM employee
ERROR:soda.scan: | SchemaError(FieldNotFound { field: Column { relation: None, name: "employee.first-name" }, valid_fields: Some([Column { relation: Some("employee"), name: "first-name" }]) })
ERROR:soda.scan:[23:29:32] Metrics 'missing_count' were not computed for check 'missing_count("first-name") = 0'
The check I'm using:
import pandas as pd
from soda.scan import Scan
# Create soda scan object
scan = Scan()
scan.set_scan_definition_name("test")
scan.set_data_source_name("dask")
# Create an artificial pandas dataframe
df_employee = pd.DataFrame({
"first-name":["people1", "people2", "people3"]
})
# Add pandas dataframe to scan and assign a dataset name to refer from checks yaml
scan.add_pandas_dataframe(dataset_name="employee", pandas_df=df_employee)
checks = """
checks for employee:
- missing_count("first-name") = 0
"""
scan.add_sodacl_yaml_str(checks)
scan.execute()
If I change first-name to first_name the check pass.
Details
OS: Window OS Version: Python 3.10.8 Related libraries:
- dask==2023.1.1
- dask-sql==2023.2.0
- numpy==1.24.2
- pandas==1.5.3
- soda-core==3.0.22
- soda-core-pandas-dask==3.0.22