soda-core icon indicating copy to clipboard operation
soda-core copied to clipboard

Bug in reading pandas df columns

Open aayush16 opened this issue 3 years ago • 3 comments

Soda checks for pandas df columns fail if the column names are not lowercase or have any special characters in them.

aayush16 avatar Jan 25 '23 21:01 aayush16

SODA-1392

jmarien avatar Jan 25 '23 21:01 jmarien

Thank you for creating the issue @aayush16. Can you please share a reproducible example?

We use datafusion for the SQL operations in pandas/dask. Based on this issue in datafusion, adding double quotes (") might solve the uppercase and special characters issue.

Let me know if adding quotes helps.

The following example having uppercase column name works with quotes ("EMAIL");

import pandas as pd
from soda.scan import Scan

# Create soda scan object
scan = Scan()
scan.set_scan_definition_name("test")
scan.set_data_source_name("dask")

# Create an artificial pandas dataframe
df_employee = pd.DataFrame({"EMAIL": ["[email protected]", "[email protected]", "[email protected]"]})

# Add pandas dataframe to scan and assign a dataset name to refer from checks yaml
scan.add_pandas_dataframe(dataset_name="employee", pandas_df=df_employee)

checks = """
checks for employee:
  - missing_count("EMAIL") = 0
 """ 

scan.add_sodacl_yaml_str(checks) scan.execute()

baturayo avatar Jan 29 '23 23:01 baturayo

I couldn't replicate the problem with the capital letters either.

But if the name of the column has a middle dash I receive the following error:

ERROR:soda.scan:[23:29:32] Query execution error in dask.employee.aggregation[0]: SchemaError(FieldNotFound { field: Column { relation: None, name: "employee.first-name" }, valid_fields: Some([Column { relation: Some("employee"), name: "first-name" }]) })
SELECT
  COUNT(CASE WHEN "first-name" IS NULL THEN 1 END)
FROM employee
ERROR:soda.scan:  | SchemaError(FieldNotFound { field: Column { relation: None, name: "employee.first-name" }, valid_fields: Some([Column { relation: Some("employee"), name: "first-name" }]) })
ERROR:soda.scan:[23:29:32] Metrics 'missing_count' were not computed for check 'missing_count("first-name") = 0'
INFO:soda.scan:[23:29:32] Scan summary:
INFO:soda.scan:[23:29:32] 1/1 check NOT EVALUATED:
INFO:soda.scan:[23:29:32]     employee in dask
INFO:soda.scan:[23:29:32]       missing_count("first-name") = 0 [NOT EVALUATED]
INFO:soda.scan:[23:29:32]         check_value: None
INFO:soda.scan:[23:29:32] 1 checks not evaluated.
INFO:soda.scan:[23:29:32] 2 errors.
INFO:soda.scan:[23:29:32] Oops! 2 errors. 0 failures. 0 warnings. 0 pass.
INFO:soda.scan:ERRORS:
ERROR:soda.scan:[23:29:32] Query execution error in dask.employee.aggregation[0]: SchemaError(FieldNotFound { field: Column { relation: None, name: "employee.first-name" }, valid_fields: Some([Column { relation: Some("employee"), name: "first-name" }]) })
SELECT
  COUNT(CASE WHEN "first-name" IS NULL THEN 1 END)
FROM employee
ERROR:soda.scan:  | SchemaError(FieldNotFound { field: Column { relation: None, name: "employee.first-name" }, valid_fields: Some([Column { relation: Some("employee"), name: "first-name" }]) })
ERROR:soda.scan:[23:29:32] Metrics 'missing_count' were not computed for check 'missing_count("first-name") = 0'

The check I'm using:

import pandas as pd
from soda.scan import Scan

# Create soda scan object
scan = Scan()
scan.set_scan_definition_name("test")
scan.set_data_source_name("dask")

# Create an artificial pandas dataframe
df_employee = pd.DataFrame({
  "first-name":["people1", "people2", "people3"]
  })

# Add pandas dataframe to scan and assign a dataset name to refer from checks yaml
scan.add_pandas_dataframe(dataset_name="employee", pandas_df=df_employee)

checks = """
checks for employee:
  - missing_count("first-name") = 0
 """ 

scan.add_sodacl_yaml_str(checks)
scan.execute()

If I change first-name to first_name the check pass.

Details

OS: Window OS Version: Python 3.10.8 Related libraries:

  • dask==2023.1.1
  • dask-sql==2023.2.0
  • numpy==1.24.2
  • pandas==1.5.3
  • soda-core==3.0.22
  • soda-core-pandas-dask==3.0.22

geeklogbook avatar Feb 10 '23 02:02 geeklogbook