Add validate method to SingleTableMetadata

Open amontanez24 opened this issue 2 years ago • 0 comments

Problem Description

As a user, I want to be able to validate whether my metadata is formatted correctly.

Expected behavior

  • Add validate method
  • Validation consists of validating three separate parts of the metadata. The full details are in the Additional context section.
    1. Validating the columns
    2. Validating the keys
    3. Validating the constraints
  • If the metadata is not valid, raise an InvalidMetadataError with a description of all the errors found.
>>> metadata.validate()
InvalidMetadataError: The metadata is not valid

Error: Invalid values ("pii") for datetime column "start_date".
Error: Invalid regex format string "[A-{6}" for text column "user_id"
Error: Unknown key value 'uuid'. Keys should be columns that exist in the table.
Error: A Unique constraint is being applied to column "user_id". This column is already a key for that table.
Error: Invalid increment value (0.5) in a FixedIncrements constraint. Increments must be positive integers.
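
A minimal sketch of how the "collect everything, then raise once" behavior above could work. The InvalidMetadataError class here is a local stand-in for the one named in this issue, and collect_errors / raise_if_invalid are hypothetical helper names, not part of the SDV API.

```python
class InvalidMetadataError(Exception):
    """Local stand-in for the error type named in this issue."""


def collect_errors(checks):
    """Run each check (a callable returning a list of messages) and merge the results."""
    errors = []
    for check in checks:
        errors.extend(check())
    return errors


def raise_if_invalid(errors):
    """Raise a single InvalidMetadataError listing every collected message."""
    if errors:
        details = "\n".join(f"Error: {message}" for message in errors)
        raise InvalidMetadataError(f"The metadata is not valid\n\n{details}")
```

Collecting messages first, rather than raising on the first failure, is what lets the user see every problem in one call.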

Additional context

  • Column validation: Each sdtype has different validation rules. They are listed below.

    • numerical
      • Required attributes are: representation
      • Throw an error if any attributes besides representation are present Error: Invalid values ("pii") for numerical column "age".
    • datetime
      • Required attributes are: datetime_format
      • "datetime_format" must be a valid, parsable format string Error: Invalid datetime format string "%O" for datetime column "start_date"
      • There should be no other attributes present Error: Invalid values ("pii") for datetime column "start_date".
    • categorical
      • Required attributes are either: order or order_by
      • "order" and "order_by" cannot both be present. You can only have 0 or 1 of these attributes Error: Categorical column "education" has both an "order" and "order_by" attribute. Only 1 is allowed.
      • If present, "order_by" must be either set to "numerical_value" or "alphabetical" Error: Unknown ordering method "testing" provided for categorical column "education". Ordering method must be "numerical_value" or "alphabetical"
      • If present, "order" must be a list with 1 or more elements Error: Invalid order value provided for categorical column "education". The "order" must be a list with 1 or more elements.
      • No other attributes can be present Error: Invalid values ("pii") for categorical column "education".
    • boolean
      • No required attributes
      • Throw an error if any attributes are present Error: Invalid value ("pii") for boolean column "is_subscribed".
    • text
      • Required attributes are: regex_format
      • "regex_format" is present but the string isn't a valid regex string that can be parsed Error: Invalid regex format string "[A-{6}" for text column "user_id"
      • Throw an error if any other attributes are present Error: Invalid values ("pii") for text column "user_id".
    • Real World (Semantic) Types (e.g. phone_number)
      • No required parameters
      • pii is an optional parameter
      • If "pii" exists but it is not True or False, throw an error Error: Invalid pii value provided for phone_number column "user_cell". The "pii" value must be set to True or False.
      • Throw an error if any other attributes are present Error: Invalid values ("datetime_format") for phone_number column "user_cell".
      • Raise a warning if the sdtype isn't fully supported. Warning: sdtype 'location' is not fully supported. The SDV will model this as a categorical variable. Warning: sdtype 'location' is not fully supported. The SDV will anonymize this column using random characters.
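
A rough sketch of some of the column rules above, assuming each column is described by a plain dict such as {"sdtype": "text", "regex_format": "..."}. The function name validate_column and the lookup tables are hypothetical. The sketch covers unexpected attributes, the regex check, the categorical ordering rules, and the pii check. It does not check required attributes, datetime format strings, or the unsupported-sdtype warnings.

```python
import re

# Attributes each sdtype may carry, per the rules above (assumed dict layout).
ALLOWED_ATTRIBUTES = {
    "numerical": {"representation"},
    "datetime": {"datetime_format"},
    "categorical": {"order", "order_by"},
    "boolean": set(),
    "text": {"regex_format"},
}
ORDER_BY_VALUES = ("numerical_value", "alphabetical")


def validate_column(name, column):
    """Return a list of error messages for one column definition."""
    errors = []
    sdtype = column.get("sdtype")
    # Any sdtype not listed above is treated as a semantic type (e.g. phone_number).
    is_semantic = sdtype not in ALLOWED_ATTRIBUTES
    allowed = {"pii"} if is_semantic else ALLOWED_ATTRIBUTES[sdtype]

    extra = sorted(set(column) - allowed - {"sdtype"})
    if extra:
        joined = ", ".join(f'"{key}"' for key in extra)
        errors.append(f'Invalid values ({joined}) for {sdtype} column "{name}".')

    if sdtype == "text" and "regex_format" in column:
        pattern = column["regex_format"]
        try:
            re.compile(pattern)
        except re.error:
            errors.append(
                f'Invalid regex format string "{pattern}" for text column "{name}"'
            )

    if sdtype == "categorical":
        if "order" in column and "order_by" in column:
            errors.append(
                f'Categorical column "{name}" has both an "order" and '
                '"order_by" attribute. Only 1 is allowed.'
            )
        if "order_by" in column and column["order_by"] not in ORDER_BY_VALUES:
            method = column["order_by"]
            errors.append(
                f'Unknown ordering method "{method}" provided for categorical '
                f'column "{name}". Ordering method must be "numerical_value" '
                'or "alphabetical"'
            )
        order = column.get("order")
        if "order" in column and not (isinstance(order, list) and order):
            errors.append(
                f'Invalid order value provided for categorical column "{name}". '
                'The "order" must be a list with 1 or more elements.'
            )

    if is_semantic and "pii" in column and not isinstance(column["pii"], bool):
        errors.append(
            f'Invalid pii value provided for {sdtype} column "{name}". '
            'The "pii" value must be set to True or False.'
        )
    return errors
```

Each rule appends to the same list instead of returning early, so a column with several problems reports all of them.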
  • Key validation

    • "primary_key" must be a string or list of strings
    • "sequence_key" must be a string or list of strings
    • "alternate_keys" must be a list of strings or a nested list of strings
    • "sequence_index" must be a string
    • The strings must correspond to the column names as specified in the other part of the Metadata Error: Unknown key value 'uuid'. Keys should be columns that exist in the table.
    • "sequence_index" cannot be the same as "sequence_key" Error: sequence_index and sequence_key have the same value ('patient_id'). These columns must be different.
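
A small sketch of the two key checks that produce the error messages above, assuming the metadata is a plain dict with a "columns" mapping plus the optional key fields. validate_keys is a hypothetical name. The sketch also assumes the type rules (string, list of strings, nested list) have already passed, so it does not re-check them.

```python
KEY_FIELDS = ("primary_key", "sequence_key", "alternate_keys", "sequence_index")


def _flatten(value):
    """Turn a string, list of strings, or nested list of strings into a flat list."""
    if isinstance(value, str):
        return [value]
    flat = []
    for item in value:
        flat.extend(_flatten(item))
    return flat


def validate_keys(metadata):
    """Return error messages for the key fields of a metadata dict."""
    errors = []
    columns = set(metadata.get("columns", {}))

    # Every key name must refer to a column defined in the metadata.
    for field in KEY_FIELDS:
        if field not in metadata:
            continue
        for key in _flatten(metadata[field]):
            if key not in columns:
                errors.append(
                    f"Unknown key value '{key}'. "
                    "Keys should be columns that exist in the table."
                )

    # The sequence index and sequence key must be different columns.
    index = metadata.get("sequence_index")
    if index is not None and index == metadata.get("sequence_key"):
        errors.append(
            f"sequence_index and sequence_key have the same value ('{index}'). "
            "These columns must be different."
        )
    return errors
```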
  • Constraint validation

    • Use each constraint's _validate_inputs method and surface those errors #878
    • Still check for the following errors:
    • Unique
      • Each of the column names in "column_names" must be a column that is present in the "columns" specification Error: A Unique constraint is being applied to invalid column names ("age", "weight"). The columns must exist in the table.
      • "column_names" must include at least 1 column that is NOT a primary key or alternate key. Primary keys and alternate keys will already be guaranteed to be unique, so there's no need to add it in as a constraint. Error: A Unique constraint is being applied to column "age". This column is already a key for that table.
    • FixedCombinations
      • Each of the column names in "column_names" must be a column that is present in the "columns" specification Error: A FixedCombinations constraint is being applied to invalid column names ("C", "D"). The columns must exist in the table.
    • Inequality
      • The string in "high_column_name" and "low_column_name" must be a column that is present in the "columns" specification Error: An Inequality constraint is being applied to invalid column names ("C", "D"). The columns must exist in the table.
      • Both high and low columns must be either type "numerical" or type "datetime" Error: An Inequality constraint is being applied to mismatched sdtypes ("C", "D"). Both columns must be either numerical or datetime.
    • ScalarInequality
      • "column_name" must refer to a column in the table Error: A ScalarInequality constraint is being applied to invalid column names ("C"). The columns must exist in the table.
      • The "value" must make sense based on the column type
        • If the column is "numerical", then "value" must be an int or float
        • If the column is "datetime", then "value" must be a datetime string of the right format
        • No other types are compatible Error: A ScalarInequality constraint is being applied to mismatched sdtypes. Numerical columns must be compared to integer or float values. Datetimes column must be compared to datetime strings.
    • Range
      • The strings in each of the column names must be a column that is present in the "columns" specification Error: A Range constraint is being applied to invalid column names ("C", "D"). The columns must exist in the table.
      • All columns must be either type "numerical" or type "datetime" Error: A Range constraint is being applied to mismatched sdtypes ("C", "D", "E"). All columns must be either numerical or datetime.
    • ScalarRange
      • "column_name" must refer to a column in the table Error: A ScalarRange constraint is being applied to invalid column names ("C"). The columns must exist in the table.
      • The high and low values must make sense based on the column type
        • If the column is "numerical", then the values must be floats/ints
        • If the column is "datetime", then the values must be datetime strings of the right format
        • No other types are compatible Error: A ScalarRange constraint is being applied to mismatched sdtypes. Numerical columns must be compared to integer or float values. Datetimes column must be compared to datetime strings.
    • Positive
      • "column_name" must refer to a column in the table Error: A Positive constraint is being applied to invalid column names ("C"). The columns must exist in the table.
      • The column must be of type "numerical" Error: A Positive constraint is being applied to an invalid column ("C"). This constraint is only defined for numerical columns.
    • Negative
      • "column_name" must refer to a column in the table Error: A Negative constraint is being applied to invalid column names ("C"). The columns must exist in the table.
      • The column must be of type "numerical" Error: A Negative constraint is being applied to an invalid column ("C"). This constraint is only defined for numerical columns.
    • FixedIncrements
      • Column name should refer to a column defined in the metadata Error: A FixedIncrements constraint is being applied to invalid column names ("C"). The columns must exist in the table.
    • OneHotEncoding
      • Column names must be valid columns (present in the "columns" part of the metadata) Error: A OneHotEncoding constraint is being applied to invalid column names ("C", "D", "E"). The columns must exist in the table.
    • CustomConstraint
      • Column names must be valid columns (present in the "columns" part of the metadata) Error: A <module>.<name> constraint is being applied to invalid column names ("C", "D"). The columns must exist in the table.
    • Misc
      • If the constraint isn't found, throw an error: Error: Invalid constraints ('Other').
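
As an illustration of the constraint rules above, here is a sketch of the Positive / Negative checks and the unknown-constraint check. The function names and the KNOWN_CONSTRAINTS set are hypothetical, columns is assumed to map column names to dicts with an "sdtype" entry, and custom constraints (named <module>.<name>) are left out of the lookup for simplicity.

```python
# Built-in constraint names listed in this issue (custom constraints not handled here).
KNOWN_CONSTRAINTS = {
    "Unique", "FixedCombinations", "Inequality", "ScalarInequality", "Range",
    "ScalarRange", "Positive", "Negative", "FixedIncrements", "OneHotEncoding",
}


def validate_constraint_name(constraint_name):
    """Report a constraint name that is not one of the known constraints."""
    if constraint_name not in KNOWN_CONSTRAINTS:
        return [f"Invalid constraints ('{constraint_name}')."]
    return []


def validate_sign_constraint(constraint_name, column_name, columns):
    """Check a Positive or Negative constraint against the column definitions."""
    if column_name not in columns:
        return [
            f"A {constraint_name} constraint is being applied to invalid column "
            f'names ("{column_name}"). The columns must exist in the table.'
        ]
    if columns[column_name].get("sdtype") != "numerical":
        return [
            f"A {constraint_name} constraint is being applied to an invalid column "
            f'("{column_name}"). This constraint is only defined for numerical columns.'
        ]
    return []
```

The existence check runs first because the sdtype check is only meaningful for a column that is actually defined.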

amontanez24 • Jul 08 '22 00:07