SDV
Add validate method to SingleTableMetadata
Problem Description
As a user, I want to be able to validate whether my metadata is formatted correctly.
Expected behavior
- Add a `validate` method to `SingleTableMetadata`.
- Validation consists of validating three separate parts of the metadata. The full details are in the "Additional context" section.
  - Validating the columns
  - Validating the keys
  - Validating the constraints
- If the metadata is not valid: raise an `InvalidMetadataError` with a description of all the errors found.
>>> metadata.validate()
InvalidMetadataError: The metadata is not valid
Error: Invalid values ("pii") for datetime column "start_date".
Error: Invalid regex format string "[A-{6}" for text column "user_id"
Error: Unknown key value 'uuid'. Keys should be columns that exist in the table.
Error: A Unique constraint is being applied to column "user_id". This column is already a key for that table.
Error: Invalid increment value (0.5) in a FixedIncrements constraint. Increments must be positive integers.
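The "report all errors at once" behavior shown above could be sketched as follows. This is a minimal illustration, not the SDV implementation: the `InvalidMetadataError` class here is a local stand-in for the error named in this issue, and `raise_if_invalid` is a hypothetical helper.

```python
class InvalidMetadataError(Exception):
    """Local stand-in for the error type named in this issue (assumption)."""


def raise_if_invalid(errors):
    """Raise one InvalidMetadataError that lists every collected message."""
    if errors:
        lines = ["The metadata is not valid"]
        lines.extend(f"Error: {message}" for message in errors)
        raise InvalidMetadataError("\n".join(lines))


# Messages gathered by the column, key and constraint checks (hypothetical input).
collected = [
    'Invalid values ("pii") for datetime column "start_date".',
    "Unknown key value 'uuid'. Keys should be columns that exist in the table.",
]

try:
    raise_if_invalid(collected)
except InvalidMetadataError as error:
    report = str(error)
    print(report)
```

Collecting messages in a list and raising once at the end is what lets the user see every problem in a single run instead of fixing them one at a time.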
Additional context
- Column validation: Each `sdtype` has different validation rules. They are listed below.
  - `numerical`
    - Required attributes are: `representation`
    - Throw an error if any attributes besides `representation` are present
      Error: Invalid values ("pii") for numerical column "age".
  - `datetime`
    - Required attributes are: `datetime_format`
    - "datetime_format" must be a valid, parsable format string
      Error: Invalid datetime format string "%O" for datetime column "start_date"
    - There should be no other attributes present
      Error: Invalid values ("pii") for datetime column "start_date".
  - `categorical`
    - Required attributes are either: `order` or `order_by`
    - "order" and "order_by" cannot both be present. You can only have 0 or 1 of these attributes
      Error: Categorical column "education" has both an "order" and "order_by" attribute. Only 1 is allowed.
    - If present, "order_by" must be set to either "numerical_value" or "alphabetical"
      Error: Unknown ordering method "testing" provided for categorical column "education". Ordering method must be "numerical_value" or "alphabetical"
    - If present, "order" must be a list with 1 or more elements
      Error: Invalid order value provided for categorical column "education". The "order" must be a list with 1 or more elements.
    - No other attributes can be present
      Error: Invalid values ("pii") for categorical column "education".
  - `boolean`
    - No required attributes
    - Throw an error if any attributes are present
      Error: Invalid value ("pii") for boolean column "is_subscribed".
  - `text`
    - Required attributes are: `regex_format`
    - If "regex_format" is present, it must be a valid regex string that can be parsed
      Error: Invalid regex format string "[A-{6}" for text column "user_id"
    - Throw an error if any other attributes are present
      Error: Invalid values ("pii") for text column "user_id".
  - Real World (Semantic) Types (e.g. `phone_number`)
    - No required parameters
    - `pii` is an optional parameter
    - If "pii" exists but it is not True or False, throw an error
      Error: Invalid pii value provided for phone_number column "user_cell". The "pii" value must be set to True or False.
    - Throw an error if any other attributes are present
      Error: Invalid values ("datetime_format") for phone_number column "user_cell".
    - Raise a warning if the `sdtype` isn't fully supported.
      Warning: sdtype 'location' is not fully supported. The SDV will model this as a categorical variable.
      Warning: sdtype 'location' is not fully supported. The SDV will anonymize this column using random characters.
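The column rules above can be sketched as a single function that returns a list of messages. This is an illustration only: it assumes each column is a plain dict with an `"sdtype"` entry, and the names `ALLOWED_ATTRIBUTES`, `is_valid_datetime_format`, `is_valid_regex` and `validate_column` are hypothetical. The datetime check round-trips a sample date through the format, which is one possible way to detect bad directives.

```python
import re
from datetime import datetime

# Attributes permitted per sdtype, taken from the rules listed above.
ALLOWED_ATTRIBUTES = {
    "numerical": {"representation"},
    "datetime": {"datetime_format"},
    "categorical": {"order", "order_by"},
    "boolean": set(),
    "text": {"regex_format"},
}


def is_valid_datetime_format(fmt):
    """Round-trip a sample date through fmt; a bad directive raises ValueError."""
    if not isinstance(fmt, str):
        return False
    sample = datetime(2000, 1, 2, 3, 4, 5)
    try:
        datetime.strptime(sample.strftime(fmt), fmt)
    except ValueError:
        return False
    return True


def is_valid_regex(pattern):
    """Return True if pattern is a string that the re module can compile."""
    if not isinstance(pattern, str):
        return False
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


def validate_column(name, column):
    """Return a list of error messages for one column specification."""
    sdtype = column["sdtype"]
    attributes = {key: value for key, value in column.items() if key != "sdtype"}
    # Any sdtype not listed is treated as a semantic type, which only allows "pii".
    allowed = ALLOWED_ATTRIBUTES.get(sdtype, {"pii"})
    errors = []

    unexpected = sorted(set(attributes) - allowed)
    if unexpected:
        listed = ", ".join(f'"{key}"' for key in unexpected)
        errors.append(f'Invalid values ({listed}) for {sdtype} column "{name}".')

    if sdtype == "datetime" and "datetime_format" in attributes:
        fmt = attributes["datetime_format"]
        if not is_valid_datetime_format(fmt):
            errors.append(
                f'Invalid datetime format string "{fmt}" for datetime column "{name}"')

    if sdtype == "text" and "regex_format" in attributes:
        pattern = attributes["regex_format"]
        if not is_valid_regex(pattern):
            errors.append(
                f'Invalid regex format string "{pattern}" for text column "{name}"')

    if sdtype == "categorical":
        if "order" in attributes and "order_by" in attributes:
            errors.append(
                f'Categorical column "{name}" has both an "order" and "order_by" '
                'attribute. Only 1 is allowed.')
        order = attributes.get("order")
        if "order" in attributes and not (isinstance(order, list) and order):
            errors.append(
                f'Invalid order value provided for categorical column "{name}". '
                'The "order" must be a list with 1 or more elements.')

    if sdtype not in ALLOWED_ATTRIBUTES and "pii" in attributes:
        if not isinstance(attributes["pii"], bool):
            errors.append(
                f'Invalid pii value provided for {sdtype} column "{name}". '
                'The "pii" value must be set to True or False.')

    return errors


for message in validate_column("user_id", {"sdtype": "text", "regex_format": "[A-{6}"}):
    print("Error:", message)
```

The sketch leaves out the "order_by" value check, the required-attribute checks and the warning for partially supported sdtypes; they would follow the same pattern.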
- Key validation
  - "primary_key" must be a string or list of strings
  - "sequence_key" must be a string or list of strings
  - "alternate_keys" must be a list of strings or a nested list of strings
  - "sequence_index" must be a string
  - The strings must correspond to column names specified in the "columns" part of the metadata
    Error: Unknown key value 'uuid'. Keys should be columns that exist in the table.
  - "sequence_index" cannot be the same as "sequence_key"
    Error: sequence_index and sequence_key have the same value ('patient_id'). These columns must be different.
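A rough sketch of the key checks above, assuming the metadata is a plain dict with a `"columns"` mapping and optional key entries. The helper names `key_names` and `validate_keys` are hypothetical, and the wording of the type error is invented here because the issue does not specify one.

```python
def key_names(value, allow_nested=False):
    """Return the column names held in a key value, or None if the type is wrong."""
    if isinstance(value, str):
        return [value]
    if not isinstance(value, list):
        return None
    names = []
    for item in value:
        if isinstance(item, str):
            names.append(item)
        elif (allow_nested and isinstance(item, list)
              and all(isinstance(name, str) for name in item)):
            names.extend(item)
        else:
            return None
    return names


def validate_keys(metadata):
    """Return a list of error messages for the key entries of the metadata."""
    columns = metadata.get("columns", {})
    errors = []
    for key in ("primary_key", "sequence_key", "alternate_keys", "sequence_index"):
        if key not in metadata:
            continue
        value = metadata[key]
        if key == "sequence_index":
            names = [value] if isinstance(value, str) else None
        elif key == "alternate_keys":
            names = key_names(value, allow_nested=True) if isinstance(value, list) else None
        else:
            names = key_names(value)
        if names is None:
            # Hypothetical wording: the issue gives no message for type errors.
            errors.append(f'Invalid type for "{key}".')
            continue
        for name in names:
            if name not in columns:
                errors.append(
                    f"Unknown key value '{name}'. "
                    "Keys should be columns that exist in the table.")

    index = metadata.get("sequence_index")
    if index is not None and index == metadata.get("sequence_key"):
        errors.append(
            f"sequence_index and sequence_key have the same value ('{index}'). "
            "These columns must be different.")
    return errors


example = {
    "columns": {"patient_id": {"sdtype": "text"}, "visit_date": {"sdtype": "datetime"}},
    "primary_key": "uuid",
    "sequence_key": "patient_id",
    "sequence_index": "patient_id",
}
for message in validate_keys(example):
    print("Error:", message)
```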
- Constraint validation
  - Use each constraint's `_validate_inputs` method and surface those errors (#878)
  - Still check for the following errors:
    - `Unique`
      - Each of the column names in "column_names" must be a column that is present in the "columns" specification
        Error: A Unique constraint is being applied to invalid column names ("age", "weight"). The columns must exist in the table.
      - "column_names" must include at least 1 column that is NOT a primary key or alternate key. Primary keys and alternate keys are already guaranteed to be unique, so there's no need to add them as a constraint.
        Error: A Unique constraint is being applied to column "age". This column is already a key for that table.
    - `FixedCombinations`
      - Each of the column names in "column_names" must be a column that is present in the "columns" specification
        Error: A FixedCombinations constraint is being applied to invalid column names ("C", "D"). The columns must exist in the table.
    - `Inequality`
      - The strings in "high_column_name" and "low_column_name" must each be a column that is present in the "columns" specification
        Error: An Inequality constraint is being applied to invalid column names ("C", "D"). The columns must exist in the table.
      - Both high and low columns must be either type "numerical" or type "datetime"
        Error: An Inequality constraint is being applied to mismatched sdtypes ("C", "D"). Both columns must be either numerical or datetime.
    - `ScalarInequality`
      - "column_name" must refer to a column in the table
        Error: A ScalarInequality constraint is being applied to invalid column names ("C"). The columns must exist in the table.
      - The "value" must make sense based on the column type
        - If the column is "numerical", then "value" must be an int or float
        - If the column is "datetime", then "value" must be a datetime string of the right format
        - No other types are compatible
        Error: A ScalarInequality constraint is being applied to mismatched sdtypes. Numerical columns must be compared to integer or float values. Datetimes column must be compared to datetime strings.
    - `Range`
      - The strings in each of the column names must be a column that is present in the "columns" specification
        Error: A Range constraint is being applied to invalid column names ("C", "D"). The columns must exist in the table.
      - All columns must be either type "numerical" or type "datetime"
        Error: A Range constraint is being applied to mismatched sdtypes ("C", "D", "E"). All columns must be either numerical or datetime.
    - `ScalarRange`
      - "column_name" must refer to a column in the table
        Error: A ScalarRange constraint is being applied to invalid column names ("C"). The columns must exist in the table.
      - The high and low values must make sense based on the column type
        - If the column is "numerical", then the values must be floats/ints
        - If the column is "datetime", then the values must be datetime strings of the right format
        - No other types are compatible
        Error: A ScalarRange constraint is being applied to mismatched sdtypes. Numerical columns must be compared to integer or float values. Datetimes column must be compared to datetime strings.
    - `Positive`
      - "column_name" must refer to a column in the table
        Error: A Positive constraint is being applied to invalid column names ("C"). The columns must exist in the table.
      - The column must be of type "numerical"
        Error: A Positive constraint is being applied to an invalid column ("C"). This constraint is only defined for numerical columns.
    - `Negative`
      - "column_name" must refer to a column in the table
        Error: A Negative constraint is being applied to invalid column names ("C"). The columns must exist in the table.
      - The column must be of type "numerical"
        Error: A Negative constraint is being applied to an invalid column ("C"). This constraint is only defined for numerical columns.
    - `FixedIncrements`
      - Column name should refer to a column defined in the metadata
        Error: A FixedIncrements constraint is being applied to invalid column names ("C"). The columns must exist in the table.
    - `OneHotEncoding`
      - Column names must be valid columns (present in the "columns" part of the metadata)
        Error: A OneHotEncoding constraint is being applied to invalid column names ("C", "D", "E"). The columns must exist in the table.
    - `CustomConstraint`
      - Column names must be valid columns (present in the "columns" part of the metadata)
        Error: A <module>.<name> constraint is being applied to invalid column names ("C", "D"). The columns must exist in the table.
    - Misc
      - If the constraint isn't found, throw an error:
        Error: Invalid constraints ('Other').
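A few of the constraint checks above can be sketched as follows. Everything here is an assumption made for illustration: constraints are modelled as dicts with a `"constraint_name"` entry, only five constraints are covered, and `COLUMN_PARAMETERS`, `quote` and `validate_constraint` are hypothetical names. The Inequality rule is read here as "both columns numerical or both datetime".

```python
# Which parameters of each constraint hold column names (subset, for illustration).
COLUMN_PARAMETERS = {
    "Unique": ["column_names"],
    "FixedCombinations": ["column_names"],
    "Inequality": ["low_column_name", "high_column_name"],
    "Positive": ["column_name"],
    "Negative": ["column_name"],
}


def quote(names):
    """Format column names the way the example messages do: "a", "b"."""
    return ", ".join(f'"{name}"' for name in names)


def validate_constraint(constraint, columns, keys=()):
    """Return a list of error messages for one constraint definition."""
    name = constraint["constraint_name"]
    if name not in COLUMN_PARAMETERS:
        return [f"Invalid constraints ('{name}')."]

    article = "An" if name == "Inequality" else "A"
    names = []
    for parameter in COLUMN_PARAMETERS[name]:
        value = constraint[parameter]
        names.extend([value] if isinstance(value, str) else value)

    missing = [column for column in names if column not in columns]
    if missing:
        return [
            f"{article} {name} constraint is being applied to invalid column names "
            f"({quote(missing)}). The columns must exist in the table."
        ]

    errors = []
    sdtypes = [columns[column]["sdtype"] for column in names]
    if name == "Unique" and names and all(column in keys for column in names):
        errors.append(
            f"A Unique constraint is being applied to column {quote(names)}. "
            "This column is already a key for that table.")
    if name == "Inequality":
        # Interpretation: both columns share one sdtype, numerical or datetime.
        if sdtypes[0] != sdtypes[1] or sdtypes[0] not in ("numerical", "datetime"):
            errors.append(
                "An Inequality constraint is being applied to mismatched sdtypes "
                f"({quote(names)}). Both columns must be either numerical or datetime.")
    if name in ("Positive", "Negative") and sdtypes[0] != "numerical":
        errors.append(
            f"A {name} constraint is being applied to an invalid column "
            f"({quote(names)}). This constraint is only defined for numerical columns.")
    return errors


table_columns = {
    "age": {"sdtype": "numerical"},
    "start": {"sdtype": "datetime"},
    "user_id": {"sdtype": "text"},
}
print(validate_constraint({"constraint_name": "Other"}, table_columns))
```

Returning early on missing columns keeps the sdtype checks from failing on names that do not exist in the table.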