datacontract-cli
datacontract-cli copied to clipboard
Custom Quality Rules Not Included in Great Expectations Export
When using the datacontract export datacontract.yaml --format great-expectations command, the generated Great Expectations schema only includes column order and type checks. However, it does not incorporate the custom quality rules defined under the quality section for individual fields.
For example, given the following data contract fields section:
fields: id: description: Unique identifier for each alert. type: string required: true primaryKey: true quality: - type: custom engine: great-expectations description: "Uniqueness Check on id" implementation: expectation_type: expect_column_values_to_be_unique meta: notes: "Ensures that each id is unique." - type: custom engine: great-expectations description: "Type Check for id" implementation: expectation_type: expect_column_values_to_be_of_type kwargs: column: id type_: string meta: notes: "Checks that id is of type string." type: description: The type of alert that has fired. type: string required: true enum: ["A", "B", "C", "D", "E"] quality: - type: custom engine: great-expectations description: "Accepted Values for type" implementation: expectation_type: expect_column_values_to_be_in_set kwargs: value_set: ["A", "B", "C", "D", "E"] meta: notes: "Ensures that type values are one of the accepted values."
The current output only includes type checks and ordered column validation:
{ "expectations": [ { "expectation_type": "expect_table_columns_to_match_ordered_list", "kwargs": { "column_list": ["id", "type"] }, "meta": {} }, { "expectation_type": "expect_column_values_to_be_of_type", "kwargs": { "column": "id", "type_": "string" }, "meta": {} }, { "expectation_type": "expect_column_values_to_be_of_type", "kwargs": { "column": "type", "type_": "string" }, "meta": {} } ], "meta": {} }
Expected Output: The generated Great Expectations schema should also capture the custom quality rules, such as:
**expect_column_values_to_be_unique for the id column
expect_column_values_to_be_in_set for the type column**
Proposed Fix: Ensure that all quality rules under each field are included in the exported Great Expectations schema, in addition to the default schema and type checks.
@SimonAuger could you take a look at this issue?
Could you please let me know its current status and when we might expect a resolution?
We are using the datacontract-cli export command with GE to perform checks, and—as outlined in the issue—some of these checks are currently being missed. Any update you can provide would be greatly appreciated.
Hello @ReshuVishwakarma, thanks for opennings this issue. You are absolutely right. Currently, custom quality checks for GE can only be provided in JSON format and are merged with the auto-generated checks based on the model definition. Which include validations for type, uniqueness, and length.
For your use case, here’s an example of how the data contract could be structured:
dataContractSpecification: 1.1.0
id: my-data-contract-id
info:
title: Orders Unit Test
version: 1.0.0
owner: checkout
description: The orders data contract
contact:
email: [email protected]
url: https://wiki.example.com/teams/checkout
models:
orders:
description: test
fields:
id:
description: Unique identifier for each alert.
type: string
required: true
primaryKey: true
type:
description: The type of alert that has fired.
type: string
required: true
enum: [ "A", "B", "C", "D", "E" ]
quality:
type: great-expectations
specification:
orders: |-
[
{
"expectation_type": "expect_column_values_to_be_in_set",
"kwargs": {
"column": "type",
"value_set": ["A", "B", "C", "D", "E"]
},
"meta": {
"notes": "Ensures that type values are one of the accepted values."
}
}
]
I will try to find a moment next week to create a Pull Request to support GE quality checks yaml definition directly in fields
It would look to something like this. Would that be ok for you @jochenchrist and @ReshuVishwakarma ?
dataContractSpecification: 1.1.0
id: my-data-contract-id
info:
title: Orders Unit Test
version: 1.0.0
owner: checkout
description: The orders data contract
contact:
email: [email protected]
url: https://wiki.example.com/teams/checkout
models:
orders:
description: test
fields:
id:
description: Unique identifier for each alert.
type: string
required: true
primaryKey: true
unique: true
type:
description: The type of alert that has fired.
type: string
required: true
enum: [ "A", "B", "C", "D", "E" ]
quality:
- type: custom
engine: great-expectations
description: "Accepted Values for type"
implementation:
expectation_type: expect_column_values_to_be_in_set
kwargs:
value_set: [ "A", "B", "C", "D", "E" ]
meta:
notes: "Ensures that type values are one of the accepted values."
It would look to something like this. Would that be ok for you @jochenchrist and @ReshuVishwakarma ?
dataContractSpecification: 1.1.0 id: my-data-contract-id info: title: Orders Unit Test version: 1.0.0 owner: checkout description: The orders data contract contact: email: [email protected] url: https://wiki.example.com/teams/checkout models: orders: description: test fields: id: description: Unique identifier for each alert. type: string required: true primaryKey: true unique: true type: description: The type of alert that has fired. type: string required: true enum: [ "A", "B", "C", "D", "E" ] quality: - type: custom engine: great-expectations description: "Accepted Values for type" implementation: expectation_type: expect_column_values_to_be_in_set kwargs: value_set: [ "A", "B", "C", "D", "E" ] meta: notes: "Ensures that type values are one of the accepted values."
Hi @sauger92, thank you for sharing the single‑level pattern — it looks clear and aligns with our needs. However, as I’ve observed, in the current export the auto-generated type checks are only applied at the parent level and do not generate checks for nested fields. Could you please confirm that the same validation logic will be extended to nested structure like arrays (e.g. we should also generate the validation check for the nested priorityOrder field inside each rulesTriggered item.)? For example:
rulesTriggered:
type: array
required: true
items:
type: object
fields:
id:
type: string
required: true
name:
type: string
required: true
priorityOrder:
type: integer
required: true
quality:
- type: custom
engine: great-expectations
description: "Range Check on priorityOrder"
implementation:
expectation_type: expect_column_values_to_be_between
kwargs:
min_value: 0
max_value: 30
Looping Mark for the visibility of this issue. @pixie79
This seems like a duplication as you already have the enum with the list of values, it would make more sense to automatically add this rule to the export if there is an enum field than declare it as a separate manual rule. We are not trying to validate that someone can write the correct enum list after all as in reality all they will do is copy the line and change the name to read value_set. Personally I don’t thing this should be a custom rule but that it should be auto generated. Is there a good reason not too?
This seems like a duplication as you already have the enum with the list of values, it would make more sense to automatically add this rule to the export if there is an enum field than declare it as a separate manual rule. We are not trying to validate that someone can write the correct enum list after all as in reality all they will do is copy the line and change the name to read value_set. Personally I don’t thing this should be a custom rule but that it should be auto generated. Is there a good reason not too?
Hi @pixie79 ,
Thank you for the detailed observation. I understand the concern regarding the duplication when we already have an enum defined with its list of values. Your point about avoiding the manual declaration of a separate custom rule for enum is well taken. It does make sense to have the rule be auto-generated based on the enum field.
For instance:
dataContractSpecification: 1.1.0
id: my-data-contract-id
info:
title: Orders Unit Test
version: 1.0.0
owner: checkout
description: The orders data contract
contact:
email: [email protected]
url: https://wiki.example.com/teams/checkout
models:
orders:
description: test
fields:
id:
description: Unique identifier for each alert.
type: string
required: true
primaryKey: true
unique: true
type:
description: The type of alert that has fired.
type: string
required: true
enum: '#/definitions/type_enum'
definitions:
type:
title: type_enum
type: string
required: false
enum:
- "A"
- "B"
- "C"
- "D"
config:
avroType: enum
In this setup, the enum field for type should automatically trigger the generation of the appropriate value_set validation rules during export. This approach not only simplifies the configuration but also minimizes the risk of inconsistencies that might arise from manually duplicating data. However, we can consider custom rules to cases that involve range-related checks, minimum value checks, etc. as mentioned in the previous comment for PriorityOrder column.
@sauger92 are you good with the updated details or need further clarifications?
I just want to point out that the following
type:
description: The type of alert that has fired.
type: string
required: true
enum: '#/definitions/type_enum'
will not work with the Data Contract CLI, but
type:
description: The type of alert that has fired.
type: string
required: true
$ref: '#/definitions/type_enum'
will.
Regardless, I second both, the support for quality checks of gx at the field level, and the automatic creation of the quality check based on the enum field. @sauger92 thanks for your help already! :-)
Hi, I have created the Pull Request. Sorry for the delay I had a lot of works last few weeks. Do not hesitate to add feedbacks https://github.com/datacontract/datacontract-cli/pull/738
#738 was merged