datacontract-cli icon indicating copy to clipboard operation
datacontract-cli copied to clipboard

Custom Quality Rules Not Included in Great Expectations Export

Open ReshuVishwakarma opened this issue 8 months ago • 12 comments

When using the datacontract export datacontract.yaml --format great-expectations command, the generated Great Expectations schema only includes column order and type checks. However, it does not incorporate the custom quality rules defined under the quality section for individual fields.

For example, given the following data contract fields section:

fields: id: description: Unique identifier for each alert. type: string required: true primaryKey: true quality: - type: custom engine: great-expectations description: "Uniqueness Check on id" implementation: expectation_type: expect_column_values_to_be_unique meta: notes: "Ensures that each id is unique." - type: custom engine: great-expectations description: "Type Check for id" implementation: expectation_type: expect_column_values_to_be_of_type kwargs: column: id type_: string meta: notes: "Checks that id is of type string." type: description: The type of alert that has fired. type: string required: true enum: ["A", "B", "C", "D", "E"] quality: - type: custom engine: great-expectations description: "Accepted Values for type" implementation: expectation_type: expect_column_values_to_be_in_set kwargs: value_set: ["A", "B", "C", "D", "E"] meta: notes: "Ensures that type values are one of the accepted values."

The current output only includes type checks and ordered column validation:

{ "expectations": [ { "expectation_type": "expect_table_columns_to_match_ordered_list", "kwargs": { "column_list": ["id", "type"] }, "meta": {} }, { "expectation_type": "expect_column_values_to_be_of_type", "kwargs": { "column": "id", "type_": "string" }, "meta": {} }, { "expectation_type": "expect_column_values_to_be_of_type", "kwargs": { "column": "type", "type_": "string" }, "meta": {} } ], "meta": {} }

Expected Output: The generated Great Expectations schema should also capture the custom quality rules, such as:

**expect_column_values_to_be_unique for the id column

expect_column_values_to_be_in_set for the type column**

Proposed Fix: Ensure that all quality rules under each field are included in the exported Great Expectations schema, in addition to the default schema and type checks.

ReshuVishwakarma avatar Apr 03 '25 12:04 ReshuVishwakarma

@SimonAuger could you take a look at this issue?

jochenchrist avatar Apr 03 '25 13:04 jochenchrist

Could you please let me know its current status and when we might expect a resolution?

We are using the datacontract-cli export command with GE to perform checks, and—as outlined in the issue—some of these checks are currently being missed. Any update you can provide would be greatly appreciated.

ReshuVishwakarma avatar Apr 08 '25 16:04 ReshuVishwakarma

Hello @ReshuVishwakarma, thanks for opennings this issue. You are absolutely right. Currently, custom quality checks for GE can only be provided in JSON format and are merged with the auto-generated checks based on the model definition. Which include validations for type, uniqueness, and length.

For your use case, here’s an example of how the data contract could be structured:

dataContractSpecification: 1.1.0
id: my-data-contract-id
info:
  title: Orders Unit Test
  version: 1.0.0
  owner: checkout
  description: The orders data contract
  contact:
    email: [email protected]
    url: https://wiki.example.com/teams/checkout
models:
  orders:
    description: test
    fields:
      id:
        description: Unique identifier for each alert.
        type: string
        required: true
        primaryKey: true
      type:
        description: The type of alert that has fired.
        type: string
        required: true
        enum: [ "A", "B", "C", "D", "E" ]
quality:
  type: great-expectations
  specification:
    orders: |-
      [
        {
          "expectation_type": "expect_column_values_to_be_in_set",
          "kwargs": {
            "column": "type",
            "value_set": ["A", "B", "C", "D", "E"]
          },
          "meta": {
            "notes": "Ensures that type values are one of the accepted values."
          }
        }
      ]

sauger92 avatar Apr 11 '25 15:04 sauger92

I will try to find a moment next week to create a Pull Request to support GE quality checks yaml definition directly in fields

sauger92 avatar Apr 11 '25 15:04 sauger92

It would look to something like this. Would that be ok for you @jochenchrist and @ReshuVishwakarma ?

dataContractSpecification: 1.1.0
id: my-data-contract-id
info:
  title: Orders Unit Test
  version: 1.0.0
  owner: checkout
  description: The orders data contract
  contact:
    email: [email protected]
    url: https://wiki.example.com/teams/checkout
models:
  orders:
    description: test
    fields:
      id:
        description: Unique identifier for each alert.
        type: string
        required: true
        primaryKey: true
        unique: true
      type:
        description: The type of alert that has fired.
        type: string
        required: true
        enum: [ "A", "B", "C", "D", "E" ]
        quality:
          - type: custom
            engine: great-expectations
            description: "Accepted Values for type"
            implementation:
              expectation_type: expect_column_values_to_be_in_set
              kwargs:
                value_set: [ "A", "B", "C", "D", "E" ]
              meta:
                notes: "Ensures that type values are one of the accepted values."

sauger92 avatar Apr 11 '25 16:04 sauger92

It would look to something like this. Would that be ok for you @jochenchrist and @ReshuVishwakarma ?

dataContractSpecification: 1.1.0 id: my-data-contract-id info: title: Orders Unit Test version: 1.0.0 owner: checkout description: The orders data contract contact: email: [email protected] url: https://wiki.example.com/teams/checkout models: orders: description: test fields: id: description: Unique identifier for each alert. type: string required: true primaryKey: true unique: true type: description: The type of alert that has fired. type: string required: true enum: [ "A", "B", "C", "D", "E" ] quality: - type: custom engine: great-expectations description: "Accepted Values for type" implementation: expectation_type: expect_column_values_to_be_in_set kwargs: value_set: [ "A", "B", "C", "D", "E" ] meta: notes: "Ensures that type values are one of the accepted values."

Hi @sauger92, thank you for sharing the single‑level pattern — it looks clear and aligns with our needs. However, as I’ve observed, in the current export the auto-generated type checks are only applied at the parent level and do not generate checks for nested fields. Could you please confirm that the same validation logic will be extended to nested structure like arrays (e.g. we should also generate the validation check for the nested priorityOrder field inside each rulesTriggered item.)? For example:

rulesTriggered:
  type: array
  required: true
  items:
    type: object
    fields:
      id:
        type: string
        required: true
      name:
        type: string
        required: true
      priorityOrder:
        type: integer
        required: true
        quality:
          - type: custom
            engine: great-expectations
            description: "Range Check on priorityOrder"
            implementation:
              expectation_type: expect_column_values_to_be_between
              kwargs:
                min_value: 0
                max_value: 30

ReshuVishwakarma avatar Apr 12 '25 16:04 ReshuVishwakarma

Looping Mark for the visibility of this issue. @pixie79

ReshuVishwakarma avatar Apr 14 '25 10:04 ReshuVishwakarma

This seems like a duplication as you already have the enum with the list of values, it would make more sense to automatically add this rule to the export if there is an enum field than declare it as a separate manual rule. We are not trying to validate that someone can write the correct enum list after all as in reality all they will do is copy the line and change the name to read value_set. Personally I don’t thing this should be a custom rule but that it should be auto generated. Is there a good reason not too?

pixie79 avatar Apr 15 '25 03:04 pixie79

This seems like a duplication as you already have the enum with the list of values, it would make more sense to automatically add this rule to the export if there is an enum field than declare it as a separate manual rule. We are not trying to validate that someone can write the correct enum list after all as in reality all they will do is copy the line and change the name to read value_set. Personally I don’t thing this should be a custom rule but that it should be auto generated. Is there a good reason not too?

Hi @pixie79 ,

Thank you for the detailed observation. I understand the concern regarding the duplication when we already have an enum defined with its list of values. Your point about avoiding the manual declaration of a separate custom rule for enum is well taken. It does make sense to have the rule be auto-generated based on the enum field.

For instance:

dataContractSpecification: 1.1.0
id: my-data-contract-id
info:
  title: Orders Unit Test
  version: 1.0.0
  owner: checkout
  description: The orders data contract
  contact:
    email: [email protected]
    url: https://wiki.example.com/teams/checkout
models:
  orders:
    description: test
    fields:
      id:
        description: Unique identifier for each alert.
        type: string
        required: true
        primaryKey: true
        unique: true
      type:
        description: The type of alert that has fired.
        type: string
        required: true
        enum: '#/definitions/type_enum'
        
definitions:
  type:
    title: type_enum
    type: string
    required: false
    enum:
      - "A"
      - "B"
      - "C"
      - "D"     
    config:
      avroType: enum

In this setup, the enum field for type should automatically trigger the generation of the appropriate value_set validation rules during export. This approach not only simplifies the configuration but also minimizes the risk of inconsistencies that might arise from manually duplicating data. However, we can consider custom rules to cases that involve range-related checks, minimum value checks, etc. as mentioned in the previous comment for PriorityOrder column.

ReshuVishwakarma avatar Apr 15 '25 09:04 ReshuVishwakarma

@sauger92 are you good with the updated details or need further clarifications?

ReshuVishwakarma avatar Apr 15 '25 09:04 ReshuVishwakarma

I just want to point out that the following

      type:
        description: The type of alert that has fired.
        type: string
        required: true
        enum: '#/definitions/type_enum'

will not work with the Data Contract CLI, but

      type:
        description: The type of alert that has fired.
        type: string
        required: true
        $ref: '#/definitions/type_enum'

will.

Regardless, I second both, the support for quality checks of gx at the field level, and the automatic creation of the quality check based on the enum field. @sauger92 thanks for your help already! :-)

simonharrer avatar Apr 16 '25 18:04 simonharrer

Hi, I have created the Pull Request. Sorry for the delay I had a lot of works last few weeks. Do not hesitate to add feedbacks https://github.com/datacontract/datacontract-cli/pull/738

sauger92 avatar Apr 26 '25 10:04 sauger92

#738 was merged

jochenchrist avatar Jul 06 '25 07:07 jochenchrist