
Nested data support

Open DCastile opened this issue 3 years ago • 7 comments

Is your feature request related to a problem? Please describe. Currently GE has no support for nested data inside a cell of a dataset. However, many datasets contain nested/hierarchical data in these cells, and those values must be quality checked. A workaround is to pre-process the data before running expectations, but this separates the expectations from the original dataset you want to quality check, greatly increasing the complexity of the system.

Describe the solution you'd like If a field contains nested data that we want to place an expectation on, we should be able to use dot notation (or similar) to tell GE we are referring to a nested field.

As an example, take the input schema below, where name and residence are top-level keys. These could also be thought of as columns when the data is displayed as a tabular dataset or dataframe. Inside each column is a nested map data structure with values we'd like to create expectations for.

For example, to ensure name.first contains only ASCII alphabetic characters, we could write: expect_column_values_to_match_regex('name.first', '^[a-zA-Z]+$')

{
    "name": {
        "first": "john",
        "last": "smith"
    },
    "residence": {
        "state": "CA",
        "city": "Pleasanton"
    }
}
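To make the dot-notation idea concrete, here is a minimal sketch of how a path like name.first could be resolved against a nested cell value. The helper name resolve_path is hypothetical and not part of Great Expectations.

```python
# Hypothetical sketch of dot-notation resolution against a nested cell value.
# `resolve_path` is illustrative only; it is not a Great Expectations API.
def resolve_path(record, path):
    """Walk a dot-separated path through nested dicts; return None on a miss."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict):
            return None
        current = current.get(key)
    return current

row = {
    "name": {"first": "john", "last": "smith"},
    "residence": {"state": "CA", "city": "Pleasanton"},
}
print(resolve_path(row, "name.first"))  # prints "john"
```

An expectation engine could apply such a resolver per row before running the usual value-level checks.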

This next part could be a reach, but it could bring some clever and graceful solutions to complex problems. The same concept could be applied to nested fields with multiplicity; notice that residences is now an array. If we wanted to check every residence state, we could use notation like: expect_column_values_to_be_in_set('residences.#.state', state_codes)

{
    "name": {
        "first": "john",
        "last": "smith"
    },
    "residences": [
        {
            "state": "CA",
            "city": "Pleasanton"
        },
        {
            "state": "AZ",
            "city": "Phoenix"
        }
    ]
}
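A similar sketch, assuming the proposed # wildcard fans out over list elements so that residences.#.state collects every state. Again purely illustrative, and resolve_wildcard_path is not a GE API.

```python
# Illustrative sketch of the proposed "#" wildcard: fan out over list
# elements so a path like "residences.#.state" collects every state.
def resolve_wildcard_path(record, path):
    values = [record]
    for part in path.split("."):
        next_values = []
        for value in values:
            if part == "#" and isinstance(value, list):
                next_values.extend(value)
            elif isinstance(value, dict) and part in value:
                next_values.append(value[part])
        values = next_values
    return values

row = {
    "name": {"first": "john", "last": "smith"},
    "residences": [
        {"state": "CA", "city": "Pleasanton"},
        {"state": "AZ", "city": "Phoenix"},
    ],
}
print(resolve_wildcard_path(row, "residences.#.state"))  # prints ['CA', 'AZ']
```

Each collected value could then be checked against a set such as state_codes, one assertion per array element.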

Describe alternatives you've considered Pre-processing the data to normalize or unnest it fundamentally means you are no longer creating expectations for the target dataset; you are creating them on some derived dataset, and you now need to maintain and understand lineage. This is a bad solution.

Additional context

  • Our organization uses exclusively nested data so we are particularly interested in this functionality.
  • This is a reopening of #2231
  • The majority of the backends used by GE natively support semi-structured/nested data operations (Snowflake, BigQuery, Redshift, Postgres, Spark, etc.), and all use some type of dot notation (Snowflake uses : instead of .)

DCastile avatar Jan 10 '22 19:01 DCastile

Ditto this. Support for string representations of JSON would be incredibly useful.

sarahmk125 avatar Jan 10 '22 19:01 sarahmk125

Besides JSON, support for nested data in Parquet/ORC/Delta would also be very welcome. I would say that adding support for systems like Snowflake, AWS Glue/Hive, and Redshift that already support these nested structures is a higher priority.

ricardogaspar2 avatar Jan 12 '22 11:01 ricardogaspar2

Thanks for opening this issue, @DCastile, and for the context, @sarahmk125, @ricardogaspar2! We will review internally and be in touch.

talagluck avatar Jan 12 '22 17:01 talagluck

I would like to reinforce that this functionality would be very useful for my team's work as well. We have highly nested data and need to run validations against it. We started evaluating GE and ran into this limitation. We may adopt it for our tabular data, but it would be very useful if we could also adopt it for nested data.

icapetti avatar Feb 27 '22 13:02 icapetti

Like @icapetti, this functionality would be a game changer for my team as well. There is a growing need among the organizations we work with to monitor data quality and validate data against nested standards and schemas. The lack of support for nested data in GE is a major obstacle for many clients because they want to validate their data in its native format.

moreymat avatar Mar 26 '22 11:03 moreymat

Thanks, all! I had discussed this with @DCastile, and we currently have basic support for this through existing Spark functionality, though it is somewhat limited (it's possible to work with data that is entirely keyed, but there is no existing notation for accessing data in arrays). I believe there were some additional questions about whether this is possible with SQLAlchemy more broadly, which could unlock this use case.

This would be a great feature, though it will likely not be something that we'll be able to prioritize in the immediate future. If anyone is interested in making a contribution in the nearer term, we are happy to offer guidance and support wherever it may be needed.

talagluck avatar Mar 28 '22 15:03 talagluck

Is this issue still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity.

It will be closed if no further activity occurs. Thank you for your contributions 🙇

github-actions[bot] avatar Aug 05 '22 02:08 github-actions[bot]

This issue is still relevant.

retrry avatar Oct 28 '22 07:10 retrry

If I am importing my expectations from JSON Schema, and the JSON Schema reflects hierarchically nested objects, then flattening is no longer as simple a solution as it sounds.
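To illustrate the point above, here is a sketch (not a GE feature; schema_paths is a hypothetical helper) of walking a nested JSON Schema to enumerate the dot-notation paths that expectations would target. With flattening, every one of these paths would have to be re-derived and kept in sync with the flattened column names.

```python
# Sketch: enumerate dot-notation leaf paths from a nested JSON Schema.
# Illustrative only; `schema_paths` is not part of Great Expectations.
def schema_paths(schema, prefix=""):
    paths = []
    for name, subschema in schema.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        if subschema.get("type") == "object":
            # Recurse into nested objects, extending the dotted prefix.
            paths.extend(schema_paths(subschema, path))
        else:
            paths.append(path)
    return paths

person_schema = {
    "type": "object",
    "properties": {
        "name": {
            "type": "object",
            "properties": {
                "first": {"type": "string"},
                "last": {"type": "string"},
            },
        },
        "residence": {
            "type": "object",
            "properties": {"state": {"type": "string"}},
        },
    },
}
print(schema_paths(person_schema))  # prints ['name.first', 'name.last', 'residence.state']
```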

oogetyboogety avatar Oct 31 '22 22:10 oogetyboogety

@talagluck I don't mind taking a stab at this. Would you be able to give me some pointers on where to look?

manugarri avatar Nov 09 '22 18:11 manugarri

Very important use case for our project, specifically for nested arrays like the following:

{
    "test": [
        {"a": 1, "b": 2},
        {"a": 1, "b": 3},
        {"a": 1, "b": 2},
        {"a": 5, "b": 2},
        {"a": 1, "b": 10}
    ]
}

Is there any progress on this?

s-agarbhi avatar Jan 20 '23 22:01 s-agarbhi

Hey all, at this point we're not considering support for nested data sources or accepting contributions in this area, as it presents multiple formidable challenges and would involve significant architectural changes. Once we're ready to tackle the issue, we'll make an announcement.

rdodev avatar Mar 07 '23 19:03 rdodev