great_expectations icon indicating copy to clipboard operation
great_expectations copied to clipboard

Idea for checking if all distinct value of a foreign key of a table is included in its dimension table or not

Open derek-hk opened this issue 2 years ago • 4 comments

Is your feature request related to a problem? Please describe. Problem statement: It is to check if all distinct values of a fact table's key column (says, product key) is included in the dimension master table or not, which can be checked by expect_column_values_to_be_in_set, which turns the value set into an python list, regardless of the performance. However, it is unrealistic in an enterprise level relational database or a data warehouse, as the computation happens in the python machine, which turns a multitude of tuple into either a true or a false, so that OOM could be easily hit.

Describe the solution you'd like Solution: Do the computation by the database's calculating power. For instance, suppose there are

  • a fact table F_SALES <order_key, product_id, sales_value>
  • a dimension table D_PRODUCT <product_id, product_name, supplier_name> We are interested if the entire F_SALES.product_id is included in D_PRODUCT.product_id or not. Why don't we pass the true false calculation to the database/data warehouse? i.e. calculating the following SQL:
select max(if(d.product_key is null)) from F_SALES s
left join D_PRODUCT d on s.product_key = d.product_key

Explanation of the proposed solution if doing max() on a T/F column which contains only one true but a lot of true, it would still giving true Therefore, if one key cannot be matched, the boolean will give true, which indicated an "out of expectation" in this approach, only 4 conditions is needed: main table(F_SALES), reference table(D_PRODUCT), join key of maintable, join key of lookup table So the SQL becomes,

select max(if(<join key of the lookup table> is null)) from <main table> m
left join <lookup table> l on l.<join key of lookup table> = m.<join key of main table>

derek-hk avatar Oct 11 '21 02:10 derek-hk

Thanks for opening this, @derek-hk! We will review and be in touch soon.

talagluck avatar Oct 18 '21 14:10 talagluck

Thanks for you patience here @derek-hk! After reviewing your proposed changes, this is something we'd love to see as part of our Expectation functionality but we can't prioritize it at the moment.

Is this something you'd be interested in contributing this as a prototype Expectation in contrib/? This could definitely provide some very useful functionality to our users.

cdkini avatar Nov 24 '21 17:11 cdkini

Looking forward for this feature as well :) In my feeling it could be really useful as I quite often have the situation that I want to check if "ids" from one .csv [some table] are included in another .csv [master table/dictionary]. Thanks!

NykPol avatar Feb 02 '22 16:02 NykPol

+1 for this feature

stevemeckstroth-8451 avatar Mar 28 '22 14:03 stevemeckstroth-8451

Hi @derek-hk and others - thanks for raising this! This is great feedback. As mentioned above, this isn't something that we will be able to implement in the near-term, but we've logged this feedback to consider when we are working in the area. I'm going to close this issue for now, but if you or anyone else is interested in making a contribution, please free to re-open this issue and let us know.

talagluck avatar Mar 08 '23 13:03 talagluck