[ELE-1932] Support group_by in the column_anomalies test

Open haritamar opened this issue 1 year ago • 1 comments

The goal is to be able to group by dimensions for the column_anomalies test, similarly to how the dimension_anomalies test provides this mechanism for volume.

Suggested interface:

- elementary.column_anomalies:
    column_anomalies:
      - zero_count
    group_by: platform

or by multiple dimensions:

- elementary.column_anomalies:
    column_anomalies:
      - zero_count
    group_by:
      - platform
      - device

NOTE - part of this feature is deciding how it should look in the UI, which I'm not going to cover here (it needs to be discussed with the Elementary product team). A good POC would be to return a structure similar to that of dimension_anomalies when group_by is included.

Suggested approach - I think it would be good to look at how dimensions are handled in the dimension_anomalies test and incorporate that logic into the other anomaly tests via a "group_by" parameter.

Relevant pointers in the dbt package repository (dbt_data_reliability):

  • Main implementation of the tests:
    • test_column_anomalies.sql - Main implementation of the column anomalies test
    • test_dimension_anomalies.sql - Implementation of the existing dimension anomalies test
  • Monitoring queries
    • column_monitoring_query - This file is in charge of computing the various column metrics Elementary supports.
    • dimension_monitoring_query - This file is the monitoring query for the dimension_anomalies test, and its implementation is specific to volume (row count). I think it would be good to see whether we can add "group by" functionality to column_monitoring_query rather than extending dimension_monitoring_query.
  • Relevant integration tests:
    • Both tests have integration tests - see test_dimension_anomalies.py and test_column_anomalies.py
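
To make the "group by" idea concrete, here is a minimal Python sketch of computing one column metric (zero_count) per time bucket and per dimension value. It is illustrative only: the real monitoring queries are SQL/Jinja macros in dbt_data_reliability, and the function name grouped_zero_count, the bucket_key argument, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def grouped_zero_count(rows, column, group_by, bucket_key="day"):
    """Count zero values of `column` per (time bucket, dimension values).

    Hypothetical helper for illustration; not part of Elementary.
    """
    # Accept a single dimension name or a list, mirroring the suggested interface.
    dims = [group_by] if isinstance(group_by, str) else list(group_by)
    counts = defaultdict(int)
    for row in rows:
        key = (row[bucket_key], tuple(row[d] for d in dims))
        # Adding 0 still creates the key, so groups without zeros report 0.
        counts[key] += 1 if row[column] == 0 else 0
    return dict(counts)


# Made-up sample data, only to exercise the function.
rows = [
    {"day": date(2023, 10, 28), "platform": "ios", "device": "phone", "clicks": 0},
    {"day": date(2023, 10, 28), "platform": "ios", "device": "phone", "clicks": 3},
    {"day": date(2023, 10, 28), "platform": "android", "device": "tablet", "clicks": 0},
    {"day": date(2023, 10, 29), "platform": "ios", "device": "phone", "clicks": 0},
    {"day": date(2023, 10, 29), "platform": "ios", "device": "phone", "clicks": 0},
]

by_platform = grouped_zero_count(rows, "clicks", "platform")
by_platform_device = grouped_zero_count(rows, "clicks", ["platform", "device"])
```

Each (bucket, dimension values) key then forms its own metric time series, so anomaly detection can run per group instead of once for the whole table.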

Relevant code in the CLI (elementary):

  • get_test_results - You can see how results are handled for dimension anomalies here; we should handle them similarly for the column test.

haritamar, Oct 29 '23 23:10

This feature would be a game changer for us. Being able to look at value-based anomalies on grouped data is a critical feature, and not just for us, I'm sure.

Take an example where you have sensors deployed in the real world gathering data, and a single monolithic table in your database that stores the data from all of them. We'd like to use the column anomaly features in Elementary to track the quality of each individual sensor over time, which we would be able to do if this GitHub feature request were implemented.

futurebenmorris, Mar 25 '24 18:03

Closing since this has been recently implemented! (As the dimensions parameter)

haritamar, May 29 '24 00:05