squared Spike - Data Platform Monitoring

Open pnadolny13 opened this issue 2 years ago • 1 comment

To get the data platform into a stable, reliable place, we need a way to monitor data as it flows through the platform and to alert when things aren't right.

Additionally, consider whether we need to implement this ourselves or whether these needs would be resolved by migrating to managed Meltano. The findings could also inform some managed Meltano features.

Infra/jobs:

Airflow jobs run
Airflow errors
k8s health
Snowflake query time and queue size
Snowflake warehouse consumption and billing
Snowpipe volume and copy errors

Data:

dbt model runs, using artifacts to track table sizes and runtimes
dbt freshness
dbt tests
dbt errors
Snowpipe bad messages
EL tap volume - i.e. is a tap silently syncing nothing?
Snowflake observability and anomaly detection, i.e. did we get a huge drop in row counts vs. last week?
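
As a starting point for the dbt items above (models run, runtime, tests, errors), here is a minimal sketch that summarizes an already-parsed dbt `run_results.json` artifact. It assumes the artifact has a top-level `results` list whose entries carry `unique_id`, `status`, `execution_time`, and `message`, and that `error` and `fail` are the statuses worth alerting on; both assumptions should be checked against the dbt version in use. The function name `summarize_run_results` is hypothetical.

```python
def summarize_run_results(run_results: dict, top_n: int = 5) -> dict:
    """Summarize a parsed dbt run_results.json artifact.

    Assumes each entry in run_results["results"] has "unique_id",
    "status", "execution_time" and (optionally) "message" keys.
    """
    failures = []
    runtimes = []
    total = 0
    for result in run_results.get("results", []):
        total += 1
        node_id = result.get("unique_id", "<unknown>")
        status = result.get("status")
        runtimes.append((node_id, result.get("execution_time", 0.0)))
        # Assumption: "error" marks failed models, "fail" marks failed tests.
        if status in ("error", "fail"):
            failures.append(
                {"node": node_id, "status": status, "message": result.get("message")}
            )
    # Slowest nodes first, to spot runtime regressions.
    slowest = sorted(runtimes, key=lambda pair: pair[1], reverse=True)[:top_n]
    return {"total": total, "failures": failures, "slowest": slowest}
```

In practice the input would come from something like `json.loads(Path("target/run_results.json").read_text())` after a dbt invocation; table sizes are not covered here and would need a separate source.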

Alerting:

Slack?
CI failures send emails; Slack also?
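
If Slack is chosen, one option is a Slack incoming webhook, which accepts a JSON body with a `text` field. The sketch below is only an illustration under that assumption: the webhook URL is a placeholder supplied by the caller, and the helper names `build_slack_payload` and `send_slack_alert` are hypothetical.

```python
import json
import urllib.request


def build_slack_payload(source: str, message: str) -> dict:
    """Build the JSON body for a Slack incoming webhook message."""
    # Prefix with the source (e.g. "airflow", "dbt") so alerts are easy to scan.
    return {"text": f"[{source}] {message}"}


def send_slack_alert(webhook_url: str, source: str, message: str, timeout: float = 10.0) -> int:
    """POST an alert to a Slack incoming webhook and return the HTTP status."""
    body = json.dumps(build_slack_payload(source, message)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,  # placeholder: the real URL comes from the Slack app config
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping the payload builder separate from the network call makes the message format testable without a real webhook.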

Aug 23 '22 15:08 pnadolny13

Similar to https://github.com/meltano/squared/issues/114, which is specifically about data observability and answering the "did we get a huge drop in row counts vs last week?" question.
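
One simple way to frame the row-count question is a relative-drop check against last week's counts. This is a rough sketch, not a proposal for the final approach: the 50% threshold, the function names, and the idea that per-table counts are already available as dictionaries are all assumptions. A table missing from the current counts is treated as zero rows, which also catches a tap that silently syncs nothing.

```python
def row_count_drop(current: int, previous: int, max_drop: float = 0.5) -> bool:
    """Return True when current fell by more than max_drop relative to previous."""
    if previous <= 0:
        # No usable baseline, so a relative drop cannot be computed.
        return False
    drop = (previous - current) / previous
    return drop > max_drop


def find_row_count_anomalies(
    current_counts: dict, previous_counts: dict, max_drop: float = 0.5
) -> list:
    """List tables whose row count dropped too far versus the previous period."""
    flagged = []
    for table, previous in previous_counts.items():
        # A table absent from the current snapshot counts as zero rows.
        current = current_counts.get(table, 0)
        if row_count_drop(current, previous, max_drop):
            flagged.append(table)
    return sorted(flagged)
```

For example, comparing `{"orders": 100, "events": 950}` against last week's `{"orders": 1000, "events": 1000, "users": 50}` would flag `orders` and `users` but not `events`.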

Aug 23 '22 15:08 pnadolny13
