Spike - Data Platform Monitoring
To get the data platform into a stable, reliable state, we need a way to monitor data as it flows through the platform and to alert when something isn't right.
Additionally, consider whether we need to implement this ourselves or whether these needs would be resolved by migrating to managed Meltano. The findings could also inform some managed Meltano features.
Infra/jobs:
- Airflow jobs run
- Airflow errors
- k8s health
- Snowflake query time and queue size
- Snowflake warehouse consumption and billing
- Snowpipe volume and copy errors
Data:
- dbt model runs, using artifacts to track table sizes and runtimes
- dbt freshness
- dbt tests
- dbt errors
- Snowpipe bad messages
- EL tap volume, i.e. is a tap silently syncing nothing?
- Snowflake observability and anomaly detection, e.g. did we get a huge drop in row counts vs last week?
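The last two checks (a tap silently syncing nothing, and a large drop in row counts vs last week) could be sketched roughly as follows. This is only an illustration of the idea, not a proposed implementation: `VolumeSample`, `find_volume_anomalies`, and the 50% `max_drop` threshold are hypothetical names and values, and the row counts are assumed to have been collected already (for example from dbt artifacts or a Snowflake query).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VolumeSample:
    """Hypothetical container for one table or tap stream."""
    name: str           # table or tap stream name
    current_rows: int   # rows observed in the latest run
    baseline_rows: int  # rows observed in the comparison window, e.g. last week


def drop_fraction(sample: VolumeSample) -> Optional[float]:
    """Fractional drop vs the baseline, or None when there is no baseline."""
    if sample.baseline_rows <= 0:
        return None
    return (sample.baseline_rows - sample.current_rows) / sample.baseline_rows


def find_volume_anomalies(
    samples: List[VolumeSample], max_drop: float = 0.5
) -> List[str]:
    """Return a human-readable finding for each suspicious sample.

    max_drop is an assumed threshold: 0.5 flags a drop of 50% or more.
    """
    findings = []
    for sample in samples:
        drop = drop_fraction(sample)
        if sample.current_rows == 0 and sample.baseline_rows > 0:
            # Covers the "tap silently syncing nothing" case.
            findings.append(
                f"{sample.name}: synced 0 rows (baseline {sample.baseline_rows})"
            )
        elif drop is not None and drop >= max_drop:
            findings.append(f"{sample.name}: row count down {drop:.0%} vs baseline")
    return findings
```

A fixed threshold is the simplest option. A real check would probably need per-table thresholds or some handling of seasonality, which is part of what this spike should evaluate.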
Alerting:
- Slack?
- CI failures send emails; should they go to Slack as well?
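If Slack is chosen, one low-effort option is a Slack incoming webhook, which accepts a JSON body with a `text` field. The sketch below assumes that approach. `build_alert_payload` and `post_to_slack` are hypothetical helper names, and the webhook URL would have to come from a secret that does not exist yet.

```python
import json
import urllib.request
from typing import Dict, List


def build_alert_payload(source: str, findings: List[str]) -> Dict[str, str]:
    """Format findings from one monitor (e.g. "dbt tests") as a Slack message."""
    lines = [f"*{source}*: {len(findings)} issue(s) detected"]
    lines += [f"- {finding}" for finding in findings]
    return {"text": "\n".join(lines)}


def post_to_slack(
    webhook_url: str, payload: Dict[str, str], timeout: float = 10.0
) -> int:
    """POST the payload to an incoming webhook and return the HTTP status.

    webhook_url is an assumption: it would be supplied via a secret or
    environment variable, never hard-coded.
    """
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping message formatting separate from delivery means the same findings could also be routed to email or another channel if Slack turns out not to be the right fit.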
Similar to https://github.com/meltano/squared/issues/114, which focuses specifically on data observability and on answering the "did we get a huge drop in row counts vs last week?" question.