[🐛 Bug]: Correlation & Incidents: Backend - Frontend mismatch
Describe the bug
I created many rules in the Correlations tab, which used to create several incidents as I wanted. However, since our pentest team sent test alerts to Keep's webhook URL, I see the following behaviour:

- When I try to open the Correlations tab, it shows "Just a second, getting your data" seemingly forever, and even when it finishes loading, I only see the default layout of the Correlations tab, without any of the rules I created.
- The Incidents tab also takes very long to load, and it shows incidents created from older rules (which I deleted a long time ago), rendered with an error font.
The most common cause for persistent loading, missing rules, and stale incidents in Keep is a backend authentication issue: specifically, missing or misconfigured OAuth2 proxy headers between the frontend and backend pods. If the backend doesn't receive the expected user header (like x-auth-request-email), API requests fail with 401 errors, which leads to infinite loading and empty UI states in the Correlations, Incidents, and Facets tabs. Double-check that your frontend and backend are both configured to send and expect the correct headers, and that environment variables like AUTH_TYPE=OAUTH2PROXY, KEEP_OAUTH2_PROXY_USER_HEADER, and KEEP_OAUTH2_PROXY_ROLE_HEADER are set appropriately. You can find more details and troubleshooting steps in this issue.
For incidents showing up from deleted rules, Keep uses a soft deletion flag (is_deleted) for rules. If the frontend or backend doesn't properly filter out deleted rules, incidents from old rules may still appear. The incident model links to rules with a foreign key and should cascade deletes, but stale data can persist if the UI or API doesn't update correctly. Reviewing your database state for rules and incidents can help confirm if deleted items are still present or incorrectly marked as visible (code reference).
Facets rely on backend endpoints that aggregate incident data. If the main data state is inconsistent or API requests fail, facets won't load. Check backend logs for errors when accessing these endpoints, and look for frontend console errors related to CEL queries or state management (facet loading bug).
If authentication and data state look correct, try creating a new rule and see if it appears in the Correlations tab; this can help isolate whether the issue is with data state, authentication, or frontend logic.
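For reference, the environment variables mentioned above might look like this on the backend container of a Kubernetes deployment (a sketch; the header values shown are oauth2-proxy's defaults and may differ in your setup):

```yaml
# Backend container env (illustrative; adjust header names to your proxy config)
env:
  - name: AUTH_TYPE
    value: "OAUTH2PROXY"
  - name: KEEP_OAUTH2_PROXY_USER_HEADER
    value: "x-auth-request-email"    # header the proxy injects with the user's email
  - name: KEEP_OAUTH2_PROXY_ROLE_HEADER
    value: "x-auth-request-groups"   # header carrying the user's role/groups
```

If the proxy strips or renames these headers on the way to the backend, every API call returns 401 and the tabs spin forever, which matches the symptoms described.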
hey @tmqu130 - I'll need backend logs to understand what's going on.
feel free to reach out on our Slack (https://getkeep.slack.com/)
@dosu
There is a problem: I use a workflow to push alerts from our database to Keep. Although the amount of data is small, keepdb keeps growing larger and larger; I don't know why, or how to address it.
The main reason keepdb keeps growing is that Keep does not automatically delete old alerts or incidents: records are only soft-deleted (flagged, not removed), and there's no built-in retention policy or cleanup job by default. This means even a small number of incoming alerts can cause the database to grow over time, especially if your workflow generates related incidents, logs, or metadata for each alert. Manual cleanup is required unless you set up your own process.
To address this:
- Identify which tables are growing fastest (typically `alert`, `alert_raw`, `incident`, and related tables). You can use SQL queries or your DB admin tool to check row counts and storage usage.
- Check for old or soft-deleted records (look for fields like `is_deleted`, `deleted_at`, or incident `status='deleted'`).
- There's a community-provided Helm chart cronjob for database cleanup that you can adapt for your environment (example here); it can be used to periodically purge old or deleted records from the database.
- Manual deletion of alerts/incidents is possible via direct SQL, but there's currently no UI for this (feature request).
- Review your workflow logic to ensure it isn't generating redundant or excessive alerts/incidents.
There's an open feature request for configurable retention policies and automatic cleanup (discussion), but for now, regular manual or scripted cleanup is needed to control keepdb size.
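As a sketch of the first two steps above, assuming a MySQL database named `keepdb` (table and column names are based on the schema described in this thread and may differ in your version of Keep):

```sql
-- Which tables in keepdb take the most space?
SELECT table_name,
       table_rows,
       ROUND((data_length + index_length) / 1024 / 1024, 1) AS size_mb
FROM information_schema.tables
WHERE table_schema = 'keepdb'
ORDER BY (data_length + index_length) DESC
LIMIT 10;

-- How many soft-deleted rules are still sitting in the database?
SELECT COUNT(*) AS soft_deleted_rules
FROM rule
WHERE is_deleted = 1;
```

Running the first query periodically (and recording the results) tells you quickly whether growth is coming from alerts, incidents, or log tables.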
@dosu
This is my current workflow.yaml to push alerts:

```yaml
workflow:
  id: clickhouse-push-critical-alerts-to-keep
  name: Push New Critical Alerts from ClickHouse to Keep
  description: "Fetch new critical alerts from the ds_alert table and push them to the Keep Feed"
  disabled: false
  triggers:
    - type: interval
      value: 2m
  consts:
    workflow_tag: alerts
  steps:
    - name: get-new-critical-alerts
      provider:
        type: clickhouse
        config: "{{ providers.alert-incidents }}"
        with:
          query: >-
            SELECT any(id) AS alert_id, tags, lower(severity) AS severity, topo_set,
            toTimeZone(created_at, 'Asia/Ho_Chi_Minh') AS event_date, description,
            noc_extra_f1, topo_ots_set, service_index, topo_ring_set, obj_id, ots_ring,
            region, province, pop_id, device_type, service, device_group, hostname
            FROM alert_incidents.ds_alert
            WHERE created_at >= now() - INTERVAL 6 MINUTE
            GROUP BY tags, severity, topo_set, event_date, description, noc_extra_f1,
            topo_ots_set, service_index, topo_ring_set, obj_id, ots_ring, region,
            province, pop_id, device_type, service, device_group, hostname
            FORMAT JSONEachRow;
  actions:
    - name: create-alert-on-keep-feed
      foreach: "{{ steps.get-new-critical-alerts.results }}"
      provider:
        type: keep
        config: "{{ providers.default-keep }}"
        with:
          alert:
            name: "{{ foreach.value.obj_id }}"
            status: firing
            severity: "{{ foreach.value.severity }}"
            description: "{{ foreach.value.description }}"
            timestamp: "{{ foreach.value.event_date }}"
            labels:
              alert_id: "{{ foreach.value.alert_id }}"
              tags: "{{ foreach.value.tags }}"
              topo_set: "{{ foreach.value.topo_set }}"
              noc_extra_f1: "{{ foreach.value.noc_extra_f1 }}"
              topo_ots_set: "{{ foreach.value.topo_ots_set }}"
              service_index: "{{ foreach.value.service_index }}"
              topo_ring_set: "{{ foreach.value.topo_ring_set }}"
              obj_id: "{{ foreach.value.obj_id }}"
              ots_ring: "{{ foreach.value.ots_ring }}"
              region: "{{ foreach.value.region }}"
              province: "{{ foreach.value.province }}"
              pop_id: "{{ foreach.value.pop_id }}"
              device_type: "{{ foreach.value.device_type }}"
              service: "{{ foreach.value.service }}"
              device_group: "{{ foreach.value.device_group }}"
              hostname: "{{ foreach.value.hostname }}"
              description: "{{ foreach.value.description }}"
              event_date: "{{ foreach.value.event_date }}"
              workflow_tag: "{{ consts.workflow_tag }}"
            fingerprint: "{{ foreach.value.alert_id }}"
```
Could you point out the problems, or give me guidance on preventing keepdb from growing automatically?
Your workflow sets the alert fingerprint to `alert_id`, which is good for deduplication: if `alert_id` is stable and unique for each logical alert, Keep will only store one active alert per fingerprint. However, if your query or workflow generates new or different `alert_id`s every run (for example, if `any(id)` picks a different value or if the grouping changes), you'll end up with many unique fingerprints and rapid database growth.
Keep does not automatically delete old alerts or incidents; records are only soft-deleted, and there's no built-in retention or cleanup job by default. Even with low data volume, the database will grow unless you periodically clean up old or deleted records. You can use a community-provided Helm chart cronjob to automate cleanup (example here), or run manual SQL deletes for old/soft-deleted alerts and incidents. There's an open feature request for retention policies and automatic cleanup (discussion), but for now, regular manual or scripted cleanup is needed to control keepdb size.
To prevent bloat: double-check that your fingerprint logic is stable (ideally, use a business-meaningful ID that doesn't change for the same alert), monitor the number of unique fingerprints in your database, and set up periodic cleanup for old/soft-deleted records. If you need help tuning your query or cleanup scripts, let me know!
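One way to check whether fingerprints are churning is to compare distinct fingerprints against total alerts over time (a sketch against the `alert` table; column names are assumptions based on the schema described above):

```sql
-- If unique_fingerprints grows on every workflow run, fingerprints are not stable
SELECT COUNT(DISTINCT fingerprint) AS unique_fingerprints,
       COUNT(*)                    AS total_alerts
FROM alert;

-- Fingerprints seen only once hint that any(id) produced a new id each run
SELECT fingerprint, COUNT(*) AS occurrences
FROM alert
GROUP BY fingerprint
HAVING COUNT(*) = 1
LIMIT 20;
```

A healthy deduplicating setup shows `unique_fingerprints` plateauing while `total_alerts` grows slowly; both growing in lockstep means every run creates fresh fingerprints.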
@dosu
I think there's a big problem you've missed: the logs tables. Is there a guide/example for creating a cron job to auto-clean the logs tables, like the one for old alerts?
You're right: logs tables can cause significant database bloat in Keep, and there isn't a built-in or documented cron job for cleaning them up. You'll need to implement your own scheduled cleanup targeting tables like ProviderExecutionLog, WorkflowExecutionLog, and EnrichmentLog, all of which have a timestamp field you can use for retention.
For example, to delete logs older than 30 days, you can run:

```sql
DELETE FROM providerexecutionlog WHERE timestamp < NOW() - INTERVAL '30 days';
DELETE FROM workflowexecutionlog WHERE timestamp < NOW() - INTERVAL '30 days';
DELETE FROM enrichmentlog WHERE timestamp < NOW() - INTERVAL '30 days';
```
You can automate this using a Kubernetes CronJob, a system cron job with psql, or your database's internal scheduler. Pick a retention period that fits your needs; there's no official recommendation yet. This approach is similar to how users handle alert/incident cleanup, but you'll need to adapt the process for logs, since there's no community Helm chart or template for logs cleanup yet.
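A minimal sketch of such a Kubernetes CronJob, assuming a Postgres keepdb and a `keepdb-credentials` secret providing the standard `PGHOST`/`PGUSER`/`PGPASSWORD`/`PGDATABASE` variables (all names, the schedule, and the 30-day retention are illustrative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: keepdb-logs-cleanup
spec:
  schedule: "0 3 * * *"        # run daily at 03:00
  concurrencyPolicy: Forbid    # never run two cleanups at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cleanup
              image: postgres:16   # only needed for the psql client
              envFrom:
                - secretRef:
                    name: keepdb-credentials   # assumed secret with PG* vars
              command:
                - sh
                - -c
                - |
                  psql -v ON_ERROR_STOP=1 <<'SQL'
                  DELETE FROM providerexecutionlog WHERE timestamp < NOW() - INTERVAL '30 days';
                  DELETE FROM workflowexecutionlog WHERE timestamp < NOW() - INTERVAL '30 days';
                  DELETE FROM enrichmentlog WHERE timestamp < NOW() - INTERVAL '30 days';
                  SQL
```

`concurrencyPolicy: Forbid` avoids overlapping deletes if a cleanup run takes longer than expected; if your database is MySQL, swap the image and client for `mysql` and adjust the `INTERVAL` syntax accordingly.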
Let me know if you want a sample CronJob manifest or script!
@dosu I'm trying to connect my ClickHouse with Providers:

- Provider Name*: push-alerts
- Clickhouse username*: abc
- Clickhouse password*: xyz
- Clickhouse hostname*: {ip}
- Clickhouse port*: {port}
- Protocol ('clickhouses' for SSL, 'clickhouse' for no SSL, 'http' or 'https')*: clickhouses
- Enable SSL verification: ticked (SSL verification is enabled by default)
- Clickhouse database name: database

-> Connect -> Connection Problem: clickhouse scopes are invalid: { "connect_to_server": "Code: 209. ({ip}:{port)" }
The error clickhouse scopes are invalid: { "connect_to_server": "Code: 209. ({ip}:{port)" } means Keep failed to connect to your ClickHouse server and execute a test query (SHOW TABLES). In ClickHouse, Code: 209 is a socket timeout (SOCKET_TIMEOUT), which usually signals a network, port, or protocol mismatch during connection rather than bad credentials.
Here's what to check:

- Make sure the protocol (`clickhouses`) matches your server's SSL setup and that you're using the correct SSL port (typically 9440 for native SSL, not 8123, which is the HTTP interface) setup reference.
- The hostname should be just the address (no `http://` or `https://` prefix).
- Double-check username, password, and database values.
- If your server uses a self-signed certificate, try disabling SSL verification (`verify: false`) in the provider config to see if that resolves the issue (not recommended for production, but useful for testing).
- Make sure Keep can reach the ClickHouse server over the network (no firewall or routing issues).
- If you still get the error, check your ClickHouse server logs for more details on why the connection is refused or fails.
The provider passes back any connection error directly, so any mismatch in protocol, port, SSL, or credentials will show up as a connection failure like this (code reference).
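Pulling the checklist above together, a working authentication block might look like this (a sketch using the field names shown in the form; Keep's actual provider config keys may differ, and the placeholder `{ip}` stays as your real address):

```yaml
# Illustrative ClickHouse provider authentication config
authentication:
  username: abc
  password: xyz
  host: "{ip}"            # bare address, no http:// or https:// prefix
  port: 9440              # native-protocol SSL port; 9000 for non-SSL native
  protocol: clickhouses   # must match the port's protocol
  verify: false           # skip certificate verification (testing only)
  database: database
```

The most common mistake is pairing `clickhouses` with an HTTP-interface port (8123/8443), which hangs until the socket times out with exactly this Code: 209 error.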
If you want to share your exact provider config (with sensitive info redacted), I can help spot any misconfigurations.