
[šŸ› Bug]: Correlation & Incidents: Backend - Frontend mismatched

Open tmqu130 opened this issue 3 months ago • 10 comments

Describe the bug: I created many rules in the Correlations tab, which used to create several incidents as I wanted. However, since our pentest team sent test alerts to Keep's webhook URL, I have seen the following behaviors:

  1. When I try to open the Correlations tab, it shows "Just a second, getting your data 🚨" seemingly forever, and even when it does finish loading, I only see the default layout of the Correlations tab, without any of the rules I created.

  2. Then I turned to the Incidents tab; it also takes very long to load, and it shows incidents created from older rules (which I deleted a long time ago), rendered in an error style.

tmqu130 avatar Aug 12 '25 08:08 tmqu130

The most common cause for persistent loading, missing rules, and stale incidents in Keep is a backend authentication issue—specifically, missing or misconfigured OAuth2 proxy headers between the frontend and backend pods. If the backend doesn't receive the expected user header (like x-auth-request-email), API requests fail with 401 errors, which leads to infinite loading and empty UI states in Correlations, Incidents, and Facets tabs. Double-check that your frontend and backend are both configured to send and expect the correct headers, and that environment variables like AUTH_TYPE=OAUTH2PROXY, KEEP_OAUTH2_PROXY_USER_HEADER, and KEEP_OAUTH2_PROXY_ROLE_HEADER are set appropriately. You can find more details and troubleshooting steps in this issue.
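As a sketch only, the backend environment for the OAuth2-proxy setup described above might look like the following; the header values are illustrative and must match whatever your oauth2-proxy deployment actually injects:

```yaml
# Backend pod environment (sketch; adapt names/values to your deployment)
env:
  - name: AUTH_TYPE
    value: "OAUTH2PROXY"
  - name: KEEP_OAUTH2_PROXY_USER_HEADER
    value: "x-auth-request-email"    # header your oauth2-proxy injects for the user
  - name: KEEP_OAUTH2_PROXY_ROLE_HEADER
    value: "x-auth-request-groups"   # assumed; match your proxy's group/role header
```

If the proxy sends a different header name than the backend expects, every API call fails with 401 and the UI hangs on the loading state.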

For incidents showing up from deleted rules, Keep uses a soft deletion flag (is_deleted) for rules. If the frontend or backend doesn't properly filter out deleted rules, incidents from old rules may still appear. The incident model links to rules with a foreign key and should cascade deletes, but stale data can persist if the UI or API doesn't update correctly. Reviewing your database state for rules and incidents can help confirm if deleted items are still present or incorrectly marked as visible (code reference).
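A quick way to check the database state, assuming the table names `rule` and `incident` and the `is_deleted` flag mentioned above (column names may differ between Keep versions, so verify against your schema first):

```sql
-- How many rules are visible vs. soft-deleted
SELECT is_deleted, COUNT(*) FROM rule GROUP BY is_deleted;

-- Incidents that still reference soft-deleted rules
SELECT i.id, i.rule_id
FROM incident i
JOIN rule r ON r.id = i.rule_id
WHERE r.is_deleted = true;
```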

Facets rely on backend endpoints that aggregate incident data. If the main data state is inconsistent or API requests fail, facets won't load. Check backend logs for errors when accessing these endpoints, and look for frontend console errors related to CEL queries or state management (facet loading bug).

If authentication and data state look correct, try creating a new rule and see if it appears in the Correlations tab—this can help isolate whether the issue is with data state, authentication, or frontend logic.

To reply, just mention @dosu.



dosubot[bot] avatar Aug 12 '25 08:08 dosubot[bot]

hey @tmqu130 - I'll need backend logs to understand what's going on.

feel free to reach out on our Slack (https://getkeep.slack.com/)

shahargl avatar Aug 12 '25 09:08 shahargl

@dosu

There is a problem: I use a workflow to push alerts from our database to Keep. Although the amount of data is small, keepdb keeps growing larger and larger; I don't know why, or how to address it.

tmqu130 avatar Aug 18 '25 04:08 tmqu130

The main reason keepdb keeps growing is that Keep does not automatically delete old alerts or incidents—records are only soft-deleted (flagged, not removed), and there’s no built-in retention policy or cleanup job by default. This means even a small number of incoming alerts can cause the database to grow over time, especially if your workflow generates related incidents, logs, or metadata for each alert. Manual cleanup is required unless you set up your own process.

To address this:

  • Identify which tables are growing fastest (typically alert, alert_raw, incident, and related tables). You can use SQL queries or your DB admin tool to check row counts and storage usage.
  • Check for old or soft-deleted records (look for fields like is_deleted, deleted_at, or incident status='deleted').
  • There’s a community-provided Helm chart cronjob for database cleanup that you can adapt for your environment (example here)—this can be used to periodically purge old or deleted records from the database.
  • Manual deletion of alerts/incidents is possible via direct SQL, but there’s currently no UI for this (feature request).
  • Review your workflow logic to ensure it isn’t generating redundant or excessive alerts/incidents.
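To find the fastest-growing tables, a sketch for MySQL/MariaDB (adjust the schema name to wherever keepdb lives; for Postgres use `pg_total_relation_size` instead):

```sql
-- Largest tables in the keep schema by on-disk size (MySQL/MariaDB)
SELECT table_name,
       table_rows,
       ROUND((data_length + index_length) / 1024 / 1024, 1) AS size_mb
FROM information_schema.tables
WHERE table_schema = 'keep'   -- adjust to your keepdb schema name
ORDER BY (data_length + index_length) DESC
LIMIT 10;
```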

There’s an open feature request for configurable retention policies and automatic cleanup (discussion), but for now, regular manual or scripted cleanup is needed to control keepdb size.




dosubot[bot] avatar Aug 18 '25 04:08 dosubot[bot]

@dosu

This is my current workflow.yaml to push alerts:

```yaml
workflow:
  id: clickhouse-push-critical-alerts-to-keep
  name: Push New Critical Alerts from ClickHouse to Keep
  description: "Fetch new critical alerts from the ds_alert table and push them to the Keep Feed"
  disabled: false
  triggers:
    - type: interval
      value: 2m
  consts:
    workflow_tag: alerts
  steps:
    - name: get-new-critical-alerts
      provider:
        type: clickhouse
        config: "{{ providers.alert-incidents }}"
        with:
          query: >-
            SELECT any(id) AS alert_id, tags, lower(severity) AS severity, topo_set,
            toTimeZone(created_at, 'Asia/Ho_Chi_Minh') AS event_date, description,
            noc_extra_f1, topo_ots_set, service_index, topo_ring_set, obj_id, ots_ring,
            region, province, pop_id, device_type, service, device_group, hostname
            FROM alert_incidents.ds_alert
            WHERE created_at >= now() - INTERVAL 6 MINUTE
            GROUP BY tags, severity, topo_set, event_date, description, noc_extra_f1,
            topo_ots_set, service_index, topo_ring_set, obj_id, ots_ring, region,
            province, pop_id, device_type, service, device_group, hostname
            FORMAT JSONEachRow;
  actions:
    - name: create-alert-on-keep-feed
      foreach: "{{ steps.get-new-critical-alerts.results }}"
      provider:
        type: keep
        config: "{{ providers.default-keep }}"
        with:
          alert:
            name: "{{ foreach.value.obj_id }}"
            status: firing
            severity: "{{ foreach.value.severity }}"
            description: "{{ foreach.value.description }}"
            timestamp: "{{ foreach.value.event_date }}"
            labels:
              alert_id: "{{ foreach.value.alert_id }}"
              tags: "{{ foreach.value.tags }}"
              topo_set: "{{ foreach.value.topo_set }}"
              noc_extra_f1: "{{ foreach.value.noc_extra_f1 }}"
              topo_ots_set: "{{ foreach.value.topo_ots_set }}"
              service_index: "{{ foreach.value.service_index }}"
              topo_ring_set: "{{ foreach.value.topo_ring_set }}"
              obj_id: "{{ foreach.value.obj_id }}"
              ots_ring: "{{ foreach.value.ots_ring }}"
              region: "{{ foreach.value.region }}"
              province: "{{ foreach.value.province }}"
              pop_id: "{{ foreach.value.pop_id }}"
              device_type: "{{ foreach.value.device_type }}"
              service: "{{ foreach.value.service }}"
              device_group: "{{ foreach.value.device_group }}"
              hostname: "{{ foreach.value.hostname }}"
              description: "{{ foreach.value.description }}"
              event_date: "{{ foreach.value.event_date }}"
              workflow_tag: "{{ consts.workflow_tag }}"
            fingerprint: "{{ foreach.value.alert_id }}"
```

Could you point out the problems, or give me a guide to prevent keepdb from growing out of control?

tmqu130 avatar Aug 18 '25 07:08 tmqu130

Your workflow sets the alert fingerprint to alert_id, which is good for deduplication—if alert_id is stable and unique for each logical alert, Keep will only store one active alert per fingerprint. However, if your query or workflow generates new/different alert_ids every run (for example, if any(id) picks a different value or if the grouping changes), you'll end up with many unique fingerprints and rapid database growth.
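One way to avoid that, sketched below with hypothetical field choices: derive the fingerprint from business-meaningful fields that identify the logical alert, rather than from `any(id)`, which can pick a different row id on every query run. The specific key fields here (`obj_id`, `hostname`, `service`) are assumptions; pick whatever uniquely identifies an alert in your data.

```python
import hashlib

def stable_fingerprint(row: dict) -> str:
    """Derive a deterministic fingerprint from business-meaningful fields,
    instead of any(id), which can return a different value on every run."""
    key_fields = ("obj_id", "hostname", "service")  # assumed identifying fields
    raw = "|".join(str(row.get(f, "")) for f in key_fields)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

# Two query runs returning different raw ids for the same logical alert:
run1 = {"alert_id": "a1", "obj_id": "ring-42", "hostname": "pop-hn-01", "service": "ots"}
run2 = {"alert_id": "b7", "obj_id": "ring-42", "hostname": "pop-hn-01", "service": "ots"}
print(stable_fingerprint(run1) == stable_fingerprint(run2))  # True: same fingerprint
```

With a stable fingerprint, repeated query runs update the same alert in Keep instead of inserting a new row each time.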

Keep does not automatically delete old alerts or incidents; records are only soft-deleted, and there’s no built-in retention or cleanup job by default. Even with low data volume, the database will grow unless you periodically clean up old or deleted records. You can use a community-provided Helm chart cronjob to automate cleanup (example here), or run manual SQL deletes for old/soft-deleted alerts and incidents. There’s an open feature request for retention policies and automatic cleanup (discussion), but for now, regular manual or scripted cleanup is needed to control keepdb size.

To prevent bloat: double-check that your fingerprint logic is stable (ideally, use a business-meaningful ID that doesn’t change for the same alert), monitor the number of unique fingerprints in your database, and set up periodic cleanup for old/soft-deleted records. If you need help tuning your query or cleanup scripts, let me know!
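A quick check of fingerprint cardinality, assuming an `alert` table with a `fingerprint` column (verify the names against your schema):

```sql
-- If unique_fingerprints grows with every workflow run, deduplication is broken
SELECT COUNT(DISTINCT fingerprint) AS unique_fingerprints,
       COUNT(*)                    AS total_rows
FROM alert;
```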




dosubot[bot] avatar Aug 18 '25 07:08 dosubot[bot]

@dosu

I think there's a big problem you've missed: the logs tables. Is there a guide/example for creating a cron job that auto-cleans the logs tables, like the one for old alerts?

tmqu130 avatar Aug 18 '25 08:08 tmqu130

You're right—logs tables can cause significant database bloat in Keep, and there isn't a built-in or documented cron job for cleaning them up. You'll need to implement your own scheduled cleanup targeting tables like ProviderExecutionLog, WorkflowExecutionLog, and EnrichmentLog, all of which have a timestamp field you can use for retention.

For example, to delete logs older than 30 days, you can run:

```sql
-- Postgres interval syntax; for MySQL use: timestamp < NOW() - INTERVAL 30 DAY
DELETE FROM providerexecutionlog WHERE timestamp < NOW() - INTERVAL '30 days';
DELETE FROM workflowexecutionlog WHERE timestamp < NOW() - INTERVAL '30 days';
DELETE FROM enrichmentlog WHERE timestamp < NOW() - INTERVAL '30 days';
```

You can automate this using a Kubernetes CronJob, a system cron job with psql, or your database's internal scheduler. Pick a retention period that fits your needs—there's no official recommendation yet. This approach is similar to how users handle alert/incident cleanup, but you'll need to adapt the process for logs since there's no community Helm chart or template for logs cleanup yet.
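A minimal Kubernetes CronJob sketch for the approach above; the image, secret name, and `DATABASE_URL` are placeholders, and the SQL assumes a Postgres keepdb (adjust the interval syntax for MySQL):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: keepdb-logs-cleanup
spec:
  schedule: "0 3 * * *"               # daily at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cleanup
              image: postgres:16               # assumes keepdb is Postgres
              envFrom:
                - secretRef:
                    name: keepdb-credentials   # placeholder secret providing DATABASE_URL
              command:
                - sh
                - -c
                - >
                  psql "$DATABASE_URL" -c
                  "DELETE FROM providerexecutionlog WHERE timestamp < NOW() - INTERVAL '30 days';
                   DELETE FROM workflowexecutionlog  WHERE timestamp < NOW() - INTERVAL '30 days';
                   DELETE FROM enrichmentlog         WHERE timestamp < NOW() - INTERVAL '30 days';"
```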

Let me know if you want a sample CronJob manifest or script!




dosubot[bot] avatar Aug 18 '25 08:08 dosubot[bot]

@dosu I'm trying to connect my ClickHouse with Providers:

Provider Name*: push-alerts

Authentication:

  • Clickhouse username*: abc
  • Clickhouse password*: xyz
  • Clickhouse hostname*: {ip}
  • Clickhouse port*: {port}
  • Protocol ('clickhouses' for SSL, 'clickhouse' for no SSL, 'http' or 'https')*: clickhouses
  • Enable SSL verification: ticked (SSL verification is enabled by default)
  • Clickhouse database name: database

-> Connect -> Connection Problem: clickhouse scopes are invalid: { "connect_to_server": "Code: 209. ({ip}:{port)" }

tmqu130 avatar Aug 19 '25 05:08 tmqu130

The error clickhouse scopes are invalid: { "connect_to_server": "Code: 209. ({ip}:{port)" } means Keep failed to connect to your ClickHouse server and execute a test query (SHOW TABLES). In ClickHouse, Code: 209 usually signals a network, authentication, or protocol mismatch during connection.

Here’s what to check:

  • Make sure the protocol (clickhouses) matches your server's SSL setup and that you're using the correct port (typically 9440 for the native TLS protocol; 8123 is the plain-HTTP port and 8443 the HTTPS port) setup reference.
  • The hostname should be just the address (no http:// or https:// prefix).
  • Double-check username, password, and database values.
  • If your server uses a self-signed certificate, try disabling SSL verification (verify: false) in the provider config to see if that resolves the issue (not recommended for production, but useful for testing).
  • Make sure Keep can reach the ClickHouse server over the network (no firewall or routing issues).
  • If you still get the error, check your ClickHouse server logs for more details on why the connection is refused or fails.

The provider passes back any connection error directly, so any mismatch in protocol, port, SSL, or credentials will show up as a connection failure like this code reference.
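For testing, a provider authentication block along these lines may help; the field names mirror the form fields above but should be checked against your Keep version, the host and port are placeholders, and `verify: false` is only for ruling out certificate problems:

```yaml
# Sketch of the ClickHouse provider authentication config (field names assumed)
authentication:
  username: abc
  password: xyz
  host: 10.0.0.1        # placeholder; address only, no http:// or https:// prefix
  port: 9440            # native TLS port when protocol is 'clickhouses'
  protocol: clickhouses
  verify: false         # temporarily disable SSL verification for testing only
  database: database
```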

If you want to share your exact provider config (with sensitive info redacted), I can help spot any misconfigurations.




dosubot[bot] avatar Aug 19 '25 05:08 dosubot[bot]