OpenSearch-Dashboards [BUG] Per Document Monitor

Describe the bug

When I update the Query section of a "per document monitor", multiple "_execute" requests are submitted. This leads to high CPU consumption, which crashed my OpenSearch instance.

To Reproduce

Steps to reproduce the behavior:

Go to 'Alerting > Monitors > Create Monitor'
Click on 'Per document monitor'
Select your Index 'opensearch_dashboards_sample_data_flights'
In the Query section, enter:

Query name : x
dayOfWeek is 5

Expected behavior

Results based on the query provided...

OpenSearch Version v 2.17.0

Dashboards Version v 2.17.0

Plugins

None/Default

Host/Environment (please complete the following information):

OS: Linux
Browser and version : Chrome 131.0.6778.86

Dec 04 '24 12:12 Frans-Segooa

[Catch All Triage - 1, 2, 3, 4, 5, 6]

Jan 06 '25 17:01 dblock

Any updates on this? We are seeing a similar issue with AWS OpenSearch as well.

Jan 13 '25 16:01 shukla2009

Experiencing the same behavior (with both 2.15 and 2.17 AWS OpenSearch). Editing the per document monitor (after entering the query) causes high CPU load and JVM memory pressure, which typically last around 4 hours.

During this period, the OpenSearch JVM is busy mostly with GC as the log is full of messages like: [gc][38393] overhead, spent [1.2s] collecting in the last [1.6s]
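To put a number on how much time is going to GC, a line like the one above can be turned into a ratio. This is only a minimal sketch: the pattern is inferred from that single sample message, and the names `GC_OVERHEAD` and `gc_overhead_ratio` are made up for illustration, not part of OpenSearch.

```python
import re

# Pattern inferred from the single "[gc][N] overhead, spent [...]" sample
# quoted above; other time units or layouts are simply not matched.
GC_OVERHEAD = re.compile(
    r"\[gc\]\[(?P<seq>\d+)\] overhead, "
    r"spent \[(?P<spent>[\d.]+)(?P<spent_unit>ms|s)\] "
    r"collecting in the last \[(?P<window>[\d.]+)(?P<window_unit>ms|s)\]"
)


def _seconds(value, unit):
    """Convert a '1.2' + 's' / '800' + 'ms' pair to seconds."""
    return float(value) / 1000.0 if unit == "ms" else float(value)


def gc_overhead_ratio(line):
    """Return the fraction of the window spent in GC, or None if no match."""
    match = GC_OVERHEAD.search(line)
    if match is None:
        return None
    spent = _seconds(match["spent"], match["spent_unit"])
    window = _seconds(match["window"], match["window_unit"])
    return spent / window if window > 0 else None


sample = "[gc][38393] overhead, spent [1.2s] collecting in the last [1.6s]"
ratio = gc_overhead_ratio(sample)
print(f"GC took {ratio:.0%} of the sampled window")
```

For the sample message this works out to 1.2 / 1.6, i.e. roughly three quarters of the window spent collecting.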

Feb 25 '25 09:02 pavsindelar

Hi guys.

Same problem here :(.

OpenSearch Version: 2.19.1 Dashboards Version: 2.19.1

When a user sets up alerting based on a per document monitor, everything goes smoothly until the user tests the query by clicking "Preview query and performance". After some time, in our case, this ended with an error message saying that the search query took longer than 30s.

When I run the same search query in the Discover view, I get the result after 1-2s, so the search query itself is definitely not the problem.

It seems, although I am not sure, that the alerting plugin keeps creating new search tasks. Consequently, after some time, the OpenSearch data node crashes.

Restarting the affected data node does not help, because the cluster then tries to run the tasks on another data node.

New tasks keep being spawned until the index with the tenant data is dropped. It does not matter that the alert was deleted: deleting the problematic alert does not help. The problematic, uncancellable tasks look like this:

$ curl -ks -uosadmin:$ospass "https://localhost:9200/_cat/tasks?v"
action                                                    task_id                         parent_task_id                  type      start_time    timestamp running_time ip            node
cluster:admin/opensearch/alerting/monitor/doclevel/fanout rRgsqmeLTKCouhz7Jau70w:693305   -                               transport 1748259416574 11:36:56  56.2m        100.96.33.62  ofd-data-0
cluster:admin/opensearch/alerting/monitor/doclevel/fanout rRgsqmeLTKCouhz7Jau70w:693387   -                               transport 1748259416705 11:36:56  56.2m        100.96.33.62  ofd-data-0
cluster:admin/opensearch/alerting/monitor/doclevel/fanout rRgsqmeLTKCouhz7Jau70w:811414   -                               transport 1748259841941 11:44:01  49.2m        100.96.33.62  ofd-data-0
cluster:admin/opensearch/alerting/monitor/doclevel/fanout rRgsqmeLTKCouhz7Jau70w:811444   -                               transport 1748259842064 11:44:02  49.2m        100.96.33.62  ofd-data-0
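For anyone who wants to spot these tasks automatically, the listing above can be filtered with a few lines of Python. This is a rough sketch that only parses text already fetched from `_cat/tasks?v`; the helper names and the 50-minute threshold are arbitrary choices for illustration, and the column layout is assumed to match the output shown here.

```python
import re

FANOUT_ACTION = "cluster:admin/opensearch/alerting/monitor/doclevel/fanout"

# Minutes per unit for running_time values such as "56.2m" or "1.5h".
_MINUTES_PER_UNIT = {
    "nanos": 1 / 6e10, "micros": 1 / 6e7, "ms": 1 / 6e4,
    "s": 1 / 60, "m": 1.0, "h": 60.0, "d": 1440.0,
}


def parse_cat_tasks(text):
    """Split whitespace-separated _cat/tasks?v output into one dict per row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def running_minutes(value):
    """Convert a running_time string to minutes; None if unrecognised."""
    match = re.fullmatch(r"([\d.]+)([a-z]+)", value)
    if match is None or match.group(2) not in _MINUTES_PER_UNIT:
        return None
    return float(match.group(1)) * _MINUTES_PER_UNIT[match.group(2)]


def long_running_fanout_tasks(text, min_minutes=50.0):
    """Return task ids of doclevel fanout tasks running >= min_minutes."""
    found = []
    for row in parse_cat_tasks(text):
        minutes = running_minutes(row.get("running_time", ""))
        if row.get("action") == FANOUT_ACTION and minutes is not None \
                and minutes >= min_minutes:
            found.append(row["task_id"])
    return found


# Rows copied from the listing above.
sample = """\
action task_id parent_task_id type start_time timestamp running_time ip node
cluster:admin/opensearch/alerting/monitor/doclevel/fanout rRgsqmeLTKCouhz7Jau70w:693305 - transport 1748259416574 11:36:56 56.2m 100.96.33.62 ofd-data-0
cluster:admin/opensearch/alerting/monitor/doclevel/fanout rRgsqmeLTKCouhz7Jau70w:693387 - transport 1748259416705 11:36:56 56.2m 100.96.33.62 ofd-data-0
cluster:admin/opensearch/alerting/monitor/doclevel/fanout rRgsqmeLTKCouhz7Jau70w:811414 - transport 1748259841941 11:44:01 49.2m 100.96.33.62 ofd-data-0
cluster:admin/opensearch/alerting/monitor/doclevel/fanout rRgsqmeLTKCouhz7Jau70w:811444 - transport 1748259842064 11:44:02 49.2m 100.96.33.62 ofd-data-0
"""
stuck = long_running_fanout_tasks(sample)
print(stuck)
```

With the 50-minute threshold, only the two tasks that have been running for 56.2m are reported; lowering `min_minutes` would include the other two as well.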

After the tenant index is dropped, no new tasks are spawned, but the existing ones stay active. The only thing that helped was attempting to cancel them:

$ curl -k -uosadmin:$ospass -X POST "https://localhost:9200/_tasks/rRgsqmeLTKCouhz7Jau70w:811444/_cancel"
{"node_failures":[{"type":"failed_node_exception","reason":"Failed node [rRgsqmeLTKCouhz7Jau70w]","node_id":"rRgsqmeLTKCouhz7Jau70w","caused_by":{"type":"illegal_argument_exception","reason":"task [rRgsqmeLTKCouhz7Jau70w:811444] doesn't support cancellation"}}],"nodes":{}}

AND restarting the affected node that was trying to execute them.

It is very unpleasant that a user working with the GUI can so easily overload the whole cluster!

Do you have any update, please?

May 26 '25 14:05 LHozzan