OpenSearch-Dashboards
[BUG] Per Document Monitor
Describe the bug
On updating the Query section of a "per document monitor", I observed multiple "_execute" requests being submitted, which led to high CPU consumption that crashed my OpenSearch instance.
To Reproduce
Steps to reproduce the behavior:
- Go to 'Alerting > Monitors > Create Monitor'
- Click on 'Per document monitor'
- Select your Index 'opensearch_dashboards_sample_data_flights'
- In the Query section, enter:
  - Query name: x
  - dayOfWeek is 5
Expected behavior
Results based on the query provided...
OpenSearch Version v 2.17.0
Dashboards Version v 2.17.0
Plugins
Please list all plugins currently enabled.
None/Default
Host/Environment (please complete the following information):
- OS: Linux
- Browser and version : Chrome 131.0.6778.86
Any updates on this? We are seeing a similar issue with AWS OpenSearch as well.
Experiencing the same behavior (with both 2.15 and 2.17 on AWS OpenSearch). Editing the per document monitor (after entering the query) causes high CPU load and JVM memory pressure, which typically last around 4 hours.
During this period, the OpenSearch JVM is mostly busy with GC; the log is full of messages like: [gc][38393] overhead, spent [1.2s] collecting in the last [1.6s]
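To put a number on that GC message: 1.2 s spent collecting out of the last 1.6 s means roughly 75% of the time went to garbage collection. A minimal sketch for extracting that ratio from log lines follows. The function name `gc_overhead_pct` is hypothetical, and it assumes the durations are given in `s` or `ms` and that the two bracketed durations are the last two bracketed values on the line, as in the message quoted above.

```shell
#!/bin/sh
# Hypothetical helper: reads log lines on stdin and, for every
# "overhead, spent [X] collecting in the last [Y]" message,
# prints X/Y as a percentage.
gc_overhead_pct() {
  awk -F'[][]' '
    # Convert a duration such as "1.2s" or "800ms" to seconds (assumed units).
    function secs(v) {
      if (v ~ /ms$/) { sub(/ms$/, "", v); return v / 1000 }
      sub(/s$/, "", v)
      return v + 0
    }
    # Fields are split on "[" and "]"; the two durations are the last
    # two bracketed values, i.e. fields NF-3 and NF-1.
    /overhead, spent/ {
      printf "%.0f%%\n", 100 * secs($(NF-3)) / secs($(NF-1))
    }
  '
}
```

Piping the node's log through `gc_overhead_pct` gives one percentage per GC overhead message, which makes it easier to see how long the node stays saturated.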
Hi guys.
Same problem here :(.
OpenSearch Version: 2.19.1 Dashboards Version: 2.19.1
When a user creates an alert based on a per document monitor, everything goes smoothly until the user tests the query by clicking "Preview query and performance". After some time, in our case, this ended with an error message saying that the search query took longer than 30 s.
When I run the same search query in the Discover view, I get a result after 1-2 s, so the search query itself is definitely not the problem.
It seems, although I am not sure, that the alerting plugin keeps creating new search tasks. As a result, after some time the OpenSearch data node crashes.
Restarting the affected data node does not help, because the cluster then tries to run the tasks on another data node.
New tasks keep being spawned until the index with the tenant data is dropped. Deleting the problematic alert does not help; the tasks continue even after the alert is gone. The problematic, uncancellable tasks look like this:
$ curl -ks -uosadmin:$ospass "https://localhost:9200/_cat/tasks?v"
action task_id parent_task_id type start_time timestamp running_time ip node
cluster:admin/opensearch/alerting/monitor/doclevel/fanout rRgsqmeLTKCouhz7Jau70w:693305 - transport 1748259416574 11:36:56 56.2m 100.96.33.62 ofd-data-0
cluster:admin/opensearch/alerting/monitor/doclevel/fanout rRgsqmeLTKCouhz7Jau70w:693387 - transport 1748259416705 11:36:56 56.2m 100.96.33.62 ofd-data-0
cluster:admin/opensearch/alerting/monitor/doclevel/fanout rRgsqmeLTKCouhz7Jau70w:811414 - transport 1748259841941 11:44:01 49.2m 100.96.33.62 ofd-data-0
cluster:admin/opensearch/alerting/monitor/doclevel/fanout rRgsqmeLTKCouhz7Jau70w:811444 - transport 1748259842064 11:44:02 49.2m 100.96.33.62 ofd-data-0
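On a busy cluster the fanout tasks can be hard to spot among everything else in the task list. A minimal sketch for filtering them out of the `_cat/tasks?v` output is shown here. The function name `fanout_tasks` is hypothetical, and it assumes the nine-column layout shown in the listing above (action, task_id, ..., running_time, ip, node).

```shell
#!/bin/sh
# Hypothetical helper: reads `_cat/tasks?v` output on stdin and prints
# task_id, running_time and node for every doc-level monitor fanout task.
# Assumes the column order shown above: $2 = task_id, $7 = running_time,
# $9 = node.
fanout_tasks() {
  awk '$1 == "cluster:admin/opensearch/alerting/monitor/doclevel/fanout" {
    print $2, $7, $9
  }'
}

# Usage, with the same curl call as above:
#   curl -ks -uosadmin:$ospass "https://localhost:9200/_cat/tasks?v" | fanout_tasks
```

Each output line then holds one stuck task, for example its ID, how long it has been running, and the data node it is running on.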
After the tenant index is dropped, new tasks are no longer spawned, but the existing ones stay active. The only thing that helped was attempting to cancel them:
$ curl -k -uosadmin:$ospass -X POST "https://localhost:9200/_tasks/rRgsqmeLTKCouhz7Jau70w:811444/_cancel"
{"node_failures":[{"type":"failed_node_exception","reason":"Failed node [rRgsqmeLTKCouhz7Jau70w]","node_id":"rRgsqmeLTKCouhz7Jau70w","caused_by":{"type":"illegal_argument_exception","reason":"task [rRgsqmeLTKCouhz7Jau70w:811444] doesn't support cancellation"}}],"nodes":{}}
AND restarting the affected node that was trying to execute them.
It is very unpleasant that a user working in the GUI can so easily overload the whole cluster!
Is there any update on this, please?