graylog2-server
Improve handling of long running search queries on search/dashboards.
Why?
Searches/dashboards querying large amounts of data can put significant load on the OpenSearch cluster. The current way of retrieving search results for these in the frontend is an all-or-nothing approach, where all of the data for the current search/dashboard tab is retrieved and results are shown only when it has completed. This leads to:
- Users having to wait for the slowest widget to finish before any result is displayed
- Users getting the impression that the search is "stuck" and triggering another one, which makes the situation worse
What?
For the reasons above, we should take care that:
- The wait time for a user is minimized, by:
  - Considering caching for long time ranges (within the margin of a time proportional to the length of the time range queried, e.g. a relative time range for `last 7 days -> now` returns results from cache if queried within five minutes)
  - Considering returning partial results (widgets finishing early will already be shown) through asynchronous search
- The amount of executed searches is minimized, by:
  - Making sure that searches are cancelled when the user triggers another search while the current one is running
  - Making sure that searches are cancelled when they are taking too long (`cancel_after_time_interval`)
- The amount of data being queried is minimized, by:
  - Only retrieving results and rendering widgets which are currently in the viewport, lazy-loading the rest
  - Preventing the user from refreshing the page by providing feedback when searches take longer, so they know it is not stuck
- Any other ideas for the points above
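To make the caching idea above more concrete, here is a minimal sketch of a cache margin that grows with the queried time range. It assumes a simple linear rule calibrated on the one example given (7 days maps to five minutes); the names `cache_margin` and `can_serve_from_cache` are hypothetical and not part of Graylog.

```python
from datetime import timedelta

# Assumption: the margin is a fixed fraction of the queried range, calibrated
# so that a 7-day range yields the five-minute margin from the example.
RANGE_TO_MARGIN_DIVISOR = timedelta(days=7) // timedelta(minutes=5)  # 2016


def cache_margin(time_range: timedelta) -> timedelta:
    """Return how old a cached result may be for the given time range."""
    return time_range / RANGE_TO_MARGIN_DIVISOR


def can_serve_from_cache(time_range: timedelta, cached_age: timedelta) -> bool:
    """True if a cached result of the given age is still within the margin."""
    return cached_age <= cache_margin(time_range)
```

With this rule, a `last 7 days` query can reuse a result that is up to five minutes old, while a `last 1 day` query only tolerates a result that is less than a minute old. Whether a linear rule is the right choice is an open question.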
Your Environment
- Graylog Version:
- OpenSearch Version:
- MongoDB Version:
- Operating System:
- Browser version:
Some further investigations and decisions can be found here: https://graylogdocumentation.atlassian.net/wiki/spaces/SEARCH/pages/3138191452/Long+running+search+handling
Another option worth considering, found by Jan, already considered in data-node: #18590
Let's have a talk on how we want to proceed with the remaining ideas.
We implemented:
- The amount of executed searches is minimized, by:
  - Making sure that searches are cancelled when the user triggers another search while the current one is running
  - Making sure that searches are cancelled when they are taking too long (`cancel_after_time_interval`)
  - Preventing the user from refreshing the page by providing feedback when searches take longer, so they know it is not stuck
We considered these the most rewarding points. From now on, we will wait for future feedback to see if we need to implement further measures. Therefore, and because we would need to reformulate future steps, I am closing this issue as completed.
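For readers who want a picture of the two cancellation rules listed above, the following is a rough client-side sketch, not the actual Graylog implementation. It does not talk to OpenSearch; the `timeout_seconds` value merely stands in for the server-side `cancel_after_time_interval` setting, and `SearchRunner` and the demo searches are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple


class SearchRunner:
    """Keeps at most one search in flight; cancels superseded or slow ones."""

    def __init__(self, timeout_seconds: float):
        self._timeout = timeout_seconds
        self._current: Optional[asyncio.Task] = None

    async def run(self, search: Callable[[], Awaitable[str]]) -> Tuple[str, Optional[str]]:
        # Rule 1: a new search cancels the one that is still running.
        if self._current is not None and not self._current.done():
            self._current.cancel()
        task = asyncio.create_task(search())
        self._current = task

        # Rule 2: give up on searches that exceed the time limit.
        done, _ = await asyncio.wait({task}, timeout=self._timeout)
        if not done:
            task.cancel()
            return ("timed_out", None)
        if task.cancelled():
            return ("superseded", None)
        return ("completed", task.result())


async def demo():
    runner = SearchRunner(timeout_seconds=0.2)

    async def slow_search() -> str:
        await asyncio.sleep(5)  # simulates a long-running query
        return "slow result"

    async def fast_search() -> str:
        await asyncio.sleep(0.01)
        return "fast result"

    first = asyncio.ensure_future(runner.run(slow_search))
    await asyncio.sleep(0.05)  # let the first search start
    second = await runner.run(fast_search)  # supersedes the first search
    superseded = await first
    timed_out = await runner.run(slow_search)  # exceeds the time limit
    return superseded, second, timed_out
```

Running `asyncio.run(demo())` returns `("superseded", None)` for the first search, `("completed", "fast result")` for the second, and `("timed_out", None)` for the third. Returning a status instead of raising lets the caller show feedback to the user rather than an error.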