graylog2-server
Certain Users-related API requests slow with large number of users
Overview
When a large number of users are present in Graylog, certain user-related API requests are slow. The requests take longer as the number of users in Graylog increases.
The problem was initially reported and investigated in the customer support ticket HS842447159.
Details
In the support ticket, certain actions were reported by the user to be slow (taking more than 20 seconds):
- Viewing the Create new Alert page (of type Email Notification), which loads all users from the database in the process of preparing a dropdown list with all users.
- Sharing certain entities with specific users or teams (using the Share button). For example, sharing a stream using the Share button on the main streams page.
- Any other request that hits the GET /api/users/ endpoint or userService.loadAll(), which loads all users from the database.
Testing
I was able to reproduce the issue by adding 1600 users to my Graylog database in my local development environment. When I attempt to create an email notification alert definition, the page freezes for about 60s while the GET /api/users endpoint responds.
Directly calling the API endpoint produces the same issue. Note that the endpoint is deprecated in favor of using pagination. https://github.com/Graylog2/graylog2-server/blob/bac3fe9f928b415603b9ed1e3d772de32232c8a1/graylog2-server/src/main/java/org/graylog2/rest/resources/users/UsersResource.java#L213
In my testing, sharing entities was not very slow, but I also don't think I am testing the same setup as the customer. They mentioned in the issue that, during the share operation, the /api/authz/shares/entities/grn::::stream:000000000000000000000002/prepare endpoint was returning 1700 available grants. I believe this means that they have likely shared the All Events stream a large number of times.
Root Cause
It appears that several areas of the Graylog users API and frontend use the userService.loadAll() method, which loads all users from MongoDB instead of loading a page at a time or searching for smaller subsets of users. This does not present a problem in smaller setups, but as the number of users increases, it can cause a non-linear increase in the time it takes for the requests to complete.
Possible Solution
Refactor the pages and API requests that load all users from the database to only load a page at a time, to search for smaller subsets of users, or to perform filtering at the database layer.
This will require backend and frontend work, since I believe the same issue exists in both locations.
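To make the difference concrete, here is a minimal sketch of the two access patterns. It is not Graylog code: the User class, the in-memory ALL_USERS list, and the load_all / load_page functions are hypothetical stand-ins for the users collection and the service methods.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class User:
    username: str


# Hypothetical in-memory stand-in for the users collection (1600 users, as in my test).
ALL_USERS: List[User] = [User(f"user{i:04d}") for i in range(1600)]


def load_all() -> List[User]:
    """Mimics the load-everything pattern: the cost grows with the total user count."""
    return list(ALL_USERS)


def load_page(page: int, per_page: int = 50) -> List[User]:
    """Returns a single 1-based page, so the cost is bounded by per_page."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    start = (page - 1) * per_page
    return ALL_USERS[start:start + per_page]
```

In a real implementation the slicing would be pushed down to the database query (skip and limit) rather than applied to a list that has already been loaded.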
Example 1: Email alert callback
Since the Email Alert definition page loads a full list of users, it looks like both the backend and the frontend code load all users. https://github.com/Graylog2/graylog2-server/blob/3e896ce0a8b9444fd4e1039ec0b3832c7435f12f/graylog2-server/src/main/java/org/graylog2/alarmcallbacks/EmailAlarmCallback.java#L219-L221
Example 2: Sharing entities with users
In the process of fulfilling this API request, all users are loaded from the database: https://github.com/Graylog2/graylog2-server/blob/f35df42e165ac570b8b27de3f8eeac85e74ed610/graylog2-server/src/main/java/org/graylog/security/rest/EntitySharesResource.java#L111-L114
Down the call chain, the actual all-users query is executed here: https://github.com/Graylog2/graylog2-server/blob/a8885f551b9be78166b357fee0ff4f792e9bf035/graylog2-server/src/main/java/org/graylog/security/shares/DefaultGranteeService.java#L67-L73
It is reached via this location: https://github.com/Graylog2/graylog2-server/blob/56cc1b9da460a97b7f7956680cecfe27fb6512c4/graylog2-server/src/main/java/org/graylog/security/shares/EntitySharesService.java#L98
These were two examples I found while investigating the support ticket. But, I did not perform an extensive review or profiling to compile a full list.
Workaround
Aside from reducing the number of users present in Graylog, there is no other known workaround at this time.
Context
Your Environment
- Graylog Version: 4.4 Snapshot
- Java Version: Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
- Elasticsearch Version:
- MongoDB Version:
- Operating System: macOS Monterey 12.3.1
- Browser version:
We will need to work on optimizing the affected application paths. I will work on gathering a list of all areas that need to be updated and will document those here on this issue.
@danotorrey this might be a workaround: https://github.com/Graylog2/graylog2-server/issues/11112
Permissions are the bottleneck, since for every user we need to evaluate roles and teams to determine the final set of permissions. Unless we cache this information, I don't see that this can be sped up much.
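As a rough illustration of the caching idea (not the actual Graylog implementation), the sketch below memoizes the per-role lookup so that resolving permissions for many users does not repeat the same expensive query. The role and user data, the lookups counter, and both function names are hypothetical.

```python
from functools import lru_cache
from typing import Dict, FrozenSet, List

# Hypothetical data: role name -> permissions, and username -> assigned roles.
ROLE_PERMISSIONS: Dict[str, FrozenSet[str]] = {
    "reader": frozenset({"streams:read", "dashboards:read"}),
    "editor": frozenset({"streams:read", "streams:edit"}),
}
USER_ROLES: Dict[str, List[str]] = {
    f"user{i}": ["reader", "editor"] if i % 2 else ["reader"] for i in range(3000)
}

lookups = 0  # counts how often the "expensive" role lookup actually runs


@lru_cache(maxsize=None)
def permissions_for_role(role: str) -> FrozenSet[str]:
    """Resolves one role; the cache ensures each role is looked up only once."""
    global lookups
    lookups += 1  # stands in for a database round trip
    return ROLE_PERMISSIONS[role]


def permissions_for_user(username: str) -> FrozenSet[str]:
    """Unions the (cached) permissions of all roles assigned to the user."""
    result: FrozenSet[str] = frozenset()
    for role in USER_ROLES[username]:
        result |= permissions_for_role(role)
    return result


all_permissions = {name: permissions_for_user(name) for name in USER_ROLES}
```

With this data set, 3000 users are resolved with only two role lookups. A real cache would also need invalidation whenever roles or team memberships change, which this sketch does not address.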
As mentioned above, we provide a system setting to prohibit sharing of entities with individual users. If you only share with teams, the sharing drop-down is generally much faster. From a system admin point of view, it's also preferable to dealing with a bunch of users individually. So we should guide users toward using this feature.
@patrickmann if you were able to reproduce the issue, could you provide a few hints about how you achieved that? Or some more insights on where exactly you think the bottleneck is? Thank you :)
To reproduce the issue I created 3000 users via POST /api/user, each with a basic set of permissions that I copied from an existing user. Then I assigned all of the users to a single team via PUT /api/plugins/org.graylog.plugins.security/teams/{team-ID}/members. Finally I added 1000 user session entries to the DB via POST /api/system/sessions.
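For anyone who wants to build a similar data set, the user-creation step could be scripted roughly as follows. This is a sketch under assumptions: the payload field names, the loadtest username scheme, the example.org addresses, and the X-Requested-By header value are my guesses and should be checked against the API browser of your Graylog version before use.

```python
import base64
import json
import urllib.request
from typing import Dict, Iterator, List


def build_user_payloads(count: int, permissions: List[str]) -> Iterator[Dict[str, object]]:
    """Yields one request body per test user. All field names are assumptions."""
    for i in range(count):
        name = f"loadtest{i:04d}"
        yield {
            "username": name,
            "email": f"{name}@example.org",
            "first_name": "Load",
            "last_name": f"Test {i}",
            "password": "change-me-please",
            "permissions": list(permissions),
        }


def post_json(base_url: str, path: str, body: Dict[str, object],
              user: str, password: str) -> int:
    """Sends one JSON POST with basic auth and returns the HTTP status code."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
            "X-Requested-By": "load-test",  # assumed to be required for POST requests
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A driver loop would then call post_json once per payload, for example against a local development instance, using the user-creation path mentioned above.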
With this data set I observed the following:
- get users with permissions and sessions: 16.44s
- get users with permissions, no sessions: 15.34s
- get users with sessions, no permissions: 1.06s
- get users, no permissions, no sessions: 1.04s
The method that initiates all of the permissions gathering is getPermissionsForUser.
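The four timings above can be broken down with some simple arithmetic. The snippet only uses the numbers reported in this comment; the variable names are mine.

```python
# Timings reported above, in seconds, keyed by (with_permissions, with_sessions).
timings = {
    (True, True): 16.44,
    (True, False): 15.34,
    (False, True): 1.06,
    (False, False): 1.04,
}
USER_COUNT = 3000

baseline = timings[(False, False)]

# Extra time attributable to permissions alone (no sessions): about 14.30 s.
permissions_cost = timings[(True, False)] - baseline

# Extra time attributable to sessions alone (no permissions): about 0.02 s.
sessions_cost = timings[(False, True)] - baseline

# Share of the permissions-only request spent on permissions: about 93%.
permissions_share = permissions_cost / timings[(True, False)]

# Average permission-gathering cost per user: about 4.8 ms.
per_user_ms = permissions_cost / USER_COUNT * 1000
```

In other words, gathering permissions accounts for roughly 14 of the 15 to 16 seconds, while sessions add about a second at most.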
@patrickmann @mpfz0r I think reducing the load time in the backend is not enough. If you have thousands of users and return all of them, the user interface will also become very slow. The JSON payload will become quite large, and also rendering all entries in the DOM is slow.
We need to implement an asynchronous way of fetching users for use cases where we must put them in a drop-down menu. The problem also affects other entities that we need to put in a drop-down list. (e.g., streams) Listing many entities on pages works because we already implement pagination and filtering.
For drop-down menus, we need pagination and filtering as well. Showing multiple pages in a drop-down is quite hard, of course.
One way of implementing it could be to only show a limited number of users by default in the drop-down menu. Let's say you have 5000 users in the database. Opening the drop-down would only show the first 50 or 100 entries. Users need to enter a filter term to get access to the other entries. We use something like that for the stream selection on the search page.
That way, we don't have to load all entities from the database all the time. We already have an API endpoint to return paginated and filtered user entries. We might be able to reuse that and only have to implement a new type of drop-down field - ideally a reusable one - that can use the API. (if we don't have something like that already)
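A minimal sketch of the behavior I have in mind, assuming a hypothetical dropdown_options function and an in-memory list in place of the paginated user endpoint:

```python
from typing import List, NamedTuple


class DropdownResult(NamedTuple):
    options: List[str]   # the entries actually shown in the drop-down
    total_matches: int   # lets the UI hint that more entries exist


# Hypothetical stand-in for the 5000 users in the database.
USERNAMES: List[str] = [f"user{i:04d}" for i in range(5000)]


def dropdown_options(query: str = "", limit: int = 50) -> DropdownResult:
    """Returns at most `limit` usernames containing the filter text."""
    needle = query.strip().lower()
    matches = [name for name in USERNAMES if needle in name.lower()]
    return DropdownResult(matches[:limit], len(matches))
```

Opening the drop-down corresponds to dropdown_options(), which returns the first 50 entries; typing narrows the result, for example dropdown_options("4999"). In a real implementation the filter and the limit would be applied by the backend query instead of in memory.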
What do you think?
I agree with the solution suggested above by @bernd. I believe this is a common pattern for performantly handling collections of an indeterminate size.
I also think it is ok if multiple separate PRs are filed to fully resolve the issue. But, I do agree that to consider the issue resolved, we should implement this frontend solution. I also think there is still a considerable performance impact from loading all users. The enhancements already implemented in https://github.com/Graylog2/graylog2-server/pull/12743 should be quite helpful as well.
I checked and did not see any existing user dropdown that utilizes this pattern. So, it would appear that a new implementation will be needed. It would be great if we could create a more general component that could be used for dropdowns containing potentially large lists.
Following up on status of this issue. @patrickmann
@BBruce-Graylog Paginated queries have been implemented for simple user dropdowns (email recipients). However, this isn't applicable for dropdowns that are populated with complex data, in particular for sharing with users and teams.
@ousmaneo is currently evaluating whether previously applied optimisations are good enough. Sharing with individual users is discouraged for a large number of users, so we don't want to invest a lot into improving that scenario.
@patrickmann Ok, just let us know what is decided so we can update the customer since this issue has been open from 4/2022 with Support.
Hi @BBruce-Graylog, I did a small test for the sharing feature to compare the previous implementation and the change we already introduced in Graylog to improve all select boxes in Graylog when the number of entities is very high (> 1000). This applies to sharing entities with users too.
Here, I'm testing with roughly 4000 users in the instance. Without the change I mentioned above, this is the result when trying to share a search with a user:
- The Select takes some time to open.
- Scrolling is very slow, and the select list hangs for a while.
We see significant improvement when we apply the changes that are virtualizing the Select when it hits 1000 users.
- Select is faster to open
- Scrolling is fast without lags
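For readers unfamiliar with virtualization: the idea is to render only the rows that are currently in view, plus a small buffer, instead of all entries. The following sketch shows the core calculation in a generic form. It is not the code of the Select component we use, and the function and parameter names are illustrative only.

```python
from typing import Tuple


def visible_range(scroll_top: int, row_height: int, viewport_height: int,
                  total_rows: int, overscan: int = 5) -> Tuple[int, int]:
    """Returns the half-open range [start, end) of row indices to render."""
    first = scroll_top // row_height              # first row touching the viewport
    visible = -(-viewport_height // row_height)   # rows per viewport, rounded up
    start = max(0, first - overscan)              # buffer above, clamped to the list
    end = min(total_rows, first + visible + overscan)  # buffer below, clamped
    return start, end
```

With 4000 users, a 300 px viewport, and 30 px rows, only about 20 rows are rendered at any scroll position, which is why opening and scrolling stay fast.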
Let me know if you have any questions.
@BBruce-Graylog I am closing out this issue, as there is no further action required.
- The user dropdown list (in email notification) is redesigned and can handle an arbitrary number of users.
- The user and teams sharing list benefits from a general UI improvement (as of 5.0); performance for a reasonable number of users is good. For a very large number of users, we recommend disabling sharing with individual users.