graylog2-server Certain Users-related API requests slow with large number of users

Open danotorrey opened this issue 2 years ago • 7 comments

Overview

When a large number of users are present in Graylog, certain user-related API requests are slow. The requests take longer as the number of users in Graylog increases.

The problem was initially reported and investigated in the customer support ticket HS842447159.

Details

In the support ticket, certain actions were reported by the user to be slow (taking more than 20 seconds):

Viewing the Create new Alert page (of type Email Notification), which loads all users from the database in the process of preparing a dropdown list with all users.
Sharing certain entities with specific users or teams (using the Share button). For example, sharing a stream using the Share button on the main streams page.
Any other request that hits the GET /api/users/ endpoint or userService.loadAll(), which loads all users from the database.

Testing

I was able to reproduce the issue by adding 1600 users to my Graylog database in my local development environment. When I attempt to create an email notification alert definition, the page freezes for about 60s while the GET /api/users request is in flight.

Directly calling the API endpoint produces the same issue. Note that the endpoint is deprecated in favor of using pagination. https://github.com/Graylog2/graylog2-server/blob/bac3fe9f928b415603b9ed1e3d772de32232c8a1/graylog2-server/src/main/java/org/graylog2/rest/resources/users/UsersResource.java#L213

In my testing, sharing entities was not very slow, but I also don't think I am testing the same setup as the customer. They mentioned in the issue that, during the share operation, the /api/authz/shares/entities/grn::::stream:000000000000000000000002/prepare endpoint was returning 1700 available grants. I believe this means that they have likely shared the All Events stream a large number of times.

Root Cause

It appears that several areas of the Graylog users API and frontend use the userService.loadAll() method, which loads all users from MongoDB instead of loading one page at a time or searching for smaller subsets of users. This is not a problem in smaller setups, but as the number of users increases, the time it takes for the requests to complete can grow non-linearly.

Possible Solution

Refactor the pages and API requests that load all users from the database to only load a page at a time, to search for smaller subsets of users, or to perform filtering at the database layer.

This will require backend and frontend work, since I believe the same issue exists in both locations.
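As a rough illustration of the page-at-a-time idea, here is a minimal, language-neutral sketch in Python. The names User, Page, and find_users_page are hypothetical and do not exist in Graylog; the in-memory list only stands in for the users collection. A real implementation would push the filter and the skip/limit down to MongoDB rather than filtering in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    username: str
    full_name: str


@dataclass(frozen=True)
class Page:
    items: list
    total: int      # number of users matching the query, across all pages
    page: int
    per_page: int


def find_users_page(users, query="", page=1, per_page=50):
    """Return a single page of the users whose name contains `query`.

    Hypothetical helper: `users` is an in-memory stand-in for the database.
    """
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    needle = query.casefold()
    matches = [
        u for u in users
        if needle in u.username.casefold() or needle in u.full_name.casefold()
    ]
    start = (page - 1) * per_page
    return Page(matches[start:start + per_page], len(matches), page, per_page)


# Usage: 1600 users, as in the reproduction above; only 50 are returned per call.
all_users = [User(f"user{i:04d}", f"User {i}") for i in range(1600)]
second_page = find_users_page(all_users, page=2, per_page=50)
print(len(second_page.items), second_page.total)
```

The caller receives the total count alongside the page, so a UI can show how many entries exist without ever materializing all of them.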

Example 1: Email alert callback

Since the Email Alert definition page loads a full list of users, it looks like both the backend and the frontend code load all users. https://github.com/Graylog2/graylog2-server/blob/3e896ce0a8b9444fd4e1039ec0b3832c7435f12f/graylog2-server/src/main/java/org/graylog2/alarmcallbacks/EmailAlarmCallback.java#L219-L221

Example 2: Sharing entities with users

In the process of fulfilling this API request, all users are loaded from the database: https://github.com/Graylog2/graylog2-server/blob/f35df42e165ac570b8b27de3f8eeac85e74ed610/graylog2-server/src/main/java/org/graylog/security/rest/EntitySharesResource.java#L111-L114 Down the call chain, the actual all-users query is executed here: https://github.com/Graylog2/graylog2-server/blob/a8885f551b9be78166b357fee0ff4f792e9bf035/graylog2-server/src/main/java/org/graylog/security/shares/DefaultGranteeService.java#L67-L73 via this location: https://github.com/Graylog2/graylog2-server/blob/56cc1b9da460a97b7f7956680cecfe27fb6512c4/graylog2-server/src/main/java/org/graylog/security/shares/EntitySharesService.java#L98

These were two examples I found while investigating the support ticket. However, I did not perform an extensive review or profiling to compile a full list.

Workaround

Aside from reducing the number of users present in Graylog, there is no other known workaround at this time.

Context

Your Environment

Graylog Version: 4.4 Snapshot
Java Version: Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Elasticsearch Version:
MongoDB Version:
Operating System: macOS Monterey 12.3.1
Browser version:

Apr 11 '22 19:04 danotorrey

We will need to work on optimizing the affected application paths. I will work on gathering a list of all areas that need to be updated and will document those here on this issue.

Apr 12 '22 12:04 danotorrey

@danotorrey this might be a workaround: https://github.com/Graylog2/graylog2-server/issues/11112

Apr 12 '22 15:04 mpfz0r

Permissions are the bottleneck, since for every user we need to evaluate roles and teams to determine the final set of permissions. Unless we cache this information, I don't see that this can be sped up much.
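To make the caching idea concrete, here is a minimal Python sketch, not Graylog's implementation. It assumes that a user's final permission set is the union of direct permissions, role permissions, and the permissions of roles granted through teams. All names, the role and team tables, and the permission strings are made up for illustration.

```python
from functools import lru_cache

# Hypothetical lookup tables standing in for database collections.
ROLE_PERMISSIONS = {
    "Reader": frozenset({"streams:read", "dashboards:read"}),
    "Alert Manager": frozenset({"eventdefinitions:create", "eventdefinitions:read"}),
}
TEAM_ROLES = {"ops": ("Reader", "Alert Manager")}


@lru_cache(maxsize=None)
def permissions_for_role(role):
    # Each cache miss represents one (simulated) database round trip.
    return ROLE_PERMISSIONS[role]


@lru_cache(maxsize=None)
def permissions_for_team(team):
    # A team's permissions are the union of the permissions of its roles.
    return frozenset().union(*(permissions_for_role(r) for r in TEAM_ROLES[team]))


def permissions_for_user(direct, roles, teams):
    """Union of direct, role-derived, and team-derived permissions."""
    result = set(direct)
    for role in roles:
        result |= permissions_for_role(role)
    for team in teams:
        result |= permissions_for_team(team)
    return result


# Usage: 3000 users sharing one role and one team resolve each role only once.
for i in range(3000):
    perms = permissions_for_user({f"users:edit:user{i}"}, ("Reader",), ("ops",))
print(permissions_for_role.cache_info().misses)
```

The trade-off is invalidation: whenever a role or team changes, the cached entries have to be dropped (for lru_cache, via cache_clear()), otherwise users would keep stale permissions.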

As mentioned above, we provide a system setting to prohibit sharing of entities with individual users. If you only share with teams, the sharing drop-down is generally much faster. From a system admin point of view, it's also preferable to dealing with a bunch of users individually. So we should guide users toward using this feature.

May 30 '22 09:05 patrickmann

@patrickmann if you were able to reproduce the issue, could you provide a few hints about how you achieved that? Or some more insights on where exactly you think the bottleneck is? Thank you :)

May 30 '22 09:05 mpfz0r

To reproduce the issue I created 3000 users via POST /api/user, each with a basic set of permissions that I copied from an existing user. Then I assigned all of the users to a single team via PUT /api/plugins/org.graylog.plugins.security/teams/{team-ID}/members. Finally I added 1000 user session entries to the DB via POST /api/system/sessions.

With this data set I observed the following:

get users with permissions and sessions: 16.44s
get users with permissions, no sessions: 15.34s
get users with sessions, no permissions: 1.06s
get users, no permissions, no sessions: 1.04s

The method that initiates all of the permissions gathering is getPermissionsForUser.
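The four measurements above can be broken down with a little arithmetic. The short Python snippet below only restates the reported numbers; the per-user figure assumes the cost is spread evenly across the 3000 users in the data set, which the measurements do not prove.

```python
USERS = 3000  # size of the test data set described above

# Reported timings in seconds, keyed by (with_permissions, with_sessions).
timings = {
    (True, True): 16.44,
    (True, False): 15.34,
    (False, True): 1.06,
    (False, False): 1.04,
}

baseline = timings[(False, False)]
permissions_cost = timings[(True, False)] - baseline  # time added by permissions
sessions_cost = timings[(False, True)] - baseline     # time added by sessions
per_user_ms = permissions_cost / USERS * 1000         # assumes an even spread
permissions_share = permissions_cost / timings[(True, False)]

print(f"permissions add {permissions_cost:.2f}s, sessions add {sessions_cost:.2f}s")
print(f"about {per_user_ms:.1f} ms per user, {permissions_share:.0%} of the request")
```

In other words, gathering permissions accounts for roughly 14.3 of the 15.34 seconds, while sessions contribute almost nothing when permissions are not loaded.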

May 30 '22 10:05 patrickmann

@patrickmann @mpfz0r I think reducing the load time in the backend is not enough. If you have thousands of users and return all of them, the user interface will also become very slow. The JSON payload will become quite large, and also rendering all entries in the DOM is slow.

We need to implement an asynchronous way of fetching users for use cases where we must put them in a drop-down menu. The problem also affects other entities that we need to put in a drop-down list (e.g., streams). Listing many entities on pages works because we already implement pagination and filtering.

For drop-down menus, we need pagination and filtering as well. Showing multiple pages in a drop-down is quite hard, of course.

One way of implementing it could be to only show a limited number of users by default in the drop-down menu. Let's say you have 5000 users in the database. Opening the drop-down would only show the first 50 or 100 entries. Users need to enter a filter criterion to get access to the other entries. We use something like that for the stream selection on the search page.

That way, we don't have to load all entities from the database all the time. We already have an API endpoint to return paginated and filtered user entries. We might be able to reuse that and only have to implement a new type of drop-down field - ideally a reusable one - that can use the API. (if we don't have something like that already)
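The proposal could be sketched roughly as follows (Python for brevity, although the real component would live in the frontend). Both dropdown_options and the search callback are hypothetical; the callback stands in for the existing paginated and filtered users endpoint, whose exact request and response shape is not shown in this thread.

```python
from typing import Callable, List, NamedTuple, Tuple


class Options(NamedTuple):
    labels: List[str]
    hidden: int  # matches that are not shown; the UI can ask the user to refine the filter


def dropdown_options(
    search: Callable[[str, int], Tuple[List[str], int]],
    text: str = "",
    limit: int = 50,
) -> Options:
    """Fetch at most `limit` entries for the current filter text.

    `search(query, limit)` must return (items, total_matches).
    """
    items, total = search(text.strip(), limit)
    return Options(labels=list(items), hidden=max(total - len(items), 0))


def fake_search(query: str, limit: int) -> Tuple[List[str], int]:
    # Stand-in for the paginated API: 5000 users, filtered by substring.
    names = [f"user{i:04d}" for i in range(5000)]
    matches = [n for n in names if query.lower() in n]
    return matches[:limit], len(matches)


# Usage: opening the drop-down shows 50 entries; typing narrows the result.
opened = dropdown_options(fake_search)
filtered = dropdown_options(fake_search, "user49")
print(len(opened.labels), opened.hidden, len(filtered.labels), filtered.hidden)
```

Returning the number of hidden matches lets the drop-down display a hint such as "4950 more, type to filter" instead of silently truncating the list.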

What do you think?

May 30 '22 14:05 bernd

I agree with the solution suggested above by @bernd. I believe this is a common pattern for performantly handling collections of an indeterminate size.

I also think it is ok if multiple separate PRs are filed to fully resolve the issue. But, I do agree that to consider the issue resolved, we should implement this frontend solution. I also think there is still a considerable performance impact from loading all users. The enhancements already implemented in https://github.com/Graylog2/graylog2-server/pull/12743 should be quite helpful as well.

I checked, and did not see any existing user dropdown that utilizes this pattern. So, it would appear that a new implementation will be needed. It would be great if we could create a more general component that could be used for dropdowns containing potentially large lists.

May 31 '22 21:05 danotorrey

Following up on status of this issue. @patrickmann

Feb 06 '23 14:02 BBruce-Graylog

@BBruce-Graylog Paginated queries have been implemented for simple user dropdowns (email recipients). However, this isn't applicable for dropdowns that are populated with complex data, in particular for sharing with users and teams.

@ousmaneo is currently evaluating whether previously applied optimisations are good enough. Sharing with individual users is discouraged for large numbers of users, so we don't want to invest a lot into improving that scenario.

Feb 07 '23 07:02 patrickmann

@patrickmann Ok, just let us know what is decided so we can update the customer, since this issue has been open with Support since 4/2022.

Feb 07 '23 14:02 BBruce-Graylog

Hi @BBruce-Graylog, I did a small test for the sharing feature to compare the previous implementation and the change we already introduced in Graylog to improve all select boxes in Graylog when the number of entities is very high (> 1000). This applies to sharing entities with users too.

Here, I'm testing with roughly 4000 users in the instance. Without the change I mentioned above, this is the result when trying to share a search with a user:

The Select takes some time to open.
Scrolling is very slow, and the select list hangs for a while.

[Screen recording: Share-without-virtualize]

We see a significant improvement when we apply the changes that virtualize the Select once it hits 1000 users.

Select is faster to open
Scrolling is fast without lags

[Screen recording: Share-with-virtualize]
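For readers unfamiliar with list virtualization, the core of the technique is to render only the rows that are currently visible. The Python sketch below shows that index calculation in isolation. It is not the Graylog implementation: the function name, the fixed row height, and the overscan value are assumptions, and only the 1000-entry threshold comes from the comment above.

```python
def visible_rows(total, scroll_top, viewport_height, row_height,
                 overscan=3, threshold=1000):
    """Return the range of row indices that should be rendered.

    Below `threshold` entries every row is rendered; at or above it, only
    the rows inside the viewport plus a few `overscan` rows on each side.
    All sizes are in pixels, and every row is assumed to have the same height.
    """
    if total < threshold:
        return range(total)
    first = max(scroll_top // row_height - overscan, 0)
    # Ceiling division: index just past the last row touching the viewport.
    end = -(-(scroll_top + viewport_height) // row_height)
    return range(first, min(end + overscan, total))


# Usage: 4000 users, a 300px tall list with 30px rows, scrolled down 3000px.
rows = visible_rows(4000, scroll_top=3000, viewport_height=300, row_height=30)
print(rows.start, rows.stop, len(rows))
```

With these numbers only 16 of the 4000 rows exist in the DOM at any time, which is consistent with the faster opening and scrolling described above.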

Let me know if you have any questions.

Feb 15 '23 08:02 ousmaneo

@BBruce-Graylog I am closing out this issue, as there is no further action required.

The user dropdown list (in email notifications) has been redesigned and can handle an arbitrary number of users.
The user and team sharing list benefits from a general UI improvement (as of 5.0); performance for a reasonable number of users is good. For a very large number of users, we recommend disabling sharing with individual users.

Feb 15 '23 08:02 patrickmann
