LDAP user sync is really slow
Describe the bug The LDAP user sync is very slow, processing only about 1.3 users/s; in total the whole sync takes more than 2 hours. Syncing the groups and memberships is lightning fast in comparison.
To Reproduce Steps to reproduce the behavior:
- Set up an LDAP source with many users
- Start a sync (and potentially increase the timeout)
- Check the resulting duration in the system task overview
Expected behavior Other users report faster progress, e.g. 9 users/s in https://github.com/goauthentik/authentik/issues/6929#issue-1901008944. While that is still quite slow, it is almost an order of magnitude better than what we are currently getting.
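To put the two numbers side by side (the directory size below is an assumption inferred from the reported rate and duration, not a figure from this issue):

```python
# Back-of-the-envelope comparison of the two sync rates mentioned here.
# The 10,000-user directory is an assumed size for illustration; it is
# roughly what 1.3 users/s over "more than 2 hours" implies.
users = 10_000

for rate in (1.3, 9.0):  # users per second
    minutes = users / rate / 60
    print(f"{rate:>4} users/s -> {minutes:6.1f} minutes")

# 1.3 users/s ->  128.2 minutes (a bit over two hours)
#  9.0 users/s ->   18.5 minutes
```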
Version and Deployment (please complete the following information):
- authentik version: 2024.2.2
- Deployment: docker-compose
Additional context System utilization (CPU and IO) is normal during the sync and does not seem to be the bottleneck.
The initial sync is definitely slower than subsequent syncs; however, you can also speed this up by scaling up the number of workers you are running, as the sync gets parallelised across them.
This was not the first sync, though for some reason authentik still displays "Not synced yet." on the source overview. Could that somehow be the culprit?
Regarding multiple workers, there is https://github.com/goauthentik/authentik/issues/6929#issuecomment-1723632165, where you mentioned #6815, which is still open, so my assumption is that this does not yet work as expected?
There is a workaround for that, which I still need to update that issue with. You can run worker containers with -b to not run the scheduled tasks (so 1 worker with no args set and N workers with -b set), and with that you won't run into those issues.
Aside from that, with https://github.com/goauthentik/authentik/commit/f728bbb14b3a8abb6d2d67da69e4291d5e0a83da#diff-17304f637c355091282495601690ebf8e379affb23f8d3afe43f0ff230d1318bR194 there is now a lock on LDAP syncs, so that if one is running, other syncs can't start.
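The general shape of such a lock, as a generic Redis-based sketch (the key name and timeout are made up for illustration; this is not the code from the linked commit):

```python
# Generic "only one sync at a time" lock using Redis.
# Key name, timeout and host are placeholders, not authentik's actual values.
from redis import Redis

def run_sync():
    print("syncing...")  # stand-in for the actual LDAP sync

redis = Redis(host="localhost")
lock = redis.lock("ldap-sync:example-source", timeout=3600)

if lock.acquire(blocking=False):
    try:
        run_sync()
    finally:
        lock.release()
else:
    print("Another sync for this source is already running, skipping.")
```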
Still, it feels like almost a second per user is a really long time. I wonder what could lead to this slowdown.
> There is a workaround for that, which I still need to update that issue with. You can run worker containers with -b to not run the scheduled tasks (so 1 worker with no args set and N workers with -b set), and with that you won't run into those issues.
The LDAP server would have to support pagination for that to work, right? Our LDAP server does not support pagination, sadly
Ah yes, the scaling for LDAP sync does require pagination (which also explains why you're only seeing 1 task instead of 1 task for each 100 or so objects)
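For reference, server-side paging is what makes that per-page splitting possible. A minimal sketch using the ldap3 library; host, credentials, base DN and filter are placeholders, and this is not authentik's actual sync code:

```python
# RFC 2696 paged search with ldap3: entries arrive in pages of ~100,
# and each page could in principle become its own sync task.
# All connection details here are placeholders.
from ldap3 import Connection, Server, SUBTREE

conn = Connection(
    Server("ldap.example.com"),
    "cn=admin,dc=example,dc=com",
    "password",
    auto_bind=True,
)

entries = conn.extend.standard.paged_search(
    search_base="ou=users,dc=example,dc=com",
    search_filter="(objectClass=person)",
    search_scope=SUBTREE,
    attributes=["cn", "mail"],
    paged_size=100,   # objects per page, i.e. per batch/task
    generator=True,   # yield results lazily, page by page
)

for entry in entries:
    if entry.get("type") == "searchResEntry":
        print(entry["dn"])
```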
Part of the reason for the slowness is that when authentik can't distribute pages over different workers, all users/groups are iterated through serially: property mapping values are computed, then authentik tries to create/update the user, and lastly there are some checks for vendor-specific quirks.
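Sketched very roughly (all helper names below are illustrative stand-ins, not authentik's real functions), the serial path described above amounts to:

```python
# Illustrative sketch of the serial flow: every per-object step repeats
# for each of the N directory entries when the work cannot be split into
# page-sized tasks across workers. None of these helpers are real
# authentik functions.

def evaluate_mapping(mapping, entry):
    # Stand-in for rendering one property mapping against one LDAP entry.
    return {mapping: entry.get(mapping, "")}

def update_or_create_user(properties):
    # Stand-in for the database write (and any related lookups).
    return properties

def apply_vendor_quirks(user, entry):
    # Stand-in for the FreeIPA / Active Directory specific checks.
    return user

def sync_users_serially(ldap_entries, property_mappings):
    for entry in ldap_entries:                 # every object, one after another
        properties = {}
        for mapping in property_mappings:      # evaluated anew for each object
            properties.update(evaluate_mapping(mapping, entry))
        user = update_or_create_user(properties)
        apply_vendor_quirks(user, entry)

sync_users_serially([{"cn": "alice"}, {"cn": "bob"}], ["cn", "mail"])
```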
> Ah yes, the scaling for LDAP sync does require pagination (which also explains why you're only seeing 1 task instead of 1 task for each 100 or so objects)
Would paging support alone already make a difference? Can a single worker handle several tasks concurrently?
> Part of the reason for the slowness is that when authentik can't distribute pages over different workers, all users/groups are iterated through serially: property mapping values are computed, then authentik tries to create/update the user, and lastly there are some checks for vendor-specific quirks.
I already took a look at the code but could not find any reason why it would be that slow. Retrieving the users is rather quick; even without pagination our server returns all users within a few seconds. And the property mappings as well as the FreeIPA/AD check seem to only be simple attribute checks on a dictionary, so nothing Python couldn't handle thousands of times per second.
My guess would have been that there are some hidden database calls which are duplicated for every user even though they might be reusable, or that the expression evaluation is awfully slow and has to be parsed/compiled etc. anew for each user. When I have some spare time on my hands I will try the same setup but with basically all mappings removed, and when I have even more spare time I might set up pg_stat_statements to check on the former. Regardless, I would find it hard to believe that there is no avoidable performance issue in the current code.
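One lightweight way to test the duplicated-query theory, before setting up pg_stat_statements, would be Django's query capturing. A sketch, assuming it is run from a Django shell inside the authentik container; the loop body is a trivial stand-in for the real per-user work:

```python
# Count the SQL queries issued while handling a small batch of objects.
# CaptureQueriesContext is a standard Django test utility; the per-object
# work below is a trivial stand-in, not authentik's actual sync code.
from django.db import connection
from django.test.utils import CaptureQueriesContext

from authentik.core.models import User  # authentik's user model

usernames = ["alice", "bob", "carol"]  # stand-in "batch" of LDAP objects

with CaptureQueriesContext(connection) as ctx:
    for username in usernames:
        # One ORM call per object; the real sync does several of these.
        User.objects.filter(username=username).exists()

print(f"{len(ctx.captured_queries)} queries for {len(usernames)} objects")
# If the count grows linearly with the batch size even for data that could
# be fetched once (e.g. the property mappings), that is avoidable overhead.
```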
Yeah, that is true: looking through the code, fetching the property mappings is not cached and is done for each object. Similarly, compiling the Python in the expression is also not cached.
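Just to illustrate the pattern (this is not authentik's code): compiling each expression once and reusing the code object for every user would avoid the repeated parse/compile cost.

```python
# Illustrative sketch of caching compiled property-mapping expressions:
# compile each expression once, evaluate it many times. Not authentik's
# actual implementation, just the pattern described above.
compiled_cache = {}

def get_compiled(mapping_pk, expression):
    # compile() is the expensive part when repeated per object;
    # keyed on the mapping's primary key it only runs once per sync.
    if mapping_pk not in compiled_cache:
        compiled_cache[mapping_pk] = compile(expression, f"mapping-{mapping_pk}", "exec")
    return compiled_cache[mapping_pk]

def evaluate(mapping_pk, expression, ldap_entry):
    namespace = {"ldap": ldap_entry, "result": None}
    exec(get_compiled(mapping_pk, expression), namespace)
    return namespace["result"]

# The same expression evaluated for many users is only compiled once:
expr = "result = {'username': ldap.get('sAMAccountName', '').lower()}"
for entry in [{"sAMAccountName": "Alice"}, {"sAMAccountName": "Bob"}]:
    print(evaluate(1, expr, entry))
```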
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This doesn't seem to be fixed. Can it be reopened?