DSpace Optimize SQL query in findByEPerson to enhance performance

Description

It updates the findByEPerson method in the GroupDAOImpl class to improve SQL query performance. The previous HQL statement has been rewritten to use a more efficient JOIN query when retrieving groups linked to a specific EPerson.

This issue was discovered recently during a period of high user activity in an instance with a large number of users and groups, where multiple concurrent connections were observed. While monitoring the database, I noticed many active and slow-running queries similar to the following:

select g1_0.uuid, g1_1.uuid, g1_0.eperson_group_id, g1_0.name, g1_0.permanent
from public.epersongroup g1_0
join public.dspaceobject g1_1 on g1_0.uuid = g1_1.uuid
where (
    select e1_0.uuid
    from public.eperson e1_0
    join public.dspaceobject e1_1 on e1_0.uuid = e1_1.uuid
    where e1_0.uuid = $1
) in (
    select e2_0.eperson_id
    from public.epersongroup2eperson e2_0
    join (public.eperson e2_1
          join public.dspaceobject e2_2 on e2_1.uuid = e2_2.uuid)
      on e2_1.uuid = e2_0.eperson_id
    where g1_0.uuid = e2_0.eperson_group_id
)

These subquery-based statements caused unnecessary overhead and contributed to poor performance during peak load.

Instructions for Reviewers

List of changes in this PR:

Replaced the old subquery-based HQL:

from Group where (from EPerson e where e.id = :eperson_id) in elements(epeople)

with a more performant join-based query:

select distinct g from Group g join g.epeople ep where ep.id = :eperson_id

Checklist

[x] My PR is created against the main branch of code (unless it is a backport or is fixing an issue specific to an older branch).
[x] My PR is small in size (e.g. less than 1,000 lines of code, not including comments & integration tests). Exceptions may be made if previously agreed upon.
[x] My PR passes Checkstyle validation based on the Code Style Guide.
[x] My PR includes Javadoc for all new (or modified) public methods and classes. It also includes Javadoc for large or complex private methods.
[ ] My PR passes all tests and includes new/updated Unit or Integration Tests based on the Code Testing Guide.
[ ] My PR includes details on how to test it. I've provided clear instructions to reviewers on how to successfully test this fix or feature.
[x] If my PR includes new libraries/dependencies (in any pom.xml), I've made sure their licenses align with the DSpace BSD License based on the Licensing of Contributions documentation.
[x] If my PR modifies REST API endpoints, I've opened a separate REST Contract PR related to this change.
[x] If my PR includes new configurations, I've provided basic technical documentation in the PR itself.
[x] If my PR fixes an issue ticket, I've linked them together.

Oct 29 '25 14:10 toniprieto

Thanks @toniprieto. This sounds reasonable. Is this testable in any way? We also have a large number of users and groups, so it would be interesting if we could quantify the before and after. I don't know if it's possible to run HQL directly...

Oct 30 '25 06:10 alanorth

Great @alanorth ! I was just looking into the best way to reproduce this for testing. By enabling Hibernate’s show_sql option, I can see that the new query being executed is:

select distinct g1_0.uuid, g1_1.uuid, g1_0.eperson_group_id, g1_0.name, g1_0.permanent 
from public.epersongroup g1_0 
join public.dspaceobject g1_1 on g1_0.uuid = g1_1.uuid 
join public.epersongroup2eperson e1_0 on g1_0.uuid = e1_0.eperson_group_id 
where e1_0.eperson_id = ?

The current query is the one present in the ticket description. Using the UUID of a user who belongs to multiple groups, you should be able to test it.

This function is usually called before running a Solr search with a non-admin user. In a test instance, it’s hard to notice a big improvement because the query tends to be cached. However, I’ve observed a clearer improvement when multiple users are logging in and the system is under higher load.

Oct 30 '25 14:10 toniprieto

@toniprieto thanks. I tested both queries with a user who belongs to 307 groups and the result set was the same, though I don't think it's easy to test the performance difference.

Nov 04 '25 06:11 alanorth