SORMAS-Project icon indicating copy to clipboard operation
SORMAS-Project copied to clipboard

Improve performance of PersonFacadeEjb.getIndexList [5]

Open syntakker opened this issue 2 years ago • 1 comments

Problem Description

This summarizes the problem for getIndexList methods in general, not only for Persons.

For large datasets, some of the getIndexList methods are slow, performance should be improved here.

Dataset:

  • 1225343 persons
  • 85677 cases
  • 114782 contacts
  • 236583 tasks
  • 2600890 immunizations
  • 1991 vaccinations
  • 15693 samples
  • 11846 events
  • 534 eventparticipants

Result of a performance analysis: image

The runtime is mainly due to long running SQL queries, optimization is needed here.

Some preliminary insights:

  • in the following query from PersonFacadeEjb.getIndexList, runtime is reduced by facto 0.5 when omitting the joins on district, subselect and case statement. Since the query is limited to 100 result, a viable solution might be to select Ids /Uuids first and fill the remaining data after for the returned entities. This, however, with respect to the properties possibly being used a sorting criteria.
    select
        distinct person0_.uuid as col_0_0_,
        person0_.firstName as col_1_0_,
        person0_.lastName as col_2_0_,
        person0_.approximateAge as col_3_0_,
        person0_.approximateAgeType as col_4_0_,
        person0_.birthdate_dd as col_5_0_,
        person0_.birthdate_mm as col_6_0_,
        person0_.birthdate_yyyy as col_7_0_,
        person0_.sex as col_8_0_,
        district2_.name as col_9_0_,
        location1_.street as col_10_0_,
        location1_.houseNumber as col_11_0_,
        location1_.postalCode as col_12_0_,
        location1_.city as col_13_0_,
        (select
            personcont12_.contactInformation 
        from
            PersonContactDetail personcont12_ 
        where
            personcont12_.person_id=person0_.id 
            and personcont12_.primaryContact=true 
            and personcont12_.personContactDetailType='PHONE') as col_14_0_,
        (select
            personcont13_.contactInformation 
        from
            PersonContactDetail personcont13_ 
        where
            personcont13_.person_id=person0_.id 
            and personcont13_.primaryContact=true 
            and personcont13_.personContactDetailType='EMAIL') as col_15_0_,
        person0_.changeDate as col_16_0_,
        case 
            when 1=1 
            or (cases5_.reportingUser_id is not null) 
            and cases5_.reportingUser_id=8113961 
            or 1=1 
            or (contacts6_.reportingUser_id is not null) 
            and contacts6_.reportingUser_id=8113961 
            or 1=1 
            or 1=1 
            or (eventparti8_.reportingUser_id is not null) 
            and eventparti8_.reportingUser_id=8113961 
            or 1=1 
            or (travelentr9_.reportingUser_id is not null) 
            and travelentr9_.reportingUser_id=8113961 
            or 1=1 
            or (immunizati11_.reportingUser_id is not null) 
            and immunizati11_.reportingUser_id=8113961 
            or 1=1 then true 
            else false 
        end as col_17_0_
    from
        Person person0_ 
    left outer join
        Location location1_ 
            on person0_.address_id=location1_.id 
    left outer join
        District district2_ 
            on location1_.district_id=district2_.id 
    left outer join
        Region region3_ 
            on location1_.region_id=region3_.id 
    left outer join
        Community community4_ 
            on location1_.community_id=community4_.id 
    inner join
        cases cases5_ 
            on person0_.id=cases5_.person_id 
    left outer join
        contact contacts6_ 
            on person0_.id=contacts6_.person_id 
    left outer join
        cases case7_ 
            on contacts6_.caze_id=case7_.id 
    left outer join
        EventParticipant eventparti8_ 
            on person0_.id=eventparti8_.person_id 
    left outer join
        travelentry travelentr9_ 
            on person0_.id=travelentr9_.person_id 
    left outer join
        cases case10_ 
            on travelentr9_.resultingCase_id=case10_.id 
    left outer join
        immunization immunizati11_ 
            on person0_.id=immunizati11_.person_id 
    where
        cases5_.deleted=false 
    order by
        person0_.changeDate desc limit 100;
  • for DirectoryImmunizationService.getIndexList a similarly drastic reduction of runtimes can be observed by ommiting joins (e.g., district)
  • it is notable that getIndexList methods are performing reasonably for CaseFacadeEjb and ContactFacadeEjb, which deserves a closer look when working on PersonFacadeEjb.getIndexList, TaskFacadeEjb.getIndexList, DirectoryImmunizationService.getIndexList and the like
  • adding database indices in relevant places should be considered

Proposed Change

We will start by improving the performance of the PersonFacadeEjb.getIndexList method. If that works out, we should create separate issues to improve the performance for cases, contacts, events, event participants, samples, tasks, immunizations and travel entries (so basically for every entity that has a directory view).

#8946 has proven that by splitting the query into fetching the sorted ids first and afterwards doing another query to get the needed data, we can improve the performance a lot, because the database does not have to fetch and sort the whole table to then only return a limited number of results.

Before starting this make sure to get a performance test database and

  • [ ] Create an explain analyze of the current query execute by getIndexList following this guide (you can also get the SQL by enabling the output in the persistence.xml). The annoying part here is to replace the params

The current implementation then should be changed as follows:

  • [ ] 1. Rewrite PersonFacadeEjb.getIndexList to first fetch the needed ids (see pattern in PersonService.getAllAfter), then fetch the IndexDto by id with IN-clause. https://github.com/hzi-braunschweig/SORMAS-Project/issues/8946
  • [ ] 2. Add indices for sorting by first name and by last name. Adding indices for sorting by other fields probably doesn't make a lot of sense.

Finally:

  • [ ] Measure the query performance again (this time it will obviously be two queries).
  • [ ] If this works out as expected get in contact with @StefanKock or @MartinWahnschaffe to refine #8938 for the other entities.

Acceptance Criteria

  • PersonFacadeEjb.getIndexList returns the same results as before
  • The performance of the method is improved

Implementation Details

Additional Information

syntakker avatar Apr 25 '22 06:04 syntakker

Further instrumentation shows that pseudonymization, suspected to be responsible for slow response, can likely be ruled out as a relevant cause. This is also confirmed by the analysis of #9193, where large numbers of entities are pseudonymized. Here, processing of 85267 case entities by DtoPseudonymizer.pseudonymizeDtoCollection does contribute ~18 sec of runtime, which however seems acceptable for the number of entities and contrasts to an overall runtime of > 10 min for the overall method.

To be found, however, are long running SQL queries with massive joins.

For ImmunizationFacadeEjb.getIndexList: image

SQL query:

    select
        distinct directoryi0_.uuid as col_0_0_,
        person1_.uuid as col_1_0_,
        person1_.firstName as col_2_0_,
        person1_.lastName as col_3_0_,
        directoryi0_.disease as col_4_0_,
        person1_.approximateAge as col_5_0_,
        person1_.approximateAgeType as col_6_0_,
        person1_.birthdate_dd as col_7_0_,
        person1_.birthdate_mm as col_8_0_,
        person1_.birthdate_yyyy as col_9_0_,
        person1_.sex as col_10_0_,
        district3_.name as col_11_0_,
        directoryi0_.meansOfImmunization as col_12_0_,
        directoryi0_.immunizationManagementStatus as col_13_0_,
        directoryi0_.immunizationStatus as col_14_0_,
        directoryi0_.startDate as col_15_0_,
        directoryi0_.endDate as col_16_0_,
        lastvaccin4_.vaccineType as col_17_0_,
        directoryi0_.recoveryDate as col_18_0_,
        case 
            when (directoryi0_.reportingUser_id is not null) 
            and directoryi0_.reportingUser_id=8113961 
            or 1=1 then true 
            else false 
        end as col_19_0_,
        directoryi0_.changeDate as col_20_0_ 
    from
        immunization directoryi0_ 
    left outer join
        Person person1_ 
            on directoryi0_.person_id=person1_.id 
    left outer join
        Location location2_ 
            on person1_.address_id=location2_.id 
    left outer join
        District district3_ 
            on location2_.district_id=district3_.id 
    left outer join
        (
            SELECT
                v.immunization_id,
                vaccinetype 
            FROM
                vaccination v 
            INNER JOIN
                (
                    SELECT
                        immunization_id,
                        MAX(vaccinationdate) maxdate 
                    FROM
                        vaccination 
                    GROUP BY
                        immunization_id
                ) maxdates 
                    ON v.immunization_id=maxdates.immunization_id 
                    AND v.vaccinationdate=maxdates.maxdate 
                ) lastvaccin4_ 
                    on directoryi0_.id=lastvaccin4_.immunization_id 
            left outer join
                Region region5_ 
                    on directoryi0_.responsibleRegion_id=region5_.id 
            left outer join
                District district6_ 
                    on directoryi0_.responsibleDistrict_id=district6_.id 
            left outer join
                Community community7_ 
                    on directoryi0_.responsibleCommunity_id=community7_.id 
            left outer join
                Facility facility8_ 
                    on directoryi0_.healthFacility_id=facility8_.id 
            where
                (
                    (
                        directoryi0_.reportingUser_id is not null
                    ) 
                    and directoryi0_.reportingUser_id=8113961 
                    or 1=1
                ) 
                and directoryi0_.deleted=false 
            order by
                directoryi0_.changeDate desc limit 100

syntakker avatar May 16 '22 06:05 syntakker

To enhance the analysis of SQL queries, the script SORMAS-Project/sormas-backend/src/test/resources/performance/sql_log.sh is used to capture logging from the SORMAS Postgres container, including automated parameter replacement and query prefixing with EXPLAIN.

Results for long running queries: explainSql.zip

  • preloading of ids, as proposed in the issue description, is not feasible, as entity attributes have to be available for sorting
  • a relevant cost factor is sorting
  • for PersonFacadeEjb.getIndexList, CaseFacadeEjb.getIndexPage and ContactFacadeEjb.getIndexPage: an improvement is achieved by replacing subqueries for phone and email by joins on PersonContactDetail (for PersonFacadeEjb.getIndexList the main query's time was reduced from 4s 195ms to 1s 578ms)
  • for ImmunizationFacadeEjb.getIndexList: the dataset used for analysis has 2,600,890 entries for immunizations. These are propagated over several joins without reducing the result that take around 1s 30ms each. A solution is not obvious here, since the result set is only reduced after sorting with limit, for which attributes from the joined tables may be necessary. The poor performance seems also to be connected to the use of subselects in the classes FirstVaccinationDate, LastVaccineType and LastVaccinationDate which are attributes of DirectoryImmunization
  • in several places postgres uses a Sequence Scan where an index scan might perform better, e.g., when joining events and location for sample counts (see files explain_samples_count*) or eventparticipant and person for he samples index list (see explain_samples_count.*). Indices on foreign keys are present, the reason for the use of Sequence Scans in these places is subject to further investigation.

syntakker avatar Nov 02 '22 07:11 syntakker

@syntakker and I discussed his first approach (PR #10608).

  1. He found that fetching for phone number and email address is causing delays and can probably be improved. But this was not in scope of this ticket.
  2. It was not in scope of this ticket to touch other CoreAdo queries, only work on Person as proof of concept.
  3. The intention of the ticket was to avoid the costly DISTINCT on all selected attributes. We discussed a possible approach with the following two step select.
-- OLD 
-- sorted by changeDate but possibly other attributes given by PersonCriteria
-- Problem: DISTINCT requires sorting on all selected attributes
SELECT DISTINCT * FROM ...(complex generated query) ORDER BY changeDate LIMIT 100 OFFSET 1000;

-- NEW
-- 1. Find person ids of the persons in the page to show (adhere the sorting).
SELECT DISTINCT person.id FROM (SELECT * FROM ...(complex generated query) ORDER BY changeDate) AS sub LIMIT 100 OFFSET 1000;

-- 2. Select attributes into the DTOs, keep sorting
SELECT * FROM ...(complex generated query) WHERE person.id IN (idList); --> (mimic BaseAdoService.getByIds for DTOs, but keep the sorting given by the idList).

StefanKock avatar Nov 02 '22 10:11 StefanKock

Sound like this is worth a try. Please double check the results with "EXPLAIN ANALYZE" and https://explain.dalibo.com/

MartinWahnschaffe avatar Nov 02 '22 13:11 MartinWahnschaffe

-- 1. Find person ids of the persons in the page to show (adhere the sorting). SELECT DISTINCT person.id FROM (SELECT * FROM ...(complex generated query) ORDER BY changeDate) AS sub LIMIT 100 OFFSET 1000;

-- 2. Select attributes into the DTOs, keep sorting SELECT * FROM ...(complex generated query) WHERE person.id IN (idList); --> (mimic BaseAdoService.getByIds for DTOs, but keep the sorting given by the idList).

There is no easy solution for the first query. Problems:

  • the Criteria API does not support multiselect for subqueries
  • this could be tackled by either handling duplicate Ids Java side, which however interferes with offset an limit. Possible shifts in offset and limit are hard to track and error prone.
  • alternatively, the Criteria API can be skipped altogether and a raw SQL query can be used. This is somewhat clumsy, however

#10875 implements an alternative, where Ids of index list persons are read in a first query, omitting all columns but those needed for sorting and thus to handle 'distinct'. So the query to fetch Ids reduces columns to 'id' plus - by default - 'changedate' or - if sort properties era passed - the columns of these sort properties.

Testing on the dataset used so far, the query to read Ids of persons associated with cases takes ~230ms, fetching the actual data takes ~20ms (execution time before the refactoring: ~1sec 600ms).

Queries and exemplary analysis using explain are provided here: explainSqlPrefetchIds.zip

syntakker avatar Nov 09 '22 12:11 syntakker