SORMAS-Project
Improve performance of PersonFacadeEjb.getIndexList [5]
Problem Description
This summarizes the problem for getIndexList methods in general, not only for Persons.
For large datasets, some of the getIndexList methods are slow; their performance should be improved.
Dataset:
- 1225343 persons
- 85677 cases
- 114782 contacts
- 236583 tasks
- 2600890 immunizations
- 1991 vaccinations
- 15693 samples
- 11846 events
- 534 eventparticipants
Result of a performance analysis:
The runtime is mainly spent in long-running SQL queries; optimization is needed here.
Some preliminary insights:
- in the following query from PersonFacadeEjb.getIndexList, the runtime is roughly halved when omitting the joins on district, the subselects and the case statement. Since the query is limited to 100 results, a viable solution might be to select the ids/uuids first and fill in the remaining data afterwards for the returned entities. This must, however, respect the properties possibly being used as sorting criteria.
select
distinct person0_.uuid as col_0_0_,
person0_.firstName as col_1_0_,
person0_.lastName as col_2_0_,
person0_.approximateAge as col_3_0_,
person0_.approximateAgeType as col_4_0_,
person0_.birthdate_dd as col_5_0_,
person0_.birthdate_mm as col_6_0_,
person0_.birthdate_yyyy as col_7_0_,
person0_.sex as col_8_0_,
district2_.name as col_9_0_,
location1_.street as col_10_0_,
location1_.houseNumber as col_11_0_,
location1_.postalCode as col_12_0_,
location1_.city as col_13_0_,
(select
personcont12_.contactInformation
from
PersonContactDetail personcont12_
where
personcont12_.person_id=person0_.id
and personcont12_.primaryContact=true
and personcont12_.personContactDetailType='PHONE') as col_14_0_,
(select
personcont13_.contactInformation
from
PersonContactDetail personcont13_
where
personcont13_.person_id=person0_.id
and personcont13_.primaryContact=true
and personcont13_.personContactDetailType='EMAIL') as col_15_0_,
person0_.changeDate as col_16_0_,
case
when 1=1
or (cases5_.reportingUser_id is not null)
and cases5_.reportingUser_id=8113961
or 1=1
or (contacts6_.reportingUser_id is not null)
and contacts6_.reportingUser_id=8113961
or 1=1
or 1=1
or (eventparti8_.reportingUser_id is not null)
and eventparti8_.reportingUser_id=8113961
or 1=1
or (travelentr9_.reportingUser_id is not null)
and travelentr9_.reportingUser_id=8113961
or 1=1
or (immunizati11_.reportingUser_id is not null)
and immunizati11_.reportingUser_id=8113961
or 1=1 then true
else false
end as col_17_0_
from
Person person0_
left outer join
Location location1_
on person0_.address_id=location1_.id
left outer join
District district2_
on location1_.district_id=district2_.id
left outer join
Region region3_
on location1_.region_id=region3_.id
left outer join
Community community4_
on location1_.community_id=community4_.id
inner join
cases cases5_
on person0_.id=cases5_.person_id
left outer join
contact contacts6_
on person0_.id=contacts6_.person_id
left outer join
cases case7_
on contacts6_.caze_id=case7_.id
left outer join
EventParticipant eventparti8_
on person0_.id=eventparti8_.person_id
left outer join
travelentry travelentr9_
on person0_.id=travelentr9_.person_id
left outer join
cases case10_
on travelentr9_.resultingCase_id=case10_.id
left outer join
immunization immunizati11_
on person0_.id=immunizati11_.person_id
where
cases5_.deleted=false
order by
person0_.changeDate desc limit 100;
- for DirectoryImmunizationService.getIndexList, a similarly drastic reduction of runtimes can be observed by omitting joins (e.g., district)
- it is notable that the getIndexList methods perform reasonably for CaseFacadeEjb and ContactFacadeEjb, which deserves a closer look when working on PersonFacadeEjb.getIndexList, TaskFacadeEjb.getIndexList, DirectoryImmunizationService.getIndexList and the like
- adding database indices in relevant places should be considered
Proposed Change
We will start by improving the performance of the PersonFacadeEjb.getIndexList method. If that works out, we should create separate issues to improve the performance for cases, contacts, events, event participants, samples, tasks, immunizations and travel entries (so basically for every entity that has a directory view).
#8946 has proven that by splitting the query into fetching the sorted ids first and afterwards doing another query to get the needed data, we can improve the performance a lot, because the database does not have to fetch and sort the whole table to then only return a limited number of results.
Before starting this make sure to get a performance test database and
- [ ] Create an explain analyze of the current query executed by getIndexList following this guide (you can also get the SQL by enabling the output in the persistence.xml). The annoying part here is replacing the parameters
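As an aside, substituting the logged parameter values back into a captured statement can be scripted. A minimal sketch (an assumption, not part of the project; it presumes PostgreSQL-style positional placeholders `$1`, `$2`, ... as they appear in the Postgres statement log):

```java
import java.util.List;

public class ParamSubstituter {

    // Replace PostgreSQL-style positional placeholders ($1, $2, ...) with
    // the literal values captured from the "DETAIL: parameters:" log line.
    static String substitute(String sql, List<String> values) {
        // Iterate from the highest index down so "$12" is not
        // partially matched by the replacement for "$1".
        for (int i = values.size(); i >= 1; i--) {
            sql = sql.replace("$" + i, values.get(i - 1));
        }
        return sql;
    }

    public static void main(String[] args) {
        String sql = "select * from person where id = $1 and deleted = $2";
        // prints: select * from person where id = 8113961 and deleted = false
        System.out.println(substitute(sql, List.of("8113961", "false")));
    }
}
```

The resulting statement can then be prefixed with EXPLAIN ANALYZE and run directly in psql.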
The current implementation then should be changed as follows:
- [ ] 1. Rewrite PersonFacadeEjb.getIndexList to first fetch the needed ids (see pattern in PersonService.getAllAfter), then fetch the IndexDto by id with IN-clause. https://github.com/hzi-braunschweig/SORMAS-Project/issues/8946
- [ ] 2. Add indices for sorting by first name and by last name. Adding indices for sorting by other fields probably doesn't make a lot of sense.
Finally:
- [ ] Measure the query performance again (this time it will obviously be two queries).
- [ ] If this works out as expected get in contact with @StefanKock or @MartinWahnschaffe to refine #8938 for the other entities.
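One detail of the two-step fetch described above: the second query (the IN-clause fetch) returns rows in arbitrary order, so the DTOs have to be re-sorted to match the id page from the first query. A self-contained sketch of that reordering step (the PersonRow record is a hypothetical stand-in for the real index DTOs):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class IndexListReorder {

    // Restore the ordering established by the first (id-only) query after
    // the second query fetched the DTOs by IN-clause in arbitrary order.
    static <T, I> List<T> sortByIdList(List<I> orderedIds, List<T> dtos, Function<T, I> idOf) {
        Map<I, T> byId = new HashMap<>();
        for (T dto : dtos) {
            byId.put(idOf.apply(dto), dto);
        }
        List<T> result = new ArrayList<>(orderedIds.size());
        for (I id : orderedIds) {
            T dto = byId.get(id);
            if (dto != null) {  // ids with no matching row are skipped
                result.add(dto);
            }
        }
        return result;
    }

    record PersonRow(long id, String lastName) {}

    public static void main(String[] args) {
        List<Long> idPage = List.of(3L, 1L, 2L);           // order from query 1
        List<PersonRow> fetched = List.of(                 // arbitrary order from query 2
            new PersonRow(1, "Alpha"), new PersonRow(2, "Beta"), new PersonRow(3, "Gamma"));
        System.out.println(sortByIdList(idPage, fetched, PersonRow::id));
    }
}
```

This mirrors what a getByIds-style fetch would need in order to keep the sorting of the id list; the actual implementation in the codebase may look different.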
Acceptance Criteria
- PersonFacadeEjb.getIndexList returns the same results as before
- The performance of the method is improved
Implementation Details
Additional Information
Further instrumentation shows that pseudonymization, suspected to be responsible for the slow response, can likely be ruled out as a relevant cause. This is also confirmed by the analysis of #9193, where large numbers of entities are pseudonymized: there, processing of 85267 case entities by DtoPseudonymizer.pseudonymizeDtoCollection contributes ~18 s of runtime, which seems acceptable for the number of entities and contrasts with an overall runtime of more than 10 min for the whole method.
What stands out instead are long-running SQL queries with massive joins.
For ImmunizationFacadeEjb.getIndexList, the SQL query is:
select
distinct directoryi0_.uuid as col_0_0_,
person1_.uuid as col_1_0_,
person1_.firstName as col_2_0_,
person1_.lastName as col_3_0_,
directoryi0_.disease as col_4_0_,
person1_.approximateAge as col_5_0_,
person1_.approximateAgeType as col_6_0_,
person1_.birthdate_dd as col_7_0_,
person1_.birthdate_mm as col_8_0_,
person1_.birthdate_yyyy as col_9_0_,
person1_.sex as col_10_0_,
district3_.name as col_11_0_,
directoryi0_.meansOfImmunization as col_12_0_,
directoryi0_.immunizationManagementStatus as col_13_0_,
directoryi0_.immunizationStatus as col_14_0_,
directoryi0_.startDate as col_15_0_,
directoryi0_.endDate as col_16_0_,
lastvaccin4_.vaccineType as col_17_0_,
directoryi0_.recoveryDate as col_18_0_,
case
when (directoryi0_.reportingUser_id is not null)
and directoryi0_.reportingUser_id=8113961
or 1=1 then true
else false
end as col_19_0_,
directoryi0_.changeDate as col_20_0_
from
immunization directoryi0_
left outer join
Person person1_
on directoryi0_.person_id=person1_.id
left outer join
Location location2_
on person1_.address_id=location2_.id
left outer join
District district3_
on location2_.district_id=district3_.id
left outer join
(
SELECT
v.immunization_id,
vaccinetype
FROM
vaccination v
INNER JOIN
(
SELECT
immunization_id,
MAX(vaccinationdate) maxdate
FROM
vaccination
GROUP BY
immunization_id
) maxdates
ON v.immunization_id=maxdates.immunization_id
AND v.vaccinationdate=maxdates.maxdate
) lastvaccin4_
on directoryi0_.id=lastvaccin4_.immunization_id
left outer join
Region region5_
on directoryi0_.responsibleRegion_id=region5_.id
left outer join
District district6_
on directoryi0_.responsibleDistrict_id=district6_.id
left outer join
Community community7_
on directoryi0_.responsibleCommunity_id=community7_.id
left outer join
Facility facility8_
on directoryi0_.healthFacility_id=facility8_.id
where
(
(
directoryi0_.reportingUser_id is not null
)
and directoryi0_.reportingUser_id=8113961
or 1=1
)
and directoryi0_.deleted=false
order by
directoryi0_.changeDate desc limit 100
To enhance the analysis of SQL queries, the script SORMAS-Project/sormas-backend/src/test/resources/performance/sql_log.sh is used to capture logging from the SORMAS Postgres container, including automated parameter replacement and prefixing the queries with EXPLAIN.
Results for long running queries: explainSql.zip
- preloading of ids, as proposed in the issue description, is not feasible, as entity attributes have to be available for sorting
- a relevant cost factor is sorting
- for PersonFacadeEjb.getIndexList, CaseFacadeEjb.getIndexPage and ContactFacadeEjb.getIndexPage: an improvement is achieved by replacing the subqueries for phone and email with joins on PersonContactDetail (for PersonFacadeEjb.getIndexList the main query's time was reduced from 4 s 195 ms to 1 s 578 ms)
- for ImmunizationFacadeEjb.getIndexList: the dataset used for analysis has 2,600,890 entries for immunizations. These are propagated over several joins, each taking around 1 s 30 ms, without reducing the result set. A solution is not obvious here, since the result set is only reduced after sorting with limit, for which attributes from the joined tables may be necessary. The poor performance also seems to be connected to the use of subselects in the classes FirstVaccinationDate, LastVaccineType and LastVaccinationDate, which are attributes of DirectoryImmunization
- in several places Postgres uses a sequence scan where an index scan might perform better, e.g., when joining events and location for the sample counts (see files explain_samples_count*) or eventparticipant and person for the samples index list (see explain_samples_count.*). Indices on foreign keys are present; the reason for the use of sequence scans in these places is subject to further investigation.
@syntakker and I discussed his first approach (PR #10608).
- He found that fetching the phone number and email address causes delays and can probably be improved, but this was not in scope of this ticket.
- It was not in scope of this ticket to touch other CoreAdo queries; the idea was to only work on Person as a proof of concept.
- The intention of the ticket was to avoid the costly DISTINCT on all selected attributes. We discussed a possible approach with the following two-step select.
-- OLD
-- sorted by changeDate but possibly other attributes given by PersonCriteria
-- Problem: DISTINCT requires sorting on all selected attributes
SELECT DISTINCT * FROM ...(complex generated query) ORDER BY changeDate LIMIT 100 OFFSET 1000;
-- NEW
-- 1. Find the person ids of the persons in the page to show (adhering to the sorting).
SELECT DISTINCT person.id FROM (SELECT * FROM ...(complex generated query) ORDER BY changeDate) AS sub LIMIT 100 OFFSET 1000;
-- 2. Select attributes into the DTOs, keep sorting
SELECT * FROM ...(complex generated query) WHERE person.id IN (idList); --> (mimic BaseAdoService.getByIds for DTOs, but keep the sorting given by the idList).
Sounds like this is worth a try. Please double-check the results with EXPLAIN ANALYZE and https://explain.dalibo.com/
There is no easy solution for the first query. Problems:
- the Criteria API does not support multiselect for subqueries
- this could be tackled by handling duplicate ids on the Java side, which however interferes with offset and limit. The resulting shifts in offset and limit are hard to track and error-prone.
- alternatively, the Criteria API can be skipped altogether and a raw SQL query used. This is somewhat clumsy, however.
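The offset/limit interference can be made concrete: if OFFSET/LIMIT is applied to a result that still contains duplicate ids (as produced by one-to-many joins), deduplicating afterwards on the Java side yields shifted and shrunken pages. A small illustration with hypothetical data:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class DedupPaging {

    // Deduplicate ids while preserving their first-occurrence order.
    static List<Long> dedup(List<Long> ids) {
        return new ArrayList<>(new LinkedHashSet<>(ids));
    }

    public static void main(String[] args) {
        // Ids as returned by a join; duplicates are caused by one-to-many joins.
        List<Long> raw = List.of(1L, 1L, 2L, 3L, 3L, 3L, 4L, 5L);

        // Correct: deduplicate first, then page (OFFSET 2, LIMIT 2 on distinct ids).
        List<Long> correctPage = dedup(raw).subList(2, 4);

        // Wrong: page first (OFFSET 2, LIMIT 2 on the raw rows), then deduplicate -
        // the page boundary shifts and the page may also shrink below the limit.
        List<Long> wrongPage = dedup(raw.subList(2, 4));

        // prints: [3, 4] vs [2, 3]
        System.out.println(correctPage + " vs " + wrongPage);
    }
}
```

This is why the deduplication has to happen inside the database, before OFFSET/LIMIT is applied, rather than in Java afterwards.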
#10875 implements an alternative where the ids of the index list persons are read in a first query that omits all columns except those needed for sorting, thereby keeping the 'distinct' manageable. The id-fetching query thus reduces the columns to 'id' plus, by default, 'changedate', or, if sort properties are passed, the columns of those sort properties.
Testing on the dataset used so far, the query to read the ids of persons associated with cases takes ~230 ms and fetching the actual data takes ~20 ms (execution time before the refactoring: ~1 s 600 ms).
Queries and an exemplary analysis using EXPLAIN are provided here: explainSqlPrefetchIds.zip