SORMAS-Project
SORMAS-Project copied to clipboard
Improve performance of getAllAfter queries into DTOs [8]
Problem Description
As shown by the following analysis, many getAllAfter
methods show an inperformant pattern:
- The
"Entity"Service.getAllAfter
method takes some seconds. As shown in https://github.com/hzi-braunschweig/SORMAS-Project/issues/8946#issuecomment-1129937176, this can be improved by initially fetching only the ids (reduced distinct effort) and using a dedicated index with appropriate sorting. -
"Entity"Service.inJurisdictionOrOwned
per each entity used for pseudonymization seems to be inperformant. For Cases it took ~330ms per entity, for Persons ~0,3ms per entity with in clause (PersonService.getInJurisdictionIDs
).
Analysis
Dataset:
- 1225343 persons
- 85677 cases
- 114782 contacts
- 236583 tasks
- 2600890 immunizations
- 1991 vaccinations
- 15693 samples
- 11846 events
- 534 eventparticipants
The following measurements were taken from backend logs (EJB methods) and the Postgres logs. Observations:
- the actual SQL queries are executed in few seconds while the EJB methods take minutes to complete
- a relevant contribution to runtime comes from
inJurisdictionOrOwned
methods, here mainly the number of calls - batch size is relevant. Some requests did not complete within reasonable time for a batch size of 10000, these were measured for batch size 1000.
Requests:
-
http://localhost:6080/sormas-rest/persons/all/1637090372005/10000/NO_LAST_SYNCED_UUID
: 3 min 5 sec server: 2 min 58 secSQL queries: duration: 948.716 ms duration: 2297.275 ms duration: 4012.220 ms duration: 4393.538 ms duration: 696.277 ms ~ 12.5 sec Note: SQL query times are inaccurate, either due to an error in the analysis or changes introduced later, see later analysis of this method.
-
http://localhost:6080/sormas-rest/cases/all/1637090372005/10000/NO_LAST_SYNCED_UUID
: 8 min 21 sec server: 8 min 18 secSQL queries: duration: 787.011 ms duration: 449.616 ms duration: 146.224 ms duration: 178.515 ms duration: 154.552 ms duration: 136.836 ms duration: 142.597 ms duration: 143.688 ms duration: 150.584 ms duration: 152.298 ms duration: 133.814 ms duration: 147.687 ms ~ 3 sec
-
http://localhost:6080/sormas-rest/contacts/all/1637090372005/10000/NO_LAST_SYNCED_UUID
: 35 sec server: 35 secSQL queries: duration: 151.413 ms duration: 143.237 ms duration: 152.661 ms duration: 144.054 ms duration: 145.374 ms duration: 155.861 ms duration: 177.212 ms duration: 153.647 ms duration: 201.516 ms duration: 187.382 ms ~ 1.6 sec
-
http://localhost:6080/sormas-rest/tasks/all/1637090372005/1000/NO_LAST_SYNCED_UUID
: 1 min 27 sec server: 1 min 24 secSQL queries: duration: 466.097 ms duration: 154.093 ms duration: 132.150 ms duration: 137.751 ms duration: 140.757 ms duration: 135.526 ms duration: 136.612 ms duration: 126.399 ms duration: 124.695 ms duration: 127.888 ms duration: 133.940 ms ~1.9 sec
-
http://localhost:6080/sormas-rest/samples/all/1637090372005/10000/NO_LAST_SYNCED_UUID
: 27 sec server: 27 secSQL queries: duration: 129.587 ms
-
http://localhost:6080/sormas-rest/immunizations/all/1637090372005/1000/NO_LAST_SYNCED_UUID
(1min 50 sec) server: 1 min 48 secSQL queries: duration: 6985.128 ms duration: 106.608 ms ~ 7 sec
Proposed Change
- [x] 1. Rewrite
AdoServiceWithUserFilter.getAllAfter
to first fetch the needed ids (see pattern inPersonService.getAllAfter
, then fetch the entities by id with IN-clause (useBaseAdoService.getByIds
). - [x] 2. Add indices for sorting.
- [x] 3. Use the pattern of
PersonService.getInJurisdictionIDs
to query by ids with IN clause also for other entities where"Entity"Service.inJurisdictionOrOwned
is currently running one query per Entity.
Acceptance Criteria
- [x] An analysis before/after shows the performance improvement (also mentioning the amount of existing and queried entities). -> comment
Implementation Details
-
All
getAllAfter
andgetInJurisdictionIDs
methods avoid parameter limit exception withIterableHelper.executeBatched
batching. -
Remove or adapt not with superclass aligned
getAllAfter
implementations in:- CampaignService.getAllAfter
- CampaignFormMetaService.getAllAfter
- ContactService.getAllAfter (active contacts filter)
- EventParticipantService.getAllAfter (not archived filter)
- EventService.getAllAfter (configured
EventUserFilterCriteria
) - ImmunizationService.getAllAfter
- BaseTravelEntryService.getAllAfter
-
Remove or adapt not with superclass aligned implementations parallel to
getAllAfter
:- CampaignFormDataService.getAllActiveAfter
- TreatmentService.getAllActiveTreatmentsAfter
- PathogenTestService.getAllActivePathogenTestsAfter
- PrescriptionService.getAllActivePrescriptionsAfter
- SampleService.getAllActiveSamplesAfter
- ActionService.getAllActionsAfter
- AdditionalTestService.getAllActiveAdditionalTestsAfter
- ClinicalVisitService.getAllActiveClinicalVisitsAfter
- TaskService.getAllActiveTasksAfter
- CaseService.getAllActiveCasesAfter with
boolean includeExtendedChangeDateFilters
that is never used outside of tests (introduced with #2059, partly removed by #2674).
Additional Information
- Sibling to #8946
- Batch jurisdiction processing for Persons was done in #2130 (commit 305b169).
- I tried to use CREATE INDEX CONCURRENTLY but that does not work within a transaction as it is used in the automatic db migration. If deployments get stuck because of INDEX creation (not sure what the likelyhood is), they can be created with
CREATE INDEX CONCURRENTLY IF NOT EXISTS ...
before on running instance. - This ticket improves the performance of the following functions:
- Sync for android-app (
getAllAfter
methods) for most to all entities. - List (
getIndexList
) for AdditionalTest, PathogenTest - List (
getByUuids
,getBy"Reference"/"criteria"
) for Case, Sample, Task, AdditionalTest, PathogenTest, Prescription, Treatment, Vaccination, Visit.
- Sync for android-app (
Performance Analysis
Preliminary analysis of performance before and after the changes introduced in PR #10282
(tested on dafd47df2f756277323e15d28a3c6f678b1864e6)
The analysis was conducted by measuring the runtime of REST calls to the respective endpoints with batch size 1000. A substantial improvement can be observed for all entities with long response times:
cases (1000) .................00:05:28 -> 00:01:31
contacts (1000) ..............00:01:07 -> 00:00:30
immunizations (1000) .........00:01:16 -> 00:00:12
samples (1000) ...............00:01:41 -> 00:00:20
tasks (1000) .................00:01:06 -> 00:00:18
visits (1000) ................00:00:27 -> 00:00:04
An exception is pathogen test with no improvement / slightly worse performance:
pathogentests (1000) .........00:01:15 -> 00:01:26
The results based on the dataset of the initial analysis (see issue description). Further measurements will be performed as needed (e.g., retesting of individual endpoints, larger batch sizes etc.).
Analysis data: performance_9320.zip
Detailed analysis
Full list of REST calls
Performance of REST calls (.../<entities>/all/
) before the refacoring:
actions (1000) ...............00:00:00
additionaltests (1000) .......00:00:01
aggregatereports (1000) ......00:00:00
areas (1000) .................00:00:00
campaigns (1000) .............00:00:00
cases (1000) .................00:05:28
clinicalvisits (1000) ........00:00:00
communities (1000) ...........00:00:00
contacts (1000) ..............00:01:07
continents (1000) ............00:00:00
countries (1000) .............00:00:01
customizableenumvalues (1000) 00:00:00
diseaseconfigurations (1000) .00:00:00
districts (1000) .............00:00:00
eventparticipants (1000) .....00:00:07
events (1000) ................00:00:04
featureconfigurations (1000) .00:00:00
immunizations (1000) .........00:01:16
pathogentests (1000) .........00:01:15
persons (1000) ...............00:00:07
pointsofentry (1000) .........00:00:00
prescriptions (1000) .........00:00:00
regions (1000) ...............00:00:00
samples (1000) ...............00:01:41
sharerequests (1000) .........00:00:00
subcontinents (1000) .........00:00:00
tasks (1000) .................00:01:06
treatments (1000) ............00:00:00
users (1000) .................00:00:00
userroles (1000) .............00:00:00
visits (1000) ................00:00:27
weeklyreports (1000) .........00:00:01
... and after:
actions (1000) ...............00:00:01
additionaltests (1000) .......00:00:01
aggregatereports (1000) ......00:00:00
areas (1000) .................00:00:00
campaigns (1000) .............00:00:01
cases (1000) .................00:01:31
clinicalvisits (1000) ........00:00:00
communities (1000) ...........00:00:00
contacts (1000) ..............00:00:30
continents (1000) ............00:00:00
countries (1000) .............00:00:00
customizableenumvalues (1000) 00:00:00
diseaseconfigurations (1000) .00:00:00
districts (1000) .............00:00:00
eventparticipants (1000) .....00:00:05
events (1000) ................00:00:02
featureconfigurations (1000) .00:00:00
immunizations (1000) .........00:00:12
pathogentests (1000) .........00:01:26
persons (1000) ...............00:00:08
pointsofentry (1000) .........00:00:00
prescriptions (1000) .........00:00:01
regions (1000) ...............00:00:00
samples (1000) ...............00:00:20
sharerequests (1000) .........00:00:00
subcontinents (1000) .........00:00:00
tasks (1000) .................00:00:18
treatments (1000) ............00:00:01
users (1000) .................00:00:00
userroles (1000) .............00:00:00
visits (1000) ................00:00:04
weeklyreports (1000) .........00:00:00
Execution details for selected entities
Cases
before:
after:
Here the calls to CaseService.inJurisdictionOrOwned
per selected case have been dropped, roughly accounting for the measurde difference.
Contacts
before:
after:
As for cases, a heavy reduction of calls to ContactService.inJurisdictionOrOwned
.
Immunizations
before:
after:
Here the main culprits were calls to ImmunizationService.inJurisdictionOrOwned
and FeatureConfigurationFacadeEjb.isPropertyValueTrue
Tasks
before:
after:
Here, calls to TaskService.inJurisdictionOrOwned
on a per object basis accounted for most of the time spent.
Analysis for different batch sizes
A analysis with different batch sizes after completion of this task shows:
- request durations are substantially improved (see times above)
- there is little overhead (long running SQL queries etc.) for small batch sizes: for batch size 1, no entity exceeds a response time of 1second
- for larger batch sizes, response time is approximately linear to the batch size
- this is consistent across all entities (response time for 1000
pathogentests
is down to 15 sec now)
Response times for selected entities:
cases
(1) ....................00:00:01
(10) ...................00:00:03
(100) ..................00:00:08
(500) ..................00:00:36
(1000) .................00:01:11
(5000) .................00:05:52
persons
(1) ..................00:00:01
(10) .................00:00:00
(100) ................00:00:01
(500) ................00:00:04
(1000) ...............00:00:07
(5000) ...............00:00:33
samples
(1) ..................00:00:01
(10) .................00:00:01
(100) ................00:00:02
(500) ................00:00:10
(1000) ...............00:00:17
(5000) ...............00:01:10
Analysis data: differentBatchSizes.zip
Hi there, I am not sure if this is related but since recently we are having issues to get all tasks created within the last 24 hours via /tasks/all/{since} No matter the size of the response e.g. even if only 1 task will be returned matching the time criteria. No response even for 5 minutes. The highest possible timespan that we got working on some instances is last 4 hours. Even that one takes a couple of minutes to respond. Let us know if you need more info around the issue and keep us posted about the outcome here.
Hi there, I am not sure if this is related but since recently we are having issues to get all tasks created within the last 24 hours via /tasks/all/{since}
Identifying the changed entities is what we want to improve. That includes also tasks.
Perfect! Are there already some concrete plans when can this one be expected to make its way to production?
Discussed current status and next steps with @HolgerReiseVSys and @syntakker. Results:
- Update feature branch to current development branch
- @syntakker will start with performance tests on cases, contacts, immunizations, tasks, samples as prechecks
- Integrate CaseService.getAllActiveCasesAfter as all the other sibling methods.
- Refactor
BaseAdoService.getAll
togetList
to be reusable with other filters, make it protected or leave public? - Indizes for all entities that implement AdoServiceWithUserFilter
Helper used to create sort indices: IndexBuilderHelper.xlsx
Preliminary analysis of performance before and after the changes introduced in PR #10282
(tested on dafd47df2f756277323e15d28a3c6f678b1864e6)
The analysis was conducted by measuring the runtime of REST calls to the respective endpoints with batch size 1000. A substantial improvement can be observed for all entities with long response times:
cases (1000) .................00:05:28 -> 00:01:31
contacts (1000) ..............00:01:07 -> 00:00:30
immunizations (1000) .........00:01:16 -> 00:00:12
samples (1000) ...............00:01:41 -> 00:00:20
tasks (1000) .................00:01:06 -> 00:00:18
visits (1000) ................00:00:27 -> 00:00:04
An exception is pathogen test with no improvement / slightly worse performance:
pathogentests (1000) .........00:01:15 -> 00:01:26
The results based on the dataset of the initial analysis (see issue description). Further measurements will be performed as needed (e.g., retesting of individual endpoints, larger batch sizes etc.).
Analysis data: performance_9320.zip
Detailed analysis
Full list of REST calls
Performance of REST calls (.../<entities>/all/
) before the refacoring:
actions (1000) ...............00:00:00
additionaltests (1000) .......00:00:01
aggregatereports (1000) ......00:00:00
areas (1000) .................00:00:00
campaigns (1000) .............00:00:00
cases (1000) .................00:05:28
clinicalvisits (1000) ........00:00:00
communities (1000) ...........00:00:00
contacts (1000) ..............00:01:07
continents (1000) ............00:00:00
countries (1000) .............00:00:01
customizableenumvalues (1000) 00:00:00
diseaseconfigurations (1000) .00:00:00
districts (1000) .............00:00:00
eventparticipants (1000) .....00:00:07
events (1000) ................00:00:04
featureconfigurations (1000) .00:00:00
immunizations (1000) .........00:01:16
pathogentests (1000) .........00:01:15
persons (1000) ...............00:00:07
pointsofentry (1000) .........00:00:00
prescriptions (1000) .........00:00:00
regions (1000) ...............00:00:00
samples (1000) ...............00:01:41
sharerequests (1000) .........00:00:00
subcontinents (1000) .........00:00:00
tasks (1000) .................00:01:06
treatments (1000) ............00:00:00
users (1000) .................00:00:00
userroles (1000) .............00:00:00
visits (1000) ................00:00:27
weeklyreports (1000) .........00:00:01
... and after:
actions (1000) ...............00:00:01
additionaltests (1000) .......00:00:01
aggregatereports (1000) ......00:00:00
areas (1000) .................00:00:00
campaigns (1000) .............00:00:01
cases (1000) .................00:01:31
clinicalvisits (1000) ........00:00:00
communities (1000) ...........00:00:00
contacts (1000) ..............00:00:30
continents (1000) ............00:00:00
countries (1000) .............00:00:00
customizableenumvalues (1000) 00:00:00
diseaseconfigurations (1000) .00:00:00
districts (1000) .............00:00:00
eventparticipants (1000) .....00:00:05
events (1000) ................00:00:02
featureconfigurations (1000) .00:00:00
immunizations (1000) .........00:00:12
pathogentests (1000) .........00:01:26
persons (1000) ...............00:00:08
pointsofentry (1000) .........00:00:00
prescriptions (1000) .........00:00:01
regions (1000) ...............00:00:00
samples (1000) ...............00:00:20
sharerequests (1000) .........00:00:00
subcontinents (1000) .........00:00:00
tasks (1000) .................00:00:18
treatments (1000) ............00:00:01
users (1000) .................00:00:00
userroles (1000) .............00:00:00
visits (1000) ................00:00:04
weeklyreports (1000) .........00:00:00
Execution details for selected entities
Cases
before:
after:
Here the calls to CaseService.inJurisdictionOrOwned
per selected case have been dropped, roughly accounting for the measurde difference.
Contacts
before:
after:
As for cases, a heavy reduction of calls to ContactService.inJurisdictionOrOwned
.
Immunizations
before:
after:
Here the main culprits were calls to ImmunizationService.inJurisdictionOrOwned
and FeatureConfigurationFacadeEjb.isPropertyValueTrue
Tasks
before:
after:
Here, calls to TaskService.inJurisdictionOrOwned
on a per object basis accounted for most of the time spent.
Changes on performance tooling still open