SORMAS-Project icon indicating copy to clipboard operation
SORMAS-Project copied to clipboard

Improve performance of getAllAfter queries into DTOs [8]

Open StefanKock opened this issue 2 years ago • 7 comments

Problem Description

As shown by the following analysis, many getAllAfter methods show an inperformant pattern:

  1. The "Entity"Service.getAllAfter method takes some seconds. As shown in https://github.com/hzi-braunschweig/SORMAS-Project/issues/8946#issuecomment-1129937176, this can be improved by initially fetching only the ids (reduced distinct effort) and using a dedicated index with appropriate sorting.
  2. "Entity"Service.inJurisdictionOrOwned per each entity used for pseudonymization seems to be inperformant. For Cases it took ~330ms per entity, for Persons ~0,3ms per entity with in clause (PersonService.getInJurisdictionIDs).
Analysis

Dataset:

  • 1225343 persons
  • 85677 cases
  • 114782 contacts
  • 236583 tasks
  • 2600890 immunizations
  • 1991 vaccinations
  • 15693 samples
  • 11846 events
  • 534 eventparticipants

The following measurements were taken from backend logs (EJB methods) and the Postgres logs. Observations:

  • the actual SQL queries are executed in few seconds while the EJB methods take minutes to complete
  • a relevant contribution to runtime comes from inJurisdictionOrOwned methods, here mainly the number of calls
  • batch size is relevant. Some requests did not complete within reasonable time for a batch size of 10000, these were measured for batch size 1000.

Requests:

  • http://localhost:6080/sormas-rest/persons/all/1637090372005/10000/NO_LAST_SYNCED_UUID: 3 min 5 sec server: 2 min 58 sec image SQL queries: duration: 948.716 ms duration: 2297.275 ms duration: 4012.220 ms duration: 4393.538 ms duration: 696.277 ms ~ 12.5 sec Note: SQL query times are inaccurate, either due to an error in the analysis or changes introduced later, see later analysis of this method.

  • http://localhost:6080/sormas-rest/cases/all/1637090372005/10000/NO_LAST_SYNCED_UUID: 8 min 21 sec server: 8 min 18 sec image SQL queries: duration: 787.011 ms duration: 449.616 ms duration: 146.224 ms duration: 178.515 ms duration: 154.552 ms duration: 136.836 ms duration: 142.597 ms duration: 143.688 ms duration: 150.584 ms duration: 152.298 ms duration: 133.814 ms duration: 147.687 ms ~ 3 sec

  • http://localhost:6080/sormas-rest/contacts/all/1637090372005/10000/NO_LAST_SYNCED_UUID: 35 sec server: 35 sec image SQL queries: duration: 151.413 ms duration: 143.237 ms duration: 152.661 ms duration: 144.054 ms duration: 145.374 ms duration: 155.861 ms duration: 177.212 ms duration: 153.647 ms duration: 201.516 ms duration: 187.382 ms ~ 1.6 sec

  • http://localhost:6080/sormas-rest/tasks/all/1637090372005/1000/NO_LAST_SYNCED_UUID: 1 min 27 sec server: 1 min 24 sec image SQL queries: duration: 466.097 ms duration: 154.093 ms duration: 132.150 ms duration: 137.751 ms duration: 140.757 ms duration: 135.526 ms duration: 136.612 ms duration: 126.399 ms duration: 124.695 ms duration: 127.888 ms duration: 133.940 ms ~1.9 sec

  • http://localhost:6080/sormas-rest/samples/all/1637090372005/10000/NO_LAST_SYNCED_UUID: 27 sec server: 27 sec image SQL queries: duration: 129.587 ms

  • http://localhost:6080/sormas-rest/immunizations/all/1637090372005/1000/NO_LAST_SYNCED_UUID (1min 50 sec) server: 1 min 48 sec image SQL queries: duration: 6985.128 ms duration: 106.608 ms ~ 7 sec

Proposed Change

  • [x] 1. Rewrite AdoServiceWithUserFilter.getAllAfter to first fetch the needed ids (see pattern in PersonService.getAllAfter, then fetch the entities by id with IN-clause (use BaseAdoService.getByIds).
  • [x] 2. Add indices for sorting.
  • [x] 3. Use the pattern of PersonService.getInJurisdictionIDs to query by ids with IN clause also for other entities where "Entity"Service.inJurisdictionOrOwned is currently running one query per Entity.

Acceptance Criteria

  • [x] An analysis before/after shows the performance improvement (also mentioning the amount of existing and queried entities). -> comment

Implementation Details

  • All getAllAfter and getInJurisdictionIDs methods avoid parameter limit exception with IterableHelper.executeBatched batching.

  • Remove or adapt not with superclass aligned getAllAfter implementations in:

    • CampaignService.getAllAfter
    • CampaignFormMetaService.getAllAfter
    • ContactService.getAllAfter (active contacts filter)
    • EventParticipantService.getAllAfter (not archived filter)
    • EventService.getAllAfter (configured EventUserFilterCriteria)
    • ImmunizationService.getAllAfter
    • BaseTravelEntryService.getAllAfter
  • Remove or adapt not with superclass aligned implementations parallel to getAllAfter:

    • CampaignFormDataService.getAllActiveAfter
    • TreatmentService.getAllActiveTreatmentsAfter
    • PathogenTestService.getAllActivePathogenTestsAfter
    • PrescriptionService.getAllActivePrescriptionsAfter
    • SampleService.getAllActiveSamplesAfter
    • ActionService.getAllActionsAfter
    • AdditionalTestService.getAllActiveAdditionalTestsAfter
    • ClinicalVisitService.getAllActiveClinicalVisitsAfter
    • TaskService.getAllActiveTasksAfter
    • CaseService.getAllActiveCasesAfter with boolean includeExtendedChangeDateFilters that is never used outside of tests (introduced with #2059, partly removed by #2674).

Additional Information

  • Sibling to #8946
  • Batch jurisdiction processing for Persons was done in #2130 (commit 305b169).
  • I tried to use CREATE INDEX CONCURRENTLY but that does not work within a transaction as it is used in the automatic db migration. If deployments get stuck because of INDEX creation (not sure what the likelyhood is), they can be created with CREATE INDEX CONCURRENTLY IF NOT EXISTS ... before on running instance.
  • This ticket improves the performance of the following functions:
    1. Sync for android-app (getAllAfter methods) for most to all entities.
    2. List (getIndexList) for AdditionalTest, PathogenTest
    3. List (getByUuids, getBy"Reference"/"criteria") for Case, Sample, Task, AdditionalTest, PathogenTest, Prescription, Treatment, Vaccination, Visit.

Performance Analysis

Preliminary analysis of performance before and after the changes introduced in PR #10282

(tested on dafd47df2f756277323e15d28a3c6f678b1864e6)

The analysis was conducted by measuring the runtime of REST calls to the respective endpoints with batch size 1000. A substantial improvement can be observed for all entities with long response times:

cases (1000) .................00:05:28 -> 00:01:31
contacts (1000) ..............00:01:07 -> 00:00:30
immunizations (1000) .........00:01:16 -> 00:00:12
samples (1000) ...............00:01:41 -> 00:00:20
tasks (1000) .................00:01:06 -> 00:00:18
visits (1000) ................00:00:27 -> 00:00:04

An exception is pathogen test with no improvement / slightly worse performance:

pathogentests (1000) .........00:01:15 -> 00:01:26

The results based on the dataset of the initial analysis (see issue description). Further measurements will be performed as needed (e.g., retesting of individual endpoints, larger batch sizes etc.).

Analysis data: performance_9320.zip

Detailed analysis

Full list of REST calls Performance of REST calls (.../<entities>/all/) before the refacoring:

actions (1000) ...............00:00:00
additionaltests (1000) .......00:00:01
aggregatereports (1000) ......00:00:00
areas (1000) .................00:00:00
campaigns (1000) .............00:00:00
cases (1000) .................00:05:28
clinicalvisits (1000) ........00:00:00
communities (1000) ...........00:00:00
contacts (1000) ..............00:01:07
continents (1000) ............00:00:00
countries (1000) .............00:00:01
customizableenumvalues (1000) 00:00:00
diseaseconfigurations (1000) .00:00:00
districts (1000) .............00:00:00
eventparticipants (1000) .....00:00:07
events (1000) ................00:00:04
featureconfigurations (1000) .00:00:00
immunizations (1000) .........00:01:16
pathogentests (1000) .........00:01:15
persons (1000) ...............00:00:07
pointsofentry (1000) .........00:00:00
prescriptions (1000) .........00:00:00
regions (1000) ...............00:00:00
samples (1000) ...............00:01:41
sharerequests (1000) .........00:00:00
subcontinents (1000) .........00:00:00
tasks (1000) .................00:01:06
treatments (1000) ............00:00:00
users (1000) .................00:00:00
userroles (1000) .............00:00:00
visits (1000) ................00:00:27
weeklyreports (1000) .........00:00:01

... and after:

actions (1000) ...............00:00:01
additionaltests (1000) .......00:00:01
aggregatereports (1000) ......00:00:00
areas (1000) .................00:00:00
campaigns (1000) .............00:00:01
cases (1000) .................00:01:31
clinicalvisits (1000) ........00:00:00
communities (1000) ...........00:00:00
contacts (1000) ..............00:00:30
continents (1000) ............00:00:00
countries (1000) .............00:00:00
customizableenumvalues (1000) 00:00:00
diseaseconfigurations (1000) .00:00:00
districts (1000) .............00:00:00
eventparticipants (1000) .....00:00:05
events (1000) ................00:00:02
featureconfigurations (1000) .00:00:00
immunizations (1000) .........00:00:12
pathogentests (1000) .........00:01:26
persons (1000) ...............00:00:08
pointsofentry (1000) .........00:00:00
prescriptions (1000) .........00:00:01
regions (1000) ...............00:00:00
samples (1000) ...............00:00:20
sharerequests (1000) .........00:00:00
subcontinents (1000) .........00:00:00
tasks (1000) .................00:00:18
treatments (1000) ............00:00:01
users (1000) .................00:00:00
userroles (1000) .............00:00:00
visits (1000) ................00:00:04
weeklyreports (1000) .........00:00:00

Execution details for selected entities

Cases before: image

after: image

Here the calls to CaseService.inJurisdictionOrOwned per selected case have been dropped, roughly accounting for the measurde difference.

Contacts before: image

after: image

As for cases, a heavy reduction of calls to ContactService.inJurisdictionOrOwned.

Immunizations before: image

after: image

Here the main culprits were calls to ImmunizationService.inJurisdictionOrOwned and FeatureConfigurationFacadeEjb.isPropertyValueTrue

Tasks before: image

after: image

Here, calls to TaskService.inJurisdictionOrOwned on a per object basis accounted for most of the time spent.

Analysis for different batch sizes

A analysis with different batch sizes after completion of this task shows:

  • request durations are substantially improved (see times above)
  • there is little overhead (long running SQL queries etc.) for small batch sizes: for batch size 1, no entity exceeds a response time of 1second
  • for larger batch sizes, response time is approximately linear to the batch size
  • this is consistent across all entities (response time for 1000 pathogentests is down to 15 sec now)

Response times for selected entities:

cases
(1) ....................00:00:01
(10) ...................00:00:03
(100) ..................00:00:08
(500) ..................00:00:36
(1000) .................00:01:11
(5000) .................00:05:52

persons
(1) ..................00:00:01
(10) .................00:00:00
(100) ................00:00:01
(500) ................00:00:04
(1000) ...............00:00:07
(5000) ...............00:00:33

samples
(1) ..................00:00:01
(10) .................00:00:01
(100) ................00:00:02
(500) ................00:00:10
(1000) ...............00:00:17
(5000) ...............00:01:10

Analysis data: differentBatchSizes.zip

StefanKock avatar May 24 '22 12:05 StefanKock

Hi there, I am not sure if this is related but since recently we are having issues to get all tasks created within the last 24 hours via /tasks/all/{since} No matter the size of the response e.g. even if only 1 task will be returned matching the time criteria. No response even for 5 minutes. The highest possible timespan that we got working on some instances is last 4 hours. Even that one takes a couple of minutes to respond. Let us know if you need more info around the issue and keep us posted about the outcome here.

Hi there, I am not sure if this is related but since recently we are having issues to get all tasks created within the last 24 hours via /tasks/all/{since}

Identifying the changed entities is what we want to improve. That includes also tasks.

StefanKock avatar May 31 '22 13:05 StefanKock

Perfect! Are there already some concrete plans when can this one be expected to make its way to production?

Discussed current status and next steps with @HolgerReiseVSys and @syntakker. Results:

  1. Update feature branch to current development branch
  2. @syntakker will start with performance tests on cases, contacts, immunizations, tasks, samples as prechecks
  3. Integrate CaseService.getAllActiveCasesAfter as all the other sibling methods.
  4. Refactor BaseAdoService.getAll to getList to be reusable with other filters, make it protected or leave public?
  5. Indizes for all entities that implement AdoServiceWithUserFilter

StefanKock avatar Sep 06 '22 08:09 StefanKock

Helper used to create sort indices: IndexBuilderHelper.xlsx

StefanKock avatar Sep 07 '22 15:09 StefanKock

Preliminary analysis of performance before and after the changes introduced in PR #10282

(tested on dafd47df2f756277323e15d28a3c6f678b1864e6)

The analysis was conducted by measuring the runtime of REST calls to the respective endpoints with batch size 1000. A substantial improvement can be observed for all entities with long response times:

cases (1000) .................00:05:28 -> 00:01:31
contacts (1000) ..............00:01:07 -> 00:00:30
immunizations (1000) .........00:01:16 -> 00:00:12
samples (1000) ...............00:01:41 -> 00:00:20
tasks (1000) .................00:01:06 -> 00:00:18
visits (1000) ................00:00:27 -> 00:00:04

An exception is pathogen test with no improvement / slightly worse performance:

pathogentests (1000) .........00:01:15 -> 00:01:26

The results based on the dataset of the initial analysis (see issue description). Further measurements will be performed as needed (e.g., retesting of individual endpoints, larger batch sizes etc.).

Analysis data: performance_9320.zip

Detailed analysis

Full list of REST calls Performance of REST calls (.../<entities>/all/) before the refacoring:

actions (1000) ...............00:00:00
additionaltests (1000) .......00:00:01
aggregatereports (1000) ......00:00:00
areas (1000) .................00:00:00
campaigns (1000) .............00:00:00
cases (1000) .................00:05:28
clinicalvisits (1000) ........00:00:00
communities (1000) ...........00:00:00
contacts (1000) ..............00:01:07
continents (1000) ............00:00:00
countries (1000) .............00:00:01
customizableenumvalues (1000) 00:00:00
diseaseconfigurations (1000) .00:00:00
districts (1000) .............00:00:00
eventparticipants (1000) .....00:00:07
events (1000) ................00:00:04
featureconfigurations (1000) .00:00:00
immunizations (1000) .........00:01:16
pathogentests (1000) .........00:01:15
persons (1000) ...............00:00:07
pointsofentry (1000) .........00:00:00
prescriptions (1000) .........00:00:00
regions (1000) ...............00:00:00
samples (1000) ...............00:01:41
sharerequests (1000) .........00:00:00
subcontinents (1000) .........00:00:00
tasks (1000) .................00:01:06
treatments (1000) ............00:00:00
users (1000) .................00:00:00
userroles (1000) .............00:00:00
visits (1000) ................00:00:27
weeklyreports (1000) .........00:00:01

... and after:

actions (1000) ...............00:00:01
additionaltests (1000) .......00:00:01
aggregatereports (1000) ......00:00:00
areas (1000) .................00:00:00
campaigns (1000) .............00:00:01
cases (1000) .................00:01:31
clinicalvisits (1000) ........00:00:00
communities (1000) ...........00:00:00
contacts (1000) ..............00:00:30
continents (1000) ............00:00:00
countries (1000) .............00:00:00
customizableenumvalues (1000) 00:00:00
diseaseconfigurations (1000) .00:00:00
districts (1000) .............00:00:00
eventparticipants (1000) .....00:00:05
events (1000) ................00:00:02
featureconfigurations (1000) .00:00:00
immunizations (1000) .........00:00:12
pathogentests (1000) .........00:01:26
persons (1000) ...............00:00:08
pointsofentry (1000) .........00:00:00
prescriptions (1000) .........00:00:01
regions (1000) ...............00:00:00
samples (1000) ...............00:00:20
sharerequests (1000) .........00:00:00
subcontinents (1000) .........00:00:00
tasks (1000) .................00:00:18
treatments (1000) ............00:00:01
users (1000) .................00:00:00
userroles (1000) .............00:00:00
visits (1000) ................00:00:04
weeklyreports (1000) .........00:00:00

Execution details for selected entities

Cases before: image

after: image

Here the calls to CaseService.inJurisdictionOrOwned per selected case have been dropped, roughly accounting for the measurde difference.

Contacts before: image

after: image

As for cases, a heavy reduction of calls to ContactService.inJurisdictionOrOwned.

Immunizations before: image

after: image

Here the main culprits were calls to ImmunizationService.inJurisdictionOrOwned and FeatureConfigurationFacadeEjb.isPropertyValueTrue

Tasks before: image

after: image

Here, calls to TaskService.inJurisdictionOrOwned on a per object basis accounted for most of the time spent.

syntakker avatar Sep 14 '22 04:09 syntakker

Changes on performance tooling still open

StefanKock avatar Sep 21 '22 08:09 StefanKock