at_server
Investigate and smooth out load spikes on prod worker nodes
Lead: @murali-shris
Describe the bug
There are periodic large load spikes that coincide with scheduled jobs (compaction, scans, ...) and are straining our worker nodes.
Expected outcome
- Shared documented understanding of which scheduled jobs are driving load
- Adjusted job schedules to spread load ~evenly over time
@cconstab @cpswan Can you please grant me access to the required monitoring dashboards so I can investigate this issue?
Two possible causes of the CPU spikes:
- sec check, which runs every 15 mins. It creates an inbound connection to the secondary and runs scan
- hive expiry check, which is a scheduled job on the server that runs every 10 mins. This scans every key in hive and checks for expiry
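To illustrate why two fixed-interval jobs can stack up, here is a rough sketch that lists the minutes at which jobs with the stated 15 and 10 minute periods would fire together. It assumes both jobs start at the same minute 0, which is an assumption and not something confirmed from the server; `coincidence_minutes` is a hypothetical helper, not part of at_server.

```python
from math import lcm


def coincidence_minutes(periods, horizon):
    """Return the minutes in [0, horizon) at which every job fires at once.

    Assumes (hypothetically) that all jobs start aligned at minute 0.
    """
    step = lcm(*periods)  # jobs only line up at multiples of the LCM
    return list(range(0, horizon, step))


# 15-minute sec check and 10-minute hive expiry check, over one hour
overlaps = coincidence_minutes([15, 10], 60)
print(overlaps)
```

Under that alignment assumption the two jobs would overlap every 30 minutes, which is the kind of pattern to look for on the dashboards.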
Merged PR to randomise the hive expiry check: https://github.com/atsign-foundation/at_server/pull/497
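For anyone following along, a minimal sketch of the general idea of randomising a periodic job, so that many secondaries on one worker node do not all scan at the same instant. This is not the code from the PR above; the function name, the 10 minute base, and the 5 minute jitter window are all assumptions for illustration.

```python
import random


def next_delay_seconds(base_minutes=10, jitter_minutes=5, rng=random):
    """Pick the delay before the next expiry check.

    Hypothetical helper: base interval plus a uniform random offset
    in [0, jitter_minutes], so co-located instances drift apart.
    """
    return base_minutes * 60 + rng.uniform(0, jitter_minutes * 60)


# Each instance would draw its own delay before scheduling the next run
delays = [next_delay_seconds() for _ in range(3)]
print(delays)
```

Drawing a fresh delay on every run, and not only at startup, keeps instances from re-synchronising after a node-wide restart.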
Moving the task to next sprint (PR-30) to validate the performance once the changes are deployed.
@cconstab @cpswan Is this issue still occurring in prod? Is there any work to be done in the upcoming sprint related to load spikes?
I'll take a look, or @cpswan will
@murali-shris things are a lot better, but I'm still seeing some hourly spikes, so maybe another scheduled job elsewhere in the secondary?
@murali-shris @cpswan should we move priority up to high for PR43 sprint planning?
Yes @ksanty, we can revisit whether the prod spike still exists.