at_server Investigate and smooth out load spikes on prod worker nodes

Open gkc opened this issue 3 years ago • 9 comments

Lead: @murali-shris

Describe the bug

There are periodic large load spikes that correspond with scheduled jobs (compaction, scans, ...) and are straining our worker nodes.

Expected outcome

Shared, documented understanding of which scheduled jobs are driving load
Adjusted job schedules to spread load roughly evenly over time

Jan 24 '22 13:01 gkc

@cconstab @cpswan Can you please grant me access to the required monitoring dashboards so I can investigate this issue?

Jan 31 '22 06:01 murali-shris

Two possible causes of the CPU spikes:

sec check, which runs every 15 minutes. It creates an inbound connection to the secondary and runs a scan.
Hive expiry check, a scheduled job on the server that runs every 10 minutes. It scans every key in Hive and checks for expiry.
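
To illustrate how two fixed-interval jobs like these can pile up on the same minute, here is a rough sketch. It assumes, purely for illustration, that both jobs start at minute 0 and fire on exact intervals; the function name and the 60-minute window are hypothetical and not taken from at_server.

```python
from math import gcd


def coinciding_minutes(interval_a: int, interval_b: int, window: int) -> list[int]:
    """Return the minutes in [0, window) at which both periodic jobs fire.

    Assumes both jobs start at minute 0 and fire on exact intervals,
    which is a simplification of real scheduling.
    """
    # Two periodic jobs coincide at multiples of the least common multiple.
    lcm = interval_a * interval_b // gcd(interval_a, interval_b)
    return list(range(0, window, lcm))


# A 15-minute job and a 10-minute job over one hour.
print(coinciding_minutes(15, 10, 60))  # prints [0, 30]
```

Under those assumptions the two jobs would overlap every 30 minutes, which is the kind of stacking that spreading out the schedules is meant to avoid.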

Feb 02 '22 08:02 murali-shris

Merged a PR to randomise the hive expiry check: https://github.com/atsign-foundation/at_server/pull/497
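
For readers unfamiliar with the general technique, here is a minimal sketch of randomising a periodic job. It is not the code from the PR: `next_delay`, `base_seconds`, and `max_jitter_seconds` are hypothetical names, and the 10-minute base is only taken from the interval mentioned earlier in this thread. The idea is that each run waits the base interval plus a random offset, so many instances on one node don't all scan at the same moment.

```python
import random
from typing import Optional


def next_delay(
    base_seconds: float,
    max_jitter_seconds: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Return the delay before the next run: base interval plus random jitter."""
    # Use the caller's generator if given (handy for tests), else a fresh one.
    rng = rng or random.Random()
    return base_seconds + rng.uniform(0, max_jitter_seconds)


# Hypothetical usage: a 10-minute base interval with up to 2 minutes of jitter.
rng = random.Random(42)
delays = [next_delay(600, 120, rng) for _ in range(3)]
print(delays)
```

Each delay falls between 600 and 720 seconds, so runs drift apart over time instead of staying aligned.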

Feb 04 '22 08:02 murali-shris

Moving the task to the next sprint (PR-30) to validate performance once the changes are deployed.

Feb 07 '22 09:02 sitaram-kalluri

@cconstab @cpswan Is this issue still occurring in prod? Is there any work related to load spikes to be done in the upcoming sprint?

Feb 21 '22 04:02 murali-shris

I or @cpswan will take a look.

Feb 21 '22 04:02 cconstab

@murali-shris things are a lot better, but I'm still seeing some hourly spikes, so maybe another scheduled job elsewhere in the secondary?

Feb 21 '22 12:02 cpswan

@murali-shris @cpswan should we move priority up to high for PR43 sprint planning?

Jul 28 '22 02:07 ksanty

> @murali-shris @cpswan should we move priority up to high for PR43 sprint planning?

Yes @ksanty, we can revisit whether the prod spike still exists.

Jul 28 '22 07:07 murali-shris
