data.gov
O&M 2022-10-13
As part of day-to-day operation of Data.gov, there are many Operation and Maintenance (O&M) responsibilities. Instead of having the entire team watch notifications and risk some slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility, only East Coast business hours. If you are unavailable, please note in Slack when you will be unavailable and ask for someone to take on the role for that time.
Routine Tasks
- Check the Actions tab for each active repository
  - Inventory Restart Action
  - Inventory Deploy Action
  - Catalog Restart Action
  - Catalog Deploy Action
- Solr Brokerpak Release Action
  - Note the release version
- EKS Brokerpak Release Action
  - Note the release version
- SMTP Brokerpak Release Action
  - Note the release version
- SSB Deploy Action
  - Validate it is using the most recent (working) release of each brokerpak.
- Verify the Solr leader and each follower are functional (a curl sketch follows this list).
  Use these commands to find Solr URLs and credentials in the `prod` space:
  `cf t -s prod`
  `cf env catalog-web | grep solr -C 2 | grep "uri\|solr_follower_individual_urls\|password\|username"`
  - Verify their Start time is in sync with the Solr Memory Alert history at path `/solr/#/`
  - Verify each follower stays in sync with the Solr leader at path `/solr/#/ckan/core-overview`
  - Verify each Solr is responsive by running a few queries at `/solr/#/ckan/query`
  - Inspect each Solr's logging for abnormal errors at `/solr/#/~logging`
- Examine the Solr Memory Utilization graph to catch any abnormal incidents (a CLI sketch follows this list).
  Log in to the `tts-jump` AWS account with role `SSBDev@ssb-production` and go to the custom SolrAlarm dashboard to see the graph for the past 24 hours. No Solr instance should have MemoryUtilization above the 90% threshold, and no Solr should restart too often (more than a few times a week).
- Verify harvesting jobs are running; go through error reports to catch unusual errors that need attention [Wiki doc] (a spot-check sketch follows this list).
- Go through New Relic logs to make sure each app's log is current
- Watch for user email requests
- Triage the DMARC report from Google (daily) sent to [email protected] (only for catalog in prod).
- Watch #datagov-alerts and the vulnerable dependency notifications (daily email reports) for critical alerts.
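
The following is a minimal shell sketch of the Solr leader/follower check above, not an official runbook script. It assumes the URLs and credentials come out of the `cf env catalog-web` output as described; `SOLR_URL`, `SOLR_USER`, and `SOLR_PASS` are hypothetical placeholders you substitute by hand, and the check relies only on standard Solr endpoints (`select` and the replication API).

```sh
# Target the prod space and pull Solr connection details (same grep as above).
cf t -s prod
cf env catalog-web | grep solr -C 2 \
  | grep "uri\|solr_follower_individual_urls\|password\|username"

# Placeholders -- fill these in from the cf env output.
SOLR_URL="https://example-solr-follower-0"   # one leader or follower URL
SOLR_USER="example-user"
SOLR_PASS="example-pass"

# Responsiveness: a trivial query against the ckan core should return quickly
# with a JSON response containing "numFound".
curl -s -u "$SOLR_USER:$SOLR_PASS" "$SOLR_URL/solr/ckan/select?q=*:*&rows=0"

# Replication status (standard Solr replication API): on a follower this shows
# the leader it polls and the index generation, which should track the leader's.
curl -s -u "$SOLR_USER:$SOLR_PASS" "$SOLR_URL/solr/ckan/replication?command=details&wt=json"
```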
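
For the memory-utilization step, here is a hedged CLI alternative to eyeballing the console. It assumes you have already assumed the `SSBDev@ssb-production` role in `tts-jump` and that the dashboard is literally named `SolrAlarm`; the metric namespace and dimension below are assumptions, so copy the real ones out of the dashboard definition.

```sh
# Dump the SolrAlarm dashboard definition to find the exact namespace and
# dimensions behind its MemoryUtilization widgets.
aws cloudwatch get-dashboard --dashboard-name SolrAlarm

# Pull the last 24 hours of MemoryUtilization for one Solr instance.
# "CWAgent" and the InstanceId dimension are assumptions; use whatever the
# dashboard definition actually shows. (GNU date syntax for --start-time.)
aws cloudwatch get-metric-statistics \
  --namespace "CWAgent" \
  --metric-name "MemoryUtilization" \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistics Maximum \
  --period 3600 \
  --start-time "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
```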
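
As a quick spot check that harvests are actually landing (complementing the error-report review), the sketch below uses CKAN's standard `package_search` action against catalog.data.gov sorted by `metadata_modified`; very stale timestamps at the top suggest harvesting needs attention. The `jq` formatting is optional local tooling, not something the runbook requires.

```sh
# Five most recently modified datasets on catalog.data.gov; recent timestamps
# indicate harvest jobs are completing and being indexed.
curl -s "https://catalog.data.gov/api/3/action/package_search?rows=5&sort=metadata_modified+desc" \
  | jq -r '.result.results[] | "\(.metadata_modified)  \(.name)"'
```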
Acceptance criteria
You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten.
- [ ] Audit log updated for AU-6 Log auditing (Friday).
- [x] Any New Relic alerts have been addressed or GH issues created.
- [x] Weekly Nessus scan has been triaged.
- [x] Weekly Snyk scan is complete.
- [x] Weekly resources.data.gov link scan is complete.
- [x] If received, the monthly Netsparker scan has been triaged.
- [x] Finishing the shift: Log the number of alerts
Updated with latest template.
Solr follower 0 went down today due to some weird Solr behavior. The index folder was renamed to "/var/solr/data/ckan/data/index.20221007160458545" for an unknown reason. We modified the solr_setup.sh file to address the issue and bring follower 0 back online.
Old O&M definitely done 😅