data.gov O+M 2022-10-13

Day-to-day operation of Data.gov involves many Operation and Maintenance (O&M) responsibilities. Instead of having the entire team watch notifications and risk some slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility; it covers East Coast business hours only. If you will be unavailable, note when in Slack and ask someone to take on the role for that time.

Routine Tasks

Check the Actions tab of each active repository
- Inventory Restart Action
- Inventory Deploy Action
- Catalog Restart Action
- Catalog Deploy Action
- Solr Brokerpak Release Action
  - Note the release version
- EKS Brokerpak Release Action
  - Note the release version
- SMTP Brokerpak Release Action
  - Note the release version
- SSB Deploy Action
  - Validate it is using the most recent (working) release of each brokerpak.
Verify that the Solr leader and each follower are functional

Use this command to find Solr URLs and credentials in the prod space.
```
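# Target the prod space
$ cf t -s prod
# Show the Solr URIs, follower URLs, and credentials from the catalog-web environment
$ cf env catalog-web | grep solr -C 2 | grep "uri\|solr_follower_individual_urls\|password\|username"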
```
- Verify that each instance's Start time at path /solr/#/ is consistent with the Solr Memory Alert history
- Verify each follower stays in sync with the Solr leader at path /solr/#/ckan/core-overview
- Verify each Solr is responsive by running a few queries at /solr/#/ckan/query
- Inspect each Solr's logging for abnormal errors at /solr/#/~logging
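
The responsiveness check above can also be scripted. The following is a minimal sketch, not the team's actual tooling: the helper names are hypothetical, the core name `ckan` is taken from the admin UI paths above, and it assumes each base URL is just the scheme and host (no `/solr` suffix) and that Solr's standard ping and select handlers are reachable with the credentials found by the `cf env` command.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the SOLR_CORE default are illustrative assumptions.
SOLR_CORE="${SOLR_CORE:-ckan}"

# Build the URL of the core's ping handler from a base URL such as https://host
solr_ping_url() {
  printf '%s/solr/%s/admin/ping?wt=json\n' "${1%/}" "$SOLR_CORE"
}

# Build a cheap match-all query URL that returns only the document count.
solr_count_url() {
  printf '%s/solr/%s/select?q=*:*&rows=0&wt=json\n' "${1%/}" "$SOLR_CORE"
}

# Ping one Solr instance; prints OK or FAIL and returns non-zero on failure.
# Arguments: base URL, username, password.
solr_check() {
  local base="$1" user="$2" pass="$3"
  if curl --silent --fail --max-time 10 --user "$user:$pass" \
      "$(solr_ping_url "$base")" >/dev/null; then
    echo "OK   $base"
  else
    echo "FAIL $base"
    return 1
  fi
}
```

Calling `solr_check "$url" "$username" "$password"` once for the leader and once per follower gives a quick pass/fail list. It does not replace looking at the logging and core-overview pages.
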
Examine the Solr Memory Utilization Graph to catch any abnormal incidents.

Log in to the tts-jump AWS account with role SSBDev@ssb-production and open the custom SolrAlarm dashboard to see the graph for the past 24 hours. No Solr instance should have MemoryUtilization above the 90% threshold, and no instance should restart too often (more than a few times a week).
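
If the MemoryUtilization values are copied out of the dashboard, the 90% rule can be applied mechanically. This is a rough sketch under stated assumptions: the function name and the input format (one `instance percent` pair per line) are hypothetical, and how the values are exported is left open.

```shell
# Sketch only: the input format (one "name percent" pair per line) is an assumption.
MEMORY_THRESHOLD="${MEMORY_THRESHOLD:-90}"

# Read "instance percent" pairs on stdin and print the ones above the threshold.
# Returns non-zero if any instance is above the threshold.
flag_high_memory() {
  awk -v limit="$MEMORY_THRESHOLD" '
    $2 + 0 > limit + 0 { printf "ALERT %s %.1f%%\n", $1, $2; bad = 1 }
    END { exit (bad ? 1 : 0) }
  '
}
```

For example, `printf 'solr-leader 72.5\nsolr-follower-0 93.1\n' | flag_high_memory` reports only the follower. The instance names here are made up for illustration.
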
Verify harvesting jobs are running, and go through the error reports to catch unusual errors that need attention [Wiki doc]
Go through the New Relic logs to make sure each app's log is current
Watch for user email requests
Triage DMARC Report from Google (daily) sent to [email protected] (only for catalog in prod).
Watch #datagov-alerts and the vulnerable dependency notifications (daily email reports) for critical alerts.

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten.

[ ] Audit log updated for AU-6 Log auditing (Friday).
[x] Any New Relic alerts have been addressed or GH issues created.
[x] Weekly Nessus scan has been triaged.
[x] Weekly Snyk scan is complete.
[x] Weekly resources.data.gov link scan
[x] If received, the monthly Netsparker scan has been triaged.
[x] Finishing the shift: Log the number of alerts

Sep 29 '22 21:09 hkdctol

Updated with latest template.

Sep 30 '22 15:09 FuhuXia

Solr follower 0 went down today due to some weird Solr behavior. The index folder was renamed to "/var/solr/data/ckan/data/index.20221007160458545" for an unknown reason. We modified the solr_setup.sh file to address the issue and brought follower 0 back online.

Oct 07 '22 21:10 FuhuXia

Old O&M definitely done 😅

Feb 02 '23 22:02 nickumia-reisys
