O+M 2022-10-13

Open hkdctol opened this issue 2 years ago • 2 comments

As part of the day-to-day operation of Data.gov, there are many Operations and Maintenance (O&M) responsibilities. Instead of having the entire team watch notifications and risk some slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility; it covers East Coast business hours only. If you will be unavailable, please note when in Slack and ask someone to take on the role for that time.

Routine Tasks

  • Check the Actions tab of each active repository

  • Verify that the Solr leader and each follower are functional

    Use these commands to find the Solr URLs and credentials in the prod space.

    $ cf t -s prod
    $ cf env catalog-web | grep solr -C 2 | grep "uri\|solr_follower_individual_urls\|password\|username"
    
    • Verify that each instance's Start time matches the Solr Memory Alert history, at path /solr/#/
    • Verify that each follower stays in sync with the Solr leader, at path /solr/#/ckan/core-overview
    • Verify that each Solr is responsive by running a few queries at /solr/#/ckan/query
    • Inspect each Solr's logs for abnormal errors at /solr/#/~logging
  • Examine the Solr Memory Utilization graph to catch any abnormal incidents.

    Log in to the tts-jump AWS account with role SSBDev@ssb-production and go to the custom SolrAlarm dashboard to see the graph for the past 24 hours. No Solr instance should have MemoryUtilization above the 90% threshold, and no Solr instance should restart too often (more than a few times a week).

  • Verify that harvesting jobs are running, and go through the error reports to catch unusual errors that need attention [Wiki doc]

  • Go through NewRelic logs to make sure each app's log is current

  • Watch for user email requests

  • Triage DMARC Report from Google (daily) sent to [email protected] (only for catalog in prod).

  • Watch #datagov-alerts and the vulnerable dependency notifications (daily email reports) for critical alerts.
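
The Solr checks in the list above can be partly scripted. The following is a minimal sketch, not the team's actual tooling. It assumes the core is named ckan (as the admin UI paths suggest) and that each instance accepts basic authentication with the username and password found via cf env. SOLR_USER and SOLR_PASS are hypothetical variable names chosen for this example.

```shell
#!/usr/bin/env bash
# Sketch only. SOLR_USER and SOLR_PASS are hypothetical environment variables;
# the "ckan" core name is taken from the admin UI paths in the checklist.

# Print the four admin UI pages from the checklist for one Solr base URL.
solr_admin_pages() {
  local base="${1%/}"   # drop a trailing slash, if any
  printf '%s\n' \
    "$base/solr/#/" \
    "$base/solr/#/ckan/core-overview" \
    "$base/solr/#/ckan/query" \
    "$base/solr/#/~logging"
}

# Run a match-all query that returns no rows.
# Exits 0 only if Solr answers without an HTTP error within 10 seconds.
solr_is_responsive() {
  local base="${1%/}"
  curl --silent --show-error --fail --max-time 10 \
    --user "${SOLR_USER}:${SOLR_PASS}" \
    "$base/solr/ckan/select?q=*:*&rows=0" > /dev/null
}
```

For example, `solr_is_responsive "$leader_url" || echo "leader did not answer"`, where leader_url is a placeholder for a URL taken from the cf env output. This only confirms that a query returns; it does not replace reviewing the logs or the core overview page.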

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten.

hkdctol, Sep 29 '22 21:09

Updated with latest template.

FuhuXia, Sep 30 '22 15:09

Solr follower 0 went down today due to some weird Solr behavior. The index folder was renamed to "/var/solr/data/ckan/data/index.20221007160458545" for an unknown reason. We modified the solr_setup.sh file to address the issue and bring follower 0 back online.

FuhuXia, Oct 07 '22 21:10

Old O&M definitely done 😅

nickumia-reisys, Feb 02 '23 22:02