Feature | Leverage Reference Architecture Workloads Production Readiness Checklist

Open exequielrafaela opened this issue 3 years ago • 0 comments

What?

  • ✅ Have a Leverage Reference Architecture Workloads Production Readiness Checklist

Reference articles

  1. [Main article] https://medium.com/@aleksei.kornev/production-readiness-checklist-for-backend-applications-8d2b0c57ccec
  2. [Must read] https://cloud.google.com/architecture/devops/technical
  3. https://gruntwork.io/devops-checklist
  4. https://devopschecklist.com/
  5. https://www.weave.works/blog/production-ready-checklist-kubernetes
  6. https://go.weave.works/production-ready-kubernetes-checklist

Why?

  • Before go-live, Leverage users should review everything relevant to their production launch and apply it to their own case.
  • Organize this information better and build on well-known checklists for this purpose.

Checklist Template

VCS (Version Control)

  • ✅ All components should be versioned under a VCS
    • Infra components (IaC)
    • App components
    • Config components
  • ✅ You must have a clear branching strategy definition
  • ✅ [Nice to have] Trunk-based development. If it fits your project, it is a proven DevOps best practice.
  • ✅ Following the branching model you use (eg: git-flow), assign version tags (eg: sem-ver release mgmt) so that branches can be prepared for quick fixes, rollbacks, etc.
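
The version-tagging item can be illustrated with a small sketch. It assumes tags of the form `v<major>.<minor>.<patch>`; that format and the helper names `parse_tag`, `latest_tag`, and `next_tag` are hypothetical and not part of the Leverage tooling.

```python
import re

# Assumed tag format: "v<major>.<minor>.<patch>", e.g. "v1.4.2".
TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def parse_tag(tag):
    """Return (major, minor, patch) as integers for a tag like 'v1.4.2'."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a semver tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def latest_tag(tags):
    """Pick the highest version among the valid tags, ignoring anything else."""
    valid = [tag for tag in tags if TAG_PATTERN.match(tag)]
    return max(valid, key=parse_tag)


def next_tag(tag, bump):
    """Compute the next tag for a 'major', 'minor', or 'patch' bump."""
    major, minor, patch = parse_tag(tag)
    if bump == "major":
        return f"v{major + 1}.0.0"
    if bump == "minor":
        return f"v{major}.{minor + 1}.0"
    if bump == "patch":
        return f"v{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump: {bump!r}")
```

Comparing parsed integer tuples rather than raw strings matters here: as strings, `v1.9.0` would sort above `v1.10.0`. A hotfix branch cut from the latest tag would then be released as `next_tag(latest, "patch")`.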

CI (Continuous Integration)

  • ✅ Release strategy process [if an older version of the product exists]: describe the procedure that you are going to use, such as big bang, rolling upgrade, or canary.
  • ✅ CI must be automated and triggered (at least on every PR to master)
  • ✅ Developers integrate all their work into the main version of the code base (known as trunk -> eg master) on a regular basis (at least daily).
  • ✅ A set of automated tests is run both before and after the merge in order to validate that the changes don't introduce regression bugs.
  • ✅ [Nice to have] Shift left on security (a DevOps-based strategy): a set of automated security tests that integrates security testing and controls into the daily work of development, QA, and operations (much of this work can be automated and put into your CI/CD pipelines)
  • ✅ If these automated tests fail, developers should be able to stop what they are doing and fix the problem immediately.
  • ✅ Ensure that developer branches don't diverge significantly from trunk.
  • ✅ Production sign-off process: have a person (Release Manager) who is responsible for production deployment and who signs off that production is ready to go

Tests (Continuous Testing)

  • ✅ [Nice to have] Clear testing strategy already in place
  • ✅ [Nice to have] Unit tests automatically triggered (at least on every PR to master)
  • ✅ [Nice to have] Integration tests automatically triggered (at least on every PR to master)
  • ✅ [Nice to have] Functional tests automatically triggered (at least on every PR)
  • ✅ Have a list of smoke tests (ideally automated) that show that production is up and running
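
An automated smoke test list could look roughly like the sketch below. The check names, URLs, and the helpers `http_status` and `run_smoke_tests` are hypothetical; the real list depends on your workload.

```python
from urllib.request import urlopen


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except OSError:
        # URLError and HTTPError are OSError subclasses, so connection
        # failures and 4xx/5xx responses both end up here.
        return None


def run_smoke_tests(checks, fetch=http_status):
    """checks maps a name to (url, expected_status); returns the failed names."""
    failed = []
    for name, (url, expected) in checks.items():
        if fetch(url) != expected:
            failed.append(name)
    return failed


# Hypothetical endpoints, for illustration only.
EXAMPLE_CHECKS = {
    "web": ("https://example.com/", 200),
    "api": ("https://example.com/api/health", 200),
}
```

Calling `run_smoke_tests(EXAMPLE_CHECKS)` right after a deployment returns the names of the failing checks; an empty list means every check passed. The `fetch` parameter is injectable so the list itself can be tested without network access.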

Deployment

Security | Access (IAM)

  • ✅ Only the production team should have access to production environments
  • ✅ Developers may have access to logs and monitoring information

Security | Secrets

  • ✅ Secrets should be stored in a dedicated secret management service (which should be HA)
  • ✅ No passwords or keys should be stored anywhere outside that service (eg: not in code or config files)
  • ✅ A secret management strategy should be defined, including who holds the keys
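
One way to check that no passwords or keys have leaked into versioned files is a simple pattern scan, sketched below. The two rules are illustrative assumptions and far from exhaustive; a dedicated secret-scanning tool would cover many more cases. The function name `find_hardcoded_secrets` is hypothetical.

```python
import re

# Illustrative rules only; real scanners ship much larger rule sets.
RULES = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "password assignment": re.compile(
        r"(?i)\b(password|secret)\s*[:=]\s*[\"'][^\"']+[\"']"
    ),
}


def find_hardcoded_secrets(text):
    """Return (line_number, rule_name) pairs for every suspicious line."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in RULES.items():
            if pattern.search(line):
                findings.append((number, rule_name))
    return findings
```

Run against each changed file in CI, a non-empty result would fail the build before the secret reaches the main branch.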

Data Replication

  • ✅ Databases should be deployed with at least the minimum required number of nodes and a replication factor that lets you recover as quickly as possible
  • ✅ File or object storage solutions that your workload components (compute, controllers, etc) depend on should be deployed with at least the minimum required number of nodes and a replication factor that lets you recover as quickly as possible
  • ✅ [Nice to have] Database change management with automated DB migrations
  • ✅ [Nice to have] File or object storage solutions that your tooling components (ci/cd, monitoring, logs) depend on should be deployed with at least the minimum required number of nodes and a replication factor that lets you recover as quickly as possible

Backups

  • ✅ The backup strategy should be applied to each database
  • ✅ The backup strategy should be applied to each file storage solution (eg: VMs and K8s root volumes and NFS)
  • ✅ The backup strategy should be applied to each object storage solution (eg: AWS S3)
  • ✅ Backups should have a validation process in place
  • ✅ The backup process should be automated and tested
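
A minimal sketch of one possible validation step: comparing a backup file against a SHA-256 checksum recorded when the backup was taken. The helper names are hypothetical, and a checksum only shows that the file is intact; a periodic test restore is still needed to show that the backup is usable.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_valid(path, expected_sha256):
    """Return True when the file matches the checksum recorded at backup time."""
    return sha256_of_file(path) == expected_sha256.lower()
```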

Performance and Capacity Planning

  • ✅ Make sure that you understand the amount of resources each component in the system needs
  • ✅ Make sure that you configure resource limits for all services, to avoid situations where a memory leak in one service takes down everything around it.
  • ✅ [Nice to have] Automated performance and load testing
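
As a rough capacity check, the configured memory limits of all services on a node can be summed and compared with what the node can offer. The sketch below assumes limits expressed in MiB and a fixed system reserve; the function name and the example numbers are hypothetical.

```python
def memory_headroom_mib(node_capacity_mib, service_limits_mib, system_reserve_mib=0):
    """Return the memory left on a node after the reserve and all service limits.

    A negative result means the limits are overcommitted: if every service
    reached its limit at the same time, the node would run out of memory.
    """
    committed = sum(service_limits_mib.values())
    return node_capacity_mib - system_reserve_mib - committed


# Example: an 8 GiB node with 1 GiB reserved for the OS and agents.
limits = {"api": 2048, "worker": 4096, "cache": 512}
headroom = memory_headroom_mib(8192, limits, system_reserve_mib=1024)  # 512
```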

Compute

  • ✅ Validate that you're correctly implementing a Cloud Native approach: check that the 5 NIST characteristics of cloud computing are being addressed
    • On demand self-service: provision computing resources as needed and automated via IaC
    • Broad network access: multiple platforms (mobile phones, tablets, laptops, and workstations)
    • Resource pooling: provider resources are pooled in a multi-tenant model, with physical and virtual resources dynamically assigned on-demand.
    • Rapid elasticity: horizontal vs vertical scaling
    • Measured service: keep these metrics in sight.
  • ✅ Time configuration / NTP [optional] - Make sure that time is synced across all nodes of your cluster

Network

  • ✅ The network is isolated from the internet (nothing is reachable from the internet)
  • ✅ APIs are covered by API Gateway
  • ✅ Databases are available only from the production network
  • ✅ [Nice to have] Application is deployed to at least 2 different locations (VPC peering could be configured between locations)

Monitoring & Health Checks

  • ✅ Need to monitor each service of your application
  • ✅ Need to monitor databases
  • ✅ Need to monitor 3rd party services - their health should be checked by the application service itself, through the 3rd party service's own health check endpoints, or by scripts written for that purpose.
  • ✅ Need to monitor Kubernetes (in case you have it)
  • ✅ Configure request tracking (when there are multiple services)
  • ✅ Application Services should provide health check endpoints
  • ✅ [Nice to have] SRE team following DevOps practices
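
A health check endpoint can be as small as the following sketch, which uses only the Python standard library. The `/healthz` path, the response body, and the default port are assumptions chosen for illustration; a real service would usually also verify its own dependencies (database, queues) before reporting healthy.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answer GET /healthz with 200 and a small JSON body; 404 otherwise."""

    def do_GET(self):
        if self.path == "/healthz":
            body = json.dumps({"status": "ok"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Suppress per-request logging to keep the sketch quiet.
        pass


def make_server(port=8080):
    """Build a server bound to localhost; port 0 picks any free port."""
    return HTTPServer(("127.0.0.1", port), HealthHandler)
```

Running `make_server().serve_forever()` exposes the endpoint, which a monitoring system or load balancer can then poll.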

Log Aggregation / Centralization

  • ✅ Logs are centralized to log aggregation service
  • ✅ Add filters at least for ERROR and WARN
  • ✅ Add filters per service / application / component if needed
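
The filters above would normally be configured in the log aggregation service itself. As an illustration of the logic only, the sketch below assumes a hypothetical line format of `<timestamp> <LEVEL> <service> <message>`.

```python
ALERTING_LEVELS = {"ERROR", "WARN"}


def filter_logs(lines, service=None):
    """Keep ERROR and WARN lines, optionally restricted to a single service.

    Assumes each line looks like "<timestamp> <LEVEL> <service> <message>";
    lines that do not fit this shape are skipped.
    """
    kept = []
    for line in lines:
        parts = line.split(maxsplit=3)
        if len(parts) < 4:
            continue
        _, level, line_service, _ = parts
        if level not in ALERTING_LEVELS:
            continue
        if service is not None and line_service != service:
            continue
        kept.append(line)
    return kept
```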

Alerts

  • ✅ Alerts are configured at least for basic metrics such as CPU, memory, IOPS, disk space
  • ✅ [Nice to have] Configure alerts for errors in logs
  • ✅ [Nice to have] Use error reporting services
  • ✅ Have a mechanism to turn off alerts during deployment (if deployments can generate expected alerts), to avoid false positives
  • ✅ All alerts should be tracked
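
Threshold alerts on basic metrics, together with muting during deployments, can be sketched as follows. The metric names, thresholds, and the `firing_alerts` helper are hypothetical; in practice this logic lives in the alerting system's own rules and silences.

```python
def firing_alerts(metrics, thresholds, deploying=False, muted_during_deploy=frozenset()):
    """Return the sorted names of metrics that exceed their threshold.

    While a deployment is in progress, metrics listed in
    muted_during_deploy are skipped so expected spikes do not page anyone.
    """
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is None or value <= limit:
            continue
        if deploying and name in muted_during_deploy:
            continue
        alerts.append(name)
    return sorted(alerts)


# Hypothetical thresholds, in percent.
THRESHOLDS = {"cpu": 90, "memory": 85, "disk": 80}
```

For example, with `metrics = {"cpu": 97, "memory": 60, "disk": 91}` and `muted_during_deploy={"cpu"}`, only the disk alert fires while a deployment is running, and both the CPU and disk alerts fire otherwise.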

Documentation & Wiki

  • ✅ Contains common issues and resolutions
  • ✅ Contains instructions about production management (deployment command examples, purge scripts, other automation)

DR (Disaster Recovery)

  • ✅ Need to have a disaster recovery plan with different levels of detail and ETAs, for example: how to recover one service, how to recover a database, how to recover the whole cluster, how to recover the whole region

exequielrafaela avatar May 05 '21 17:05 exequielrafaela