Site Reliability Engineer guide
A collection of books, research papers, videos, and articles for building proficiency in Site Reliability Engineering.
Books
SRE
- [ ] Site Reliability Engineering: How Google Runs Production Systems
- [ ] The Site Reliability Workbook: Practical Ways to Implement SRE
- [ ] Building Secure & Reliable Systems
Kubernetes platform and applications
- [ ] Docker: Up and Running
- [ ] Kubernetes: Up and Running, by Brendan Burns, Kelsey Hightower, and Joe Beda
- [ ] Microservices in Production
- [ ] Designing Data-Intensive Applications
- [ ] Designing Distributed Systems: Patterns and Paradigms for Scalable, Reliable Services - Free to download
- [ ] Software Engineering at Google - Free to download
Compute, Networking and Storage - theory and practice
- [ ] Modern Operating Systems, by Andrew S. Tanenbaum
- [ ] UNIX and Linux System Administration Handbook, by Evi Nemeth
- [ ] TCP/IP Illustrated, Volume 3: TCP for Transactions, HTTP, NNTP, and the UNIX Domain Protocols, by W. Richard Stevens
- [ ] Systems Performance: Enterprise and the Cloud
- [ ] The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
- [ ] The Practice of System and Network Administration
- [ ] The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems
- [ ] Linux Server Hacks: 100 Industrial-Strength Tips and Tools, by Rob Flickenger
- [ ] Web Operations: Keeping the Data On Time
Programming
- [ ] The Linux Command Line, by William E. Shotts Jr.
- [ ] Shell Scripting: How to Automate Command Line Tasks Using Bash Scripting and Shell Programming
- [ ] The Go Programming Language, by Alan A. A. Donovan
- [ ] Think Python, by Allen B. Downey
- [ ] Programming Pearls, by Jon L. Bentley
Other
- [ ] Time Management for System Administrators
Research papers
- [ ] Large-scale cluster management at Google with Borg
- [ ] On designing and deploying internet-scale services
- [ ] Mesos: a platform for fine-grained resource sharing in the data center
- [ ] Google: Reliable Cron across the Planet
Technologies
- [ ] Kubernetes
- [ ] CNCF landscape
- [ ] Aurora
- [ ] Docker
- [ ] Fluentd
- [ ] Elasticsearch
- [ ] Hadoop
- [ ] Mesos
- [ ] Kernel-based Virtual Machine (KVM)
- [ ] Spark
- [ ] VMware
SRE best practice
- [ ] Software engineering at Google
- [ ] Keys to SRE by Ben Treynor
- [ ] How Container Clusters Like Kubernetes Change Operations
- [ ] 10 Years of Crashing Google
- [ ] Release Engineering Best Practices at Google
- [ ] From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams
- [ ] Transactional System Administration Is Killing Us and Must be Stopped
- [ ] Lessons Learned From Scaling Uber To 2000 Engineers, 1000 Services, And 8000 Git Repositories
- [ ] Netflix: 190 Countries and 5 CORE SREs
- [ ] Performance Checklists for SREs
- [ ] Notes on SRE book
- [ ] SYSADMIN (Un)Reliability Budgets
Trainings
Conferences
- [ ] USENIX SRE conferences
- [ ] KubeCon + CloudNativeCon
- [ ] PromCon
- [ ] GrafanaCon
- [ ] DockerCon
- [ ] HashiConf
- [ ] DevOpsCon