Site Reliability Engineering

Mindset

Try to quantify the symptoms, or confirm the issue with the customer, before diving deep into analysis.

Problem Statement Method
- What makes you think there is a performance problem?
- Has this system ever performed well?
- What has changed recently? (Software? Hardware? Load?)
- Can the performance degradation be expressed in terms of latency or runtime?
- Does the problem affect other people or applications? (Or is it just you?)
- What is the environment?

Understanding Linux

Linux Performance Analysis in 60 seconds

Check load average
```
$ uptime
```
Check kernel errors
```
$ dmesg -T | tail
```
Check overall stats by time
```
$ vmstat 1
```
Check CPU balance
```
$ mpstat -P ALL 1
```
Check process usage
```
$ pidstat 1
```
Check disk I/O
```
$ iostat -xz 1
```
Check memory usage
```
$ free -m
```
Check network I/O
```
$ sar -n DEV 1
```
Check TCP stats
```
$ sar -n TCP,ETCP 1
```
Overview
```
$ top
```
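The checklist above can be wrapped in a small script so that every check runs for a bounded time instead of streaming forever. This is a minimal sketch, not a definitive tool: the function names `run_check` and `sixty_second_checks` and the sample count of 5 are assumptions, and several commands (`mpstat`, `pidstat`, `iostat`, `sar`) are assumed to come from the sysstat package, so missing tools are skipped rather than treated as errors.

```shell
# Run one check: print a label, then the command; skip it if the tool is missing.
run_check() {
  label=$1
  shift
  printf '\n== %s ==\n' "$label"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" || echo "(command failed: $*)"
  else
    echo "(skipped: $1 not installed)"
  fi
}

# Same order as the checklist; interval 1 second, 5 samples where applicable.
sixty_second_checks() {
  run_check "load average"  uptime
  run_check "kernel errors" sh -c 'dmesg -T | tail'
  run_check "overall stats" vmstat 1 5
  run_check "CPU balance"   mpstat -P ALL 1 5
  run_check "process usage" pidstat 1 5
  run_check "disk I/O"      iostat -xz 1 5
  run_check "memory usage"  free -m
  run_check "network I/O"   sar -n DEV 1 5
  run_check "TCP stats"     sar -n TCP,ETCP 1 5
  run_check "overview"      top -b -n 1
}
```

Call `sixty_second_checks` to run the whole sequence; `dmesg` may need elevated privileges on some systems.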

Systems Performance: Enterprise and the Cloud

Write-up of the book Systems Performance: Enterprise and the Cloud

Operating System Knowledge

Process Control Block

Information associated with each process

Process state: running, waiting, etc.
Program counter: location of the next instruction to execute.
CPU registers: contents of all process-centric registers.
CPU scheduling information: priorities, scheduling queue pointers.
Memory-management information: memory allocated to the process.
Accounting information: CPU used, clock time elapsed since start, time limits.
I/O status information: I/O devices allocated to process, list of open files.
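On Linux, part of this per-process information can be inspected through the /proc filesystem. The sketch below assumes a Linux system with /proc mounted; the function name `show_process_info` is hypothetical, and the selected fields are only a sample of what `/proc/<pid>/status` exposes.

```shell
# Print a few PCB-like fields for a process.
# $1: path to a status file (defaults to the current shell's own entry).
show_process_info() {
  status_file="${1:-/proc/$$/status}"
  # Name/State: process state; Pid/PPid: identity; VmRSS: resident memory;
  # Threads: number of threads in the process.
  grep -E '^(Name|State|Pid|PPid|VmRSS|Threads):' "$status_file"
}
```

For example, `show_process_info /proc/1/status` shows the fields for PID 1, and `ls -l /proc/$$/fd` lists the open files of the current shell.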

Zombie and Orphan

Zombie: A process that has terminated but whose parent has not yet called wait(). When a child exits, some process must wait on it to collect its exit code. Zombies occupy only a slot in the process table and consume no CPU and essentially no memory. However, the process table is a finite resource: too many zombies can fill it up, so that no new processes can be launched.
Orphan: A process that is still running but whose parent has terminated. Orphans are adopted by init.
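Zombies can be spotted by their process state. The sketch below assumes a `ps` that supports `-eo pid,ppid,stat,comm` (as procps does); the function names `filter_zombies` and `list_zombies` are hypothetical.

```shell
# Keep the header line plus every row whose STAT column starts with Z (zombie).
filter_zombies() {
  awk 'NR == 1 || $3 ~ /^Z/'
}

# List zombie processes together with their parent PID.
list_zombies() {
  ps -eo pid,ppid,stat,comm | filter_zombies
}
```

The PPID column points at the parent that has not yet called wait(); a zombie disappears once that parent reaps it or terminates itself.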

Swap Space

The primary function of swap space is to substitute disk space for RAM when physical RAM fills up and more space is needed.
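How much swap is currently in use can be derived from the kernel's memory counters. This is a minimal sketch that assumes the Linux `/proc/meminfo` format with `SwapTotal` and `SwapFree` lines; the function name `swap_used_kb` is hypothetical.

```shell
# Print the amount of swap in use, in kB.
# $1: path to a meminfo file (defaults to /proc/meminfo).
swap_used_kb() {
  meminfo="${1:-/proc/meminfo}"
  awk '/^SwapTotal:/ { total = $2 }
       /^SwapFree:/  { free = $2 }
       END           { print total - free }' "$meminfo"
}
```

`free -m` reports the same information in its Swap row, in MiB.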

Thrashing

Thrashing can occur when total virtual memory, both RAM and swap space, becomes nearly full. The system spends so much time paging blocks of memory back and forth between swap space and RAM that little time is left for real work. The typical symptoms are obvious: the system becomes slow or completely unresponsive, and the hard drive activity light is on almost constantly.
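Sustained swap-in and swap-out activity is one way to observe this paging. The sketch below assumes the default procps `vmstat` layout, where `si` and `so` are columns 7 and 8 and the first data row is the average since boot; the function name `sum_swap_io` is hypothetical.

```shell
# Sum the si (swap-in) and so (swap-out) columns of vmstat output read on stdin.
# Skips the two header lines and the first data row (average since boot).
sum_swap_io() {
  awk 'NR > 3 { si += $7; so += $8 }
       END    { printf "si=%d so=%d\n", si, so }'
}
```

Typical usage is `vmstat 1 5 | sum_swap_io`. Values that stay well above zero across samples suggest the system is actively swapping rather than just holding idle pages in swap.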

RAID

RAID stands for Redundant Array of Independent Disks. The idea is to spread data across multiple drives in parallel for higher throughput, while using mirroring or parity for robustness.

RAID 0: Stripe data across drives for improved throughput.
- No redundancy, elevated risk: losing one drive loses the array.
RAID 1: Mirroring, write the same data to multiple devices in parallel for robustness.
- No extra write throughput (reads may be served from any mirror).
RAID 2: Use Hamming codes for parity.
- Requires on the order of log2(N) parity bits for N data bits.
- Really expensive.
RAID 3: Bitwise parity on a dedicated parity disk.
- Requires only one parity disk for N storage disks.
- Bitwise parity is slow, and the dedicated parity disk is a bottleneck.
RAID 4: Similar to RAID 3, but blockwise parity improves performance.
RAID 5: Rotating the parity block among the disks relieves the parity-disk bottleneck.
RAID 6: RAID 5 with dual parity.
- Tolerates up to two simultaneous drive failures.
- Slow rebuild.
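The single-parity levels (RAID 3, 4 and 5) rely on XOR: the parity block is the XOR of the data blocks, so any one missing block can be rebuilt by XOR-ing the survivors with the parity. A worked sketch with one byte per "disk"; the function name `xor_parity` and the sample values are illustrative assumptions.

```shell
# XOR all arguments together and print the result.
xor_parity() {
  result=0
  for block in "$@"; do
    result=$(( result ^ block ))
  done
  echo "$result"
}

# Three data blocks (one byte each) and their parity.
d0=$(( 0xA5 )); d1=$(( 0x3C )); d2=$(( 0x0F ))
parity=$(xor_parity "$d0" "$d1" "$d2")

# Simulate losing d1: rebuild it from the surviving blocks plus parity.
rebuilt_d1=$(xor_parity "$d0" "$d2" "$parity")
echo "parity=$parity rebuilt_d1=$rebuilt_d1 original_d1=$d1"
```

The rebuilt value equals the original because XOR-ing a value in twice cancels it out. RAID 6 needs a second, independent parity calculation, which plain XOR alone does not provide.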

Bash Knowledge

Special Variable

Variable	Description
$!	PID (process ID) of last job run in background.
$?	Exit status of the most recent command, function, or script.
$$	PID (Process ID) of the script itself.
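A short sketch of the three variables in use. The variable names `bg_pid`, `bg_status` and `status` are illustrative; note that `$?` is overwritten by every command, so it is captured immediately.

```shell
# $$ : PID of the current shell (the script itself when run as a script).
echo "shell PID: $$"

# $! : PID of the most recent background job.
sleep 1 &
bg_pid=$!
echo "background PID: $bg_pid"

# wait returns the exit status of the job it waited for.
wait "$bg_pid"
bg_status=$?
echo "background exit status: $bg_status"

# $? : exit status of the last command; save it before running anything else.
false
status=$?
echo "exit status of false: $status"
```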
