More on EC2 instance status checks and events
Some coverage in the EC2 section on instance status checks (system reachability, instance reachability) and maybe Events, e.g. "Hey, 200 of your instances are going to be rebooted in 2 days!"
- What status checks are and what they mean
- What kinds of notifications happen and why, and how to plan for them
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html
AWS automatically runs two status checks against every instance: a system status check (system reachability) and an instance status check (instance reachability). The results appear in the AWS console on each instance's Status Checks tab. A system status check failure indicates a problem with the underlying host, its networking, or other infrastructure, and must be resolved by AWS. An instance status check failure indicates a problem inside the instance, such as a failing operating system or exhausted memory or disk space, among other causes -- these you can typically resolve yourself.
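The same information shown on the Status Checks tab is available from the API. Below is a minimal boto3 sketch (assuming credentials and a default region are already configured) that prints both check results for every instance:

```python
# List system and instance status check results for all instances.
import boto3

ec2 = boto3.client("ec2")

paginator = ec2.get_paginator("describe_instance_status")
for page in paginator.paginate(IncludeAllInstances=True):
    for status in page["InstanceStatuses"]:
        print(
            status["InstanceId"],
            "system:", status["SystemStatus"]["Status"],      # host/network side; AWS must fix
            "instance:", status["InstanceStatus"]["Status"],  # OS side; you can usually fix
        )
```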
In many cases, unexplained instance reachability check failures are a precursor to a hardware failure and should be addressed immediately. This becomes more common as a given hardware generation ages, since older hardware fails more frequently. Make sure you have good processes in place to recognize and act on the AWS notification emails, and set up CloudWatch monitoring of the status check metrics for critical systems.
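For the CloudWatch piece, here is a hedged boto3 sketch that alarms on the StatusCheckFailed_System metric for one critical instance; the region, account ID, instance ID, and the ops-alerts SNS topic are placeholders, and the built-in recover action is only supported for certain EBS-backed instance types:

```python
# Create a CloudWatch alarm on StatusCheckFailed_System for a critical instance,
# with the built-in EC2 recover action plus an SNS notification.
import boto3

REGION = "us-east-1"                  # placeholder
ACCOUNT_ID = "123456789012"           # placeholder
INSTANCE_ID = "i-0123456789abcdef0"   # placeholder

cloudwatch = boto3.client("cloudwatch", region_name=REGION)
cloudwatch.put_metric_alarm(
    AlarmName=f"system-status-check-{INSTANCE_ID}",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_System",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[
        # Built-in action that migrates the instance to healthy hardware.
        f"arn:aws:automate:{REGION}:ec2:recover",
        f"arn:aws:sns:{REGION}:{ACCOUNT_ID}:ops-alerts",  # placeholder SNS topic
    ],
)
```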
Tip/Trick: repeated system reachability failures can be a sign that it's time to migrate to a more modern instance generation.
Often, instance status check failures can be resolved with a reboot (the OS restarts on the same hardware) or a stop/start (the instance comes back up on different hardware, but with the same configuration). A reboot requested from the AWS console will eventually be forced as a hard reboot, even in cases where you are not able to log into the instance.
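A rough boto3 sketch of the two remediation paths (the instance ID is a placeholder; stop/start applies to EBS-backed instances only):

```python
import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"  # placeholder

# Option 1: reboot in place -- same hardware, the OS restarts.
ec2.reboot_instances(InstanceIds=[instance_id])

# Option 2: stop then start -- typically lands on different hardware.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
ec2.start_instances(InstanceIds=[instance_id])
```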
It is important to pay careful attention to emails sent by AWS, as there are several cases where planned events occur; these also appear in the Events tab of the AWS EC2 console. Occasional required maintenance of the underlying hardware is one such case, and you'll typically get advance notice. More recently, OS or Xen vulnerabilities have resulted in very short windows of perhaps several days, or less. The event will typically be a reboot, but may also be a stop/start.
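The Events tab data is also available from the API, so scheduled events can be surfaced in your own monitoring rather than relying on email alone. A minimal boto3 sketch:

```python
# Print any scheduled events (e.g. system-reboot, system-maintenance,
# instance-stop, instance-retirement) for all instances.
import boto3

ec2 = boto3.client("ec2")
paginator = ec2.get_paginator("describe_instance_status")
for page in paginator.paginate(IncludeAllInstances=True):
    for status in page["InstanceStatuses"]:
        for event in status.get("Events", []):
            print(
                status["InstanceId"],
                event["Code"],           # e.g. "system-reboot"
                event.get("NotBefore"),  # start of the maintenance window
                event["Description"],
            )
```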
Tip/Trick: because these events can occur on short notice, and sometimes with none at all, ensure that your systems will survive a reboot. A common error is a service that is not configured to start automatically at boot -- whenever you change configuration, test in advance that the instance comes back up cleanly.
Danger/Warning: instance store ("ephemeral") drives survive a reboot, but do not survive a stop/start; use EBS volumes, S3, or EFS mounts when practical to ensure that data is not lost.
Tip/Trick: AWS is built around the assumption that machines are replaceable; where possible, configure your instances to meet that assumption. Some systems, such as Solr, MongoDB, or Hadoop, tend to use instance types like hs1 and d2 because they offer terabytes of instance store that is cheaper and faster than EBS. Even so, always use EBS-backed instances so that the boot volume and configuration survive. If you must store data on instance store drives, make sure your system can detect, format, and mount them automatically at boot time if they are not present, as sketched below. While Hadoop and similar systems will re-replicate lost data, doing so is costly and increases risk.
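As a rough illustration of that boot-time step, the sketch below formats and mounts an ephemeral drive only if it has no filesystem yet; the device path, filesystem, and mount point are assumptions that vary by instance type and setup, and it must run as root:

```python
# Boot-time sketch for instance-store-heavy systems: create a filesystem on the
# ephemeral device only when none exists, then mount it.
import os
import subprocess

DEVICE = "/dev/xvdb"        # assumed ephemeral device; differs per instance type
MOUNT_POINT = "/mnt/data"   # assumed mount point

if os.path.exists(DEVICE):
    # blkid exits non-zero when the device has no recognizable filesystem.
    has_fs = subprocess.run(["blkid", DEVICE], capture_output=True).returncode == 0
    if not has_fs:
        subprocess.run(["mkfs.ext4", DEVICE], check=True)
    os.makedirs(MOUNT_POINT, exist_ok=True)
    subprocess.run(["mount", DEVICE, MOUNT_POINT], check=True)
```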
Warning/Danger: it is common for an entire family of machines to need a reboot, often due to security issues. Be prepared to respond quickly if an uncontrolled reboot by AWS would affect your system.