CloudWatch Alarms

CloudWatch Alarms watch a metric and trigger actions when the metric crosses a threshold for a specified number of evaluation periods.

Core Concepts

Alarm States

OK         → Metric is within threshold
INSUFFICIENT_DATA → Metric not available or not enough data
ALARM     → Metric breached threshold for N consecutive periods

Alarm Evaluation

Evaluation period: 1 minute (shortest possible)
Data points to alarm: 3 (consecutive breaches)
Period: 1 minute

Alarm triggers when:
  Minute 1: CPU > 80%   (1 of 3)
  Minute 2: CPU > 80%   (2 of 3)
  Minute 3: CPU > 80%   (3 of 3) → ALARM

Creating an Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name HighCPU \
  --alarm-description "Alert when CPU exceeds 80% for 3 consecutive minutes" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 60 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:my-alert-topic \
  --dimensions Name=InstanceId,Value=i-xxxxx

Alarm Actions

Action	Use
SNS Topic	Send notification (email, SMS, PagerDuty)
Auto Scaling	Scale ASG in/out
EC2 Action	Stop, terminate, or reboot EC2
Systems Manager OpsItem	Create OpsItem for runbook automation

aws cloudwatch put-metric-alarm \
  --alarm-name HighCPU-AutoScale \
  --alarm-description "Scale out ASG when CPU is high" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 60 \
  --threshold 70 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions \
    arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:abc123:autoScalingGroupName:my-asg:policyName:scale-out

Composite Alarms

Composite alarms evaluate multiple alarms together using boolean logic:

aws cloudwatch put-composite-alarm \
  --alarm-name ServiceDown \
  --alarm-rule "(ALARM HighCPU OR ALARM HighMemory) AND ALARM HighNetwork" \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:my-alert-topic

This reduces alarm noise — a page only fires when multiple conditions are met, not when each individual alarm fires separately.

Anomaly Detection

CloudWatch Anomaly Detection uses machine learning to establish a “normal” baseline and alert on deviations:

aws cloudwatch put-metric-alarm \
  --alarm-name HighLatency-Anomaly \
  --alarm-description "Alert when p99 latency is unusually high" \
  --metric-name TargetResponseTime \
  --namespace AWS/ApplicationELB \
  --statistic p99 \
  --period 300 \
  --threshold 2.5 \
  --comparison-operator GreaterThanUpperThreshold \
  --evaluation-periods 2 \
  --treat-missing-data notBreaching \
  --metrics '[{"Id":"m1","MetricStat":{"Metric":{"Namespace":"AWS/ApplicationELB","MetricName":"TargetResponseTime","Period":300,"Stat":"p99"},"ReturnData":false}}]' \
  --enable-metric-math

Alarm Configuration Options

Missing Data Treatment

Option	Behavior
`notBreaching` (default)	Missing data treated as “good” — no alarm
`breaching`	Missing data treated as “breached” — alarm
`ignore`	Missing data doesn’t affect alarm state
`missing`	Alarm stays in current state

Evaluation Periods and Period Length

Scenario	Period	Evaluation Periods
Real-time (1-min detection)	60	3
Standard (5-min detection)	300	2
Cost-optimized (15-min detection)	900	2

Alarm Best Practices

□ Use composite alarms to reduce noise — combine related conditions
□ Set alarm actions to send to SNS (email/SMS) and OpsItem (runbook)
□ Use GetMetricData for alarms on multiple metrics (cheaper than multiple alarms)
□ Anomaly detection for metrics with seasonal variation (e.g., traffic peaks)
□ Set treat-missing-data appropriately — don't page for missing data from a dev instance
□ Use 1-minute alarms for critical services, 5-minute for non-critical
□ Always test alarms by manually triggering them (aws cloudwatch set-alarm-state)

AWS Service Alarm Patterns

EC2 Instance Alarms

# CPU
aws cloudwatch put-metric-alarm --alarm-name HighCPU --metric-name CPUUtilization \
  --namespace AWS/EC2 --statistic Average --period 60 --threshold 80 \
  --evaluation-periods 3 --comparison-operator GreaterThanThreshold
 
# Status check
aws cloudwatch put-metric-alarm --alarm-name InstanceStatus \
  --namespace AWS/EC2 --metric-name StatusCheckFailed \
  --statistic Maximum --period 60 --threshold 1 \
  --evaluation-periods 1 --comparison-operator GreaterThanThreshold

ALB Alarm

# Unhealthy hosts
aws cloudwatch put-metric-alarm --alarm-name UnhealthyHosts \
  --namespace AWS/ApplicationELB --metric-name UnHealthyHostCount \
  --statistic Maximum --period 60 --threshold 1 \
  --evaluation-periods 2 --comparison-operator GreaterThanThreshold

Limits

Resource	Limit
Alarms per region	10,000 (can request increase)
Alarm actions per alarm	5
Composite alarm depth	5 nested alarms
Metrics per alarm	1 (use metric math for multiple)

References

Homepage: https://aws.amazon.com/cloudwatch/
Documentation: https://docs.aws.amazon.com/cloudwatch/
Pricing: https://aws.amazon.com/cloudwatch/pricing/

Pricing Examples

Scenario 1: A production system with 10 alarms (CPU, memory, disk, network per instance, 50 instances). 500 alarms total. At $0.10/ a l a r m / m o n t h =$ 50/month. Plus SNS notifications (negligible cost). Total: ~$50/month. Without alarms, a CPU spike goes unnoticed for hours, causing customer impact.

Scenario 2: Using anomaly detection on ALB request count. The service has daily and weekly seasonality — peak at 9am, low at 2am. A fixed threshold alarm would false-positive constantly. Anomaly detection learns the pattern and only fires when traffic is outside the learned range. Anomaly detection: $0.30/ a l a r m / m o n t h \times 5 a l a r m s =$ 1.50/month for more accurate alerting.

Nuggets & Gotchas

The minimum alarm evaluation period is 10 seconds — not 1 second: Even if you set --period 1, CloudWatch rounds it to 10 seconds. For sub-10-second alerting, you need a different approach (CloudWatch Contributor Insights for near-real-time, or a third-party monitoring tool).
Alarms go to INSUFFICIENT_DATA when the instance is stopped/terminated: A CPU alarm on an instance that gets stopped goes to INSUFFICIENT_DATA. If you have treat-missing-data: breaching, the alarm fires when instances are stopped (unwanted). Use treat-missing-data: notBreaching for instance-level alarms.
Composite alarms with OR conditions can still page too much: If you have ALARM HighCPU OR ALARM HighMemory, and HighCPU fires every hour during peak traffic, you’ll get a page every hour. Consider using a composite with AND conditions or a time-based suppression (CloudWatch Events rule that mutes the alarm for a period after it fires).
Alarm actions are not retried if SNS fails: If the SNS topic is misconfigured or rate-limited, the alarm action silently fails. There’s no built-in retry. Use dead-letter queues or Lambda-based fan-out for critical notifications.
You cannot create an alarm on a metric that doesn’t exist yet: CloudWatch only creates metrics when data is first put. If you create an alarm on a metric that your application hasn’t emitted yet, the alarm goes to INSUFFICIENT_DATA and stays there until the metric starts emitting.

cloudnative wiki

Explorer

CloudWatch Alarms

CloudWatch Alarms

Core Concepts

Alarm States

Alarm Evaluation

Creating an Alarm

Alarm Actions

Composite Alarms

Anomaly Detection

Alarm Configuration Options

Missing Data Treatment

Evaluation Periods and Period Length

Alarm Best Practices

AWS Service Alarm Patterns

EC2 Instance Alarms

ALB Alarm

Limits

References

Pricing Examples

Nuggets & Gotchas

Graph View

Table of Contents