📊 Monitoring & Observability
Learning Objectives
- Monitor infrastructure with CloudWatch metrics, logs, and alarms
- Audit API activity with CloudTrail
- Trace requests across services with X-Ray
- Query and analyze logs with CloudWatch Logs Insights
1. Amazon CloudWatch
1.1 CloudWatch Metrics
CloudWatch monitors AWS resources and applications with metrics — time-series data points.
Built-in Metrics (AWS Services):
- EC2: CPUUtilization, NetworkIn, NetworkOut, StatusCheckFailed
- RDS: DatabaseConnections, ReadLatency, WriteLatency
- ALB: RequestCount, TargetResponseTime, HealthyHostCount
- Lambda: Invocations, Duration, Errors, Throttles
- S3: BucketSizeBytes, NumberOfObjects
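To make "time-series data points" concrete, the sketch below groups raw samples into fixed periods and computes the kind of summary values a metric statistic reports (SampleCount, Sum, Minimum, Maximum, Average). It is a plain-Python illustration and not CloudWatch's implementation; `summarize_by_period` and the sample data are hypothetical.

```python
from collections import defaultdict


def summarize_by_period(datapoints, period_seconds):
    """Group (epoch_seconds, value) samples into fixed-width periods.

    Returns {period_start: {"SampleCount", "Sum", "Minimum", "Maximum", "Average"}}.
    """
    buckets = defaultdict(list)
    for timestamp, value in datapoints:
        # Align each sample to the start of the period it falls in.
        period_start = timestamp - (timestamp % period_seconds)
        buckets[period_start].append(value)

    summary = {}
    for period_start in sorted(buckets):
        values = buckets[period_start]
        summary[period_start] = {
            "SampleCount": len(values),
            "Sum": sum(values),
            "Minimum": min(values),
            "Maximum": max(values),
            "Average": sum(values) / len(values),
        }
    return summary


# Hypothetical CPU samples: (seconds since start, percent utilization).
samples = [(0, 40.0), (60, 50.0), (120, 60.0), (300, 90.0), (360, 70.0)]
stats = summarize_by_period(samples, period_seconds=300)
for start, values in stats.items():
    print(start, values)
```

With a 300-second period, the first three samples fall into one bucket and the last two into the next, which is why the choice of period changes what an "Average" means.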
Custom Metrics:
```bash
# Publish custom metrics (memory usage, disk space, etc.)
aws cloudwatch put-metric-data \
  --namespace "Custom/AppMetrics" \
  --metric-data '[
    {"MetricName": "ActiveUsers", "Value": 542, "Unit": "Count"},
    {"MetricName": "ResponseTime", "StatisticValues": {"SampleCount": 100, "Sum": 2500, "Minimum": 10, "Maximum": 85}}
  ]'
```

1.2 CloudWatch Alarms
Alarm States: OK | ALARM | INSUFFICIENT_DATA
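The three states above can be illustrated with a small sketch of how a greater-than-threshold alarm might map recent periods to a state. This is a deliberate simplification: the real service also supports "M out of N" datapoints and configurable handling of missing data. The function name `evaluate_alarm` and the sample values are hypothetical.

```python
def evaluate_alarm(period_values, threshold, evaluation_periods):
    """Return "OK", "ALARM", or "INSUFFICIENT_DATA" for a greater-than alarm.

    period_values holds one aggregated value per period, oldest first;
    None marks a period with no data.
    """
    recent = period_values[-evaluation_periods:]
    # Too few periods, or a gap in the window: the state cannot be determined.
    if len(recent) < evaluation_periods or any(v is None for v in recent):
        return "INSUFFICIENT_DATA"
    # Every period in the window must breach the threshold to raise the alarm.
    if all(v > threshold for v in recent):
        return "ALARM"
    return "OK"


# Average CPU per 5-minute period (hypothetical values).
print(evaluate_alarm([70, 85, 90], threshold=80, evaluation_periods=2))
print(evaluate_alarm([85, 90, 75], threshold=80, evaluation_periods=2))
print(evaluate_alarm([85, None], threshold=80, evaluation_periods=2))
```

Requiring every period in the window to breach is what keeps a single short spike from paging anyone.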
Metric → Alarm → SNS Topic → Email, SMS, Lambda, Auto Scaling

```bash
# Create alarm (2 evaluation periods x 300 s = 10 minutes above threshold)
aws cloudwatch put-metric-alarm \
  --alarm-name "high-cpu-alarm" \
  --alarm-description "EC2 CPU > 80% for 10 minutes" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:...:ops-team \
  --dimensions Name=InstanceId,Value=i-abc123
```

Composite Alarms combine multiple alarms with AND/OR logic:
```bash
# Assumes child alarms named "high-cpu" and "high-memory" already exist
aws cloudwatch put-composite-alarm \
  --alarm-name "high-cpu-or-memory" \
  --alarm-rule "ALARM(high-cpu) OR ALARM(high-memory)" \
  --alarm-actions arn:aws:sns:...:ops-team
```

1.3 CloudWatch Logs
Log Groups → Log Streams → Log Events
```bash
# Create log group
aws logs create-log-group --log-group-name /app/prod/web

# Set retention
aws logs put-retention-policy \
  --log-group-name /app/prod/web \
  --retention-in-days 90

# Subscribe log group to Lambda for real-time processing
# (the function must already allow logs.amazonaws.com to invoke it, via aws lambda add-permission)
aws logs put-subscription-filter \
  --log-group-name /app/prod/web \
  --filter-name error-filter \
  --filter-pattern "ERROR" \
  --destination-arn arn:aws:lambda:us-east-1:...:function:error-handler
```

CloudWatch Logs Insights lets you query logs with SQL-like syntax:
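```
fields @timestamp, @message
| filter @message like /ERROR|CRITICAL/
| stats count(*) as errorCount by @logStream
| sort errorCount desc
| limit 20
```

After `stats ... by @logStream`, only the grouped field and the aggregate remain, so the results are sorted by `errorCount` rather than `@timestamp`.

1.4 CloudWatch Dashboards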
Create custom dashboards with metrics from multiple services:
```bash
aws cloudwatch put-dashboard \
  --dashboard-name "Production-Overview" \
  --dashboard-body '{
    "widgets": [{
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/EC2", "CPUUtilization", {"stat": "Average"}],
          ["AWS/RDS", "DatabaseConnections", {"stat": "Sum"}]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1",
        "title": "Production Metrics Overview"
      }
    }]
  }'
```

1.5 CloudWatch Container Insights
Collect, aggregate, and summarize metrics from ECS, EKS, and Kubernetes:
- CPU, memory, network, disk metrics
- Performance log patterns
- Service-level dashboards
2. AWS CloudTrail
Purpose: Audit all API calls made in your AWS account.
| Feature | Description |
|---|---|
| Management Events | Control-plane operations on AWS resources (e.g., S3 CreateBucket, EC2 RunInstances) |
| Data Events | S3 object-level, Lambda function invocation |
| Insights Events | Unusual API activity detected by ML |
| Multi-Region Trail | Logs all regions to a single S3 bucket |
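The event types in the table can be told apart when processing delivered log files. The sketch below assumes each record carries an `eventCategory` field (present in newer record versions) and uses made-up records; `summarize` is a hypothetical helper, and real records contain many more fields.

```python
import json
from collections import Counter

# Made-up records for illustration only.
SAMPLE_LOG = """
{"Records": [
  {"eventCategory": "Management", "eventSource": "ec2.amazonaws.com", "eventName": "RunInstances",
   "userIdentity": {"arn": "arn:aws:iam::123456789012:user/admin"}},
  {"eventCategory": "Data", "eventSource": "s3.amazonaws.com", "eventName": "GetObject",
   "userIdentity": {"arn": "arn:aws:iam::123456789012:user/app"}},
  {"eventCategory": "Management", "eventSource": "s3.amazonaws.com", "eventName": "DeleteBucket",
   "userIdentity": {"arn": "arn:aws:iam::123456789012:user/admin"}}
]}
"""


def summarize(log_text):
    """Count records per category and list (principal, action) pairs."""
    records = json.loads(log_text)["Records"]
    # Records without the field are counted as "Unknown" rather than dropped.
    by_category = Counter(r.get("eventCategory", "Unknown") for r in records)
    actions = [
        (r["userIdentity"]["arn"], f'{r["eventSource"]}:{r["eventName"]}')
        for r in records
    ]
    return by_category, actions


by_category, actions = summarize(SAMPLE_LOG)
print(dict(by_category))
for principal, action in actions:
    print(principal, "->", action)
```

The (principal, action) pairs are the "who did what" view that makes CloudTrail useful for audits.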
```bash
# Create multi-region trail
aws cloudtrail create-trail \
  --name "organization-trail" \
  --s3-bucket-name my-company-cloudtrail-logs \
  --is-multi-region-trail \
  --is-organization-trail \
  --enable-log-file-validation

# Start logging
aws cloudtrail start-logging --name "organization-trail"
```

CloudTrail Log Example:
```json
{
  "eventVersion": "1.08",
  "userIdentity": {
    "type": "IAMUser",
    "arn": "arn:aws:iam::123456789012:user/admin",
    "accountId": "123456789012"
  },
  "eventTime": "2024-01-15T14:30:00Z",
  "eventSource": "ec2.amazonaws.com",
  "eventName": "RunInstances",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.42",
  "userAgent": "console.amazonaws.com",
  "requestParameters": {"instanceType": "t3.medium", "imageId": "ami-0abcdef1234567890"},
  "responseElements": {
    "instancesSet": {"items": [{"instanceId": "i-abc123"}]}
  }
}
```

3. AWS X-Ray
Troubleshoot performance and errors by tracing requests across services:
```text
User Request → API Gateway → Lambda → DynamoDB
                    │           │         │
                    └───────────┴─────────┘
                           X-Ray Trace
```

Key Concepts:
- Trace — End-to-end path of a request
- Segment — Work done by a single service
- Subsegment — Work done within a service (e.g., DB query)
- Service Map — Visual representation of all services
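The relationship between these concepts can be sketched as a small data model. The classes below are hypothetical and far simpler than real X-Ray segment documents; they only show how a trace contains segments, how a segment contains subsegments, and how a service map can be derived from call order.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Subsegment:
    name: str
    duration_ms: float


@dataclass
class Segment:
    service: str
    duration_ms: float
    subsegments: List[Subsegment] = field(default_factory=list)


@dataclass
class Trace:
    trace_id: str
    segments: List[Segment] = field(default_factory=list)

    def slowest_segment(self) -> Segment:
        """Segment with the largest total duration."""
        return max(self.segments, key=lambda s: s.duration_ms)

    def service_edges(self) -> List[Tuple[str, str]]:
        """Consecutive services in call order: a rough stand-in for a service map."""
        names = [s.service for s in self.segments]
        return list(zip(names, names[1:]))


trace = Trace(
    trace_id="example-trace-1",  # hypothetical ID, not the real X-Ray ID format
    segments=[
        Segment("API Gateway", 120.0),
        Segment("Lambda", 95.0, [Subsegment("DynamoDB PutItem", 40.0)]),
    ],
)
print(trace.slowest_segment().service)
print(trace.service_edges())
```

Here the DynamoDB call is modeled as a subsegment of the Lambda segment, matching the "work done within a service" definition above.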
```bash
# Enable X-Ray on Lambda
aws lambda update-function-configuration \
  --function-name process-order \
  --tracing-config Mode=Active

# X-Ray sampling rules
# (ResourceARN and Version are required fields; "*" and 1 are used here)
aws xray put-sampling-rule \
  --sampling-rule '{
    "RuleName": "production-sampling",
    "ResourceARN": "*",
    "Priority": 1000,
    "ReservoirSize": 10,
    "FixedRate": 0.1,
    "Host": "*", "HTTPMethod": "*", "URLPath": "*",
    "ServiceName": "*", "ServiceType": "*",
    "Version": 1
  }'
```

4. AWS Config
Track resource configuration changes:
```bash
# Enable Config recorder
aws configservice put-configuration-recorder \
  --configuration-recorder name=default,roleARN=arn:aws:iam::...:role/aws-config-role

# Enable delivery channel
aws configservice put-delivery-channel \
  --delivery-channel name=default,s3BucketName=my-config-bucket

# Start recording
aws configservice start-configuration-recorder --configuration-recorder-name default
```

5. Monitoring Comparison
| Service | Purpose | Data Source |
|---|---|---|
| CloudWatch | Performance metrics, logs, alarms | AWS services, custom apps |
| CloudTrail | API audit trail | AWS API calls |
| X-Ray | Request tracing, performance | Application traces |
| Config | Resource configuration changes | AWS resource states |
| VPC Flow Logs | Network traffic logs | VPC network interfaces |
6. Real-World Use Cases
Use Case 1: Full Observability Stack for a Microservices Application
Scenario: A company runs a microservices app with 20+ services on ECS Fargate. They need to debug slow API responses, trace requests across services, and get alerted on anomalies.
Solution: CloudWatch + X-Ray + Synthetics
```mermaid
graph TD
    subgraph App["Application Layer"]
        ALB["ALB"]:::aws
        SVC1["User Service"]:::app
        SVC2["Order Service"]:::app
        SVC3["Payment Service"]:::app
        RDS["RDS"]:::aws
        SQS["SQS"]:::aws
    end
    subgraph Observability["Observability Layer"]
        CW["CloudWatch\nMetrics + Logs + Alarms"]:::cw
        XRAY["X-Ray\nDistributed Tracing"]:::xray
        SYNTH["CloudWatch Synthetics\nCanary Monitoring"]:::cw
        DASH["CloudWatch Dashboard\nService Overview"]:::cw
    end
    subgraph Alerting["Alerting Layer"]
        SNS["SNS Topic"]:::aws
        PAGER["PagerDuty\nOpsGenie"]:::tool
        EMAIL["Email + Slack"]:::tool
    end
    ALB --> SVC1
    SVC1 --> SVC2
    SVC2 --> SVC3
    SVC1 --> RDS
    SVC2 --> SQS
    ALB -.->|Metrics| CW
    SVC1 -.->|Trace| XRAY
    SVC2 -.->|Trace| XRAY
    SVC3 -.->|Trace| XRAY
    XRAY -.->|Service Map| DASH
    CW -.->|Alarm| SNS
    SYNTH -.->|Synthetic checks| CW
    SNS --> PAGER
    SNS --> EMAIL
    classDef aws fill:#ff9900,color:#fff
    classDef app fill:#232f3e,color:#fff
    classDef cw fill:#527fff,color:#fff
    classDef xray fill:#00a4c7,color:#fff
    classDef tool fill:#666,color:#fff
```

Implementation steps:
```bash
# Step 1: Enable X-Ray tracing on ECS service
# Add to task definition:
#   "environment": [{"name": "AWS_XRAY_DAEMON_ADDRESS", "value": "xray-daemon:2000"}]

# Step 2: Create CloudWatch dashboard for service overview
# Container Insights publishes CpuUtilized and MemoryUtilized; metric widgets need a region.
# Add the dimensions for your own resources (for example LoadBalancer, ClusterName) to each metric.
aws cloudwatch put-dashboard \
  --dashboard-name "Microservices-Overview" \
  --dashboard-body '{
    "widgets": [{
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/ApplicationELB", "TargetResponseTime", {"stat": "p95"}],
          ["ECS/ContainerInsights", "CpuUtilized", {"stat": "Average"}],
          ["ECS/ContainerInsights", "MemoryUtilized", {"stat": "Average"}]
        ],
        "period": 300,
        "region": "us-east-1",
        "title": "Service Health Overview"
      }
    }]
  }'

# Step 3: Create composite alarm for service health
aws cloudwatch put-composite-alarm \
  --alarm-name "order-service-unhealthy" \
  --alarm-rule "ALARM(order-svc-high-latency) OR ALARM(order-svc-high-error)" \
  --alarm-actions arn:aws:sns:...:ops-team
```

Use Case 2: Centralized Logging for Multi-Account Environment
Scenario: A company has 10 AWS accounts (dev, staging, prod, security, etc.). They need a centralized logging solution where the security team can search across all accounts.
Solution: Centralized Logging with Kinesis + OpenSearch
```mermaid
graph TD
    subgraph Accounts["Source Accounts"]
        PROD["Production\nAccount"]
        STAGING["Staging\nAccount"]
        DEV["Development\nAccount"]
        SEC["Security\nAccount"]
    end
    subgraph Ingestion["Central Logging Account"]
        CF["CloudFront Logs"]
        CT["CloudTrail\n(Organization Trail)"]
        CW_LOG["CloudWatch Logs\nCross-account Subscription\nFilter"]
        KDF["Kinesis Data\nFirehose"]
        LAMBDA["Lambda\nTransform/Enrich"]
    end
    subgraph Storage["Storage & Query"]
        S3_RAW["S3 - Raw Logs\n(Parquet)"]
        OS["OpenSearch Service\nInteractive Search"]
        ATHENA["Athena\nAd-hoc SQL Queries"]
    end
    subgraph Visualization["Visualization"]
        GRAFANA["Grafana\nDashboards"]
        QUICKSIGHT["QuickSight\nSecurity Reports"]
    end
    PROD --> CW_LOG
    STAGING --> CW_LOG
    DEV --> CW_LOG
    SEC --> CT
    CT --> KDF
    CW_LOG --> KDF
    KDF --> LAMBDA
    LAMBDA --> S3_RAW
    LAMBDA --> OS
    S3_RAW --> ATHENA
    OS --> GRAFANA
    ATHENA --> QUICKSIGHT
```

Cross-account log subscription setup:
```bash
# In source account: create subscription filter pointing to central account
aws logs put-subscription-filter \
  --log-group-name "/aws/lambda/prod-app" \
  --filter-name "central-logging" \
  --filter-pattern "" \
  --destination-arn "arn:aws:logs:us-east-1:CENTRAL_ACCOUNT:destination:all-logs"

# Organization CloudTrail (single trail for all accounts)
aws cloudtrail create-trail \
  --name "org-trail" \
  --s3-bucket-name "central-account-logs-bucket" \
  --is-organization-trail \
  --is-multi-region-trail
```

Use Case 3: Cost Anomaly Detection with CloudWatch
Scenario: The finance team wants to detect unusual spikes in AWS spending before they become a surprise bill.
Solution: CloudWatch Anomaly Detection + Budget Alarms
```mermaid
graph LR
    subgraph Training["ML Model Training"]
        CW_METRICS["CloudWatch Metrics\nDaily spend data (90 days)"]
        ANOMALY["CloudWatch Anomaly Detection\nBand: Expected +/- 2 std dev"]
    end
    subgraph Monitoring["Real-time Monitoring"]
        CURRENT["Current day spend metric"]
        DETECT["Detect deviation\noutside expected band"]
    end
    subgraph Alert["Alerting"]
        ALARM["CloudWatch Alarm\nANOMALY_DETECTION_BAND"]
        SNS["SNS → Email/Slack"]
        LAMBDA["Lambda\nAuto-suspend resources"]
    end
    CW_METRICS --> ANOMALY
    CURRENT --> DETECT
    ANOMALY --> DETECT
    DETECT --> ALARM
    ALARM --> SNS
    ALARM --> LAMBDA
```

```bash
# Create anomaly detection alarm for estimated charges
# Billing metrics live in us-east-1 and carry a Currency dimension.
# With --metrics, the metric is defined inside the JSON (no --metric-name/--namespace/--statistic/--period),
# and a comparison operator is required.
aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name "cost-anomaly-alarm" \
  --alarm-description "Alert if estimated charges rise above the expected range" \
  --comparison-operator GreaterThanUpperThreshold \
  --evaluation-periods 1 \
  --threshold-metric-id "ad1" \
  --metrics '[
    {
      "Id": "m1",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Billing",
          "MetricName": "EstimatedCharges",
          "Dimensions": [{"Name": "Currency", "Value": "USD"}]
        },
        "Period": 86400,
        "Stat": "Maximum"
      },
      "ReturnData": true
    },
    {"Id": "ad1", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)", "Label": "Expected spend (2 std dev)", "ReturnData": true}
  ]'
```

Takeaway: CloudWatch Anomaly Detection uses ML to automatically establish a baseline and alert on deviations. Combined with AWS Budgets (forecasted alerts at 80% and 100%), you get proactive cost control.
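The idea of an "expected band" can be illustrated with a fixed-window calculation: mean plus or minus a number of standard deviations. This is only a rough stand-in, since the managed feature trains a model that also accounts for trend and seasonality. The helper names and the spend figures below are hypothetical.

```python
from statistics import mean, pstdev


def expected_band(history, width=2.0):
    """Return (lower, upper) as mean +/- width * population standard deviation."""
    center = mean(history)
    spread = pstdev(history)
    return center - width * spread, center + width * spread


def is_anomalous(value, history, width=2.0):
    """True when value falls outside the band derived from history."""
    lower, upper = expected_band(history, width)
    return value < lower or value > upper


daily_spend = [100.0, 104.0, 96.0, 102.0, 98.0]  # hypothetical daily spend in USD
lower, upper = expected_band(daily_spend)
print(f"expected band: {lower:.2f} to {upper:.2f}")
print(is_anomalous(130.0, daily_spend))
print(is_anomalous(101.0, daily_spend))
```

A wider band (larger `width`) trades fewer false alarms for slower detection, which mirrors the second argument of `ANOMALY_DETECTION_BAND`.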
✅ Chapter Quiz
1. Which CloudWatch alarm state means insufficient data is available?
   - A) OK
   - B) ALARM
   - C) INSUFFICIENT_DATA
   - D) UNKNOWN
2. Which service records who deleted an S3 bucket?
   - A) CloudWatch
   - B) CloudTrail
   - C) Config
   - D) X-Ray
3. What is the purpose of X-Ray service maps?
   - A) Track resource configuration changes
   - B) Visual representation of service dependencies
   - C) Monitor EC2 CPU usage
   - D) Store log files
4. Which CloudWatch feature allows SQL-like querying of logs?
   - A) Log Groups
   - B) Log Streams
   - C) Logs Insights
   - D) Metric Filters
5. How long can you retain CloudWatch Logs?
   - A) 30 days
   - B) 90 days
   - C) 365 days
   - D) Indefinitely (the default), or for a configured period between 1 day and 10 years
6. Which CloudWatch feature allows you to run code to simulate user behavior and monitor application health?
   - A) CloudWatch Logs
   - B) CloudWatch Synthetics
   - C) CloudWatch Metrics
   - D) CloudWatch Alarms
7. What is the purpose of an X-Ray segment?
   - A) To store log data
   - B) To represent the work done by a single service in a trace
   - C) To monitor EC2 CPU usage
   - D) To create dashboards
8. Which AWS service provides a centralized view of resource configuration compliance?
   - A) CloudTrail
   - B) Config
   - C) CloudWatch
   - D) Service Catalog
9. A company wants to collect operating system-level metrics from EC2 instances, including memory usage and disk space. How should they collect this data?
   - A) CloudWatch default metrics
   - B) CloudWatch agent
   - C) CloudTrail
   - D) VPC Flow Logs
10. What does a CloudWatch composite alarm allow you to do?
    - A) Create alarms across multiple regions
    - B) Combine multiple alarms using AND/OR logic
    - C) Monitor billing metrics
    - D) Create dashboards
11. Which X-Ray feature provides a visual representation of service dependencies?
    - A) Traces
    - B) Service maps
    - C) Segments
    - D) Sampling rules
12. A company needs to monitor network traffic to and from EC2 instances for security analysis. Which service should they use?
    - A) CloudWatch
    - B) CloudTrail
    - C) VPC Flow Logs
    - D) Config
13. What is the default retention period for CloudTrail management events?
    - A) 30 days
    - B) 90 days
    - C) 365 days
    - D) 7 years
14. Which service would you use to get a unified view of health and performance across all AWS accounts in an organization?
    - A) CloudWatch Dashboard
    - B) CloudWatch cross-account observability
    - C) AWS Organizations
    - D) Service Catalog
15. What is the purpose of a CloudWatch Logs subscription filter?
    - A) To filter logs in the console
    - B) To forward log entries matching a pattern to another service in real time
    - C) To set retention policies
    - D) To export logs to S3
16. A company wants to trace a user request as it travels through an API Gateway, Lambda function, and DynamoDB. Which service provides this end-to-end tracing?
    - A) CloudWatch
    - B) CloudTrail
    - C) X-Ray
    - D) Config
17. What is the difference between a CloudWatch metric and a CloudWatch log?
    - A) Metrics are time-series data, logs are event records with timestamps
    - B) Logs are numerical, metrics are textual
    - C) Metrics are free, logs are always paid
    - D) There is no difference
18. Which CloudWatch Logs feature allows you to run SQL-like queries across multiple log groups?
    - A) Metric filters
    - B) Logs Insights
    - C) Subscription filters
    - D) Contributor Insights
19. A company needs to detect and investigate unusual API activity in their AWS account. Which CloudTrail feature should they enable?
    - A) Management events
    - B) Data events
    - C) Insights
    - D) Multi-region trail
20. What is the purpose of X-Ray sampling rules?
    - A) To reduce the number of traces recorded to control costs
    - B) To sample logs for analysis
    - C) To select which API calls to record
    - D) To filter CloudWatch metrics
21. Which AWS service provides automated vulnerability management for EC2 instances and container images?
    - A) GuardDuty
    - B) Inspector
    - C) Security Hub
    - D) Config
22. A company wants to be notified if an EC2 instance’s CPU utilization exceeds 90% for 5 consecutive minutes. What should they create?
    - A) CloudWatch Logs
    - B) CloudWatch Alarm
    - C) CloudTrail
    - D) Config rule
23. What is the purpose of CloudWatch Container Insights?
    - A) To monitor ECS and EKS container metrics
    - B) To scan container images for vulnerabilities
    - C) To manage container deployments
    - D) To store container logs
24. Which AWS service provides a managed Elasticsearch-compatible search and analytics engine for log analysis?
    - A) CloudWatch Logs
    - B) OpenSearch Service
    - C) Athena
    - D) Kinesis Data Analytics
25. A company wants to create a dashboard showing real-time operational metrics from multiple AWS accounts in a single view. Which CloudWatch feature supports this?
    - A) CloudWatch cross-account dashboards
    - B) CloudWatch Logs Insights
    - C) CloudWatch Synthetics
    - D) CloudWatch Contributor Insights
📝 Answer Key
1. C: INSUFFICIENT_DATA means there’s not enough data to determine state.
2. B: CloudTrail records all API calls including S3 DeleteBucket.
3. B: X-Ray service maps show service dependencies and performance.
4. C: CloudWatch Logs Insights uses a SQL-like query language.
5. D: Log events never expire by default; a retention policy can limit this to a period between 1 day and 10 years.
6. B: CloudWatch Synthetics creates canaries that run configurable scripts to simulate user flows.
7. B: An X-Ray segment records the work done by a single service within a trace.
8. B: AWS Config tracks resource configuration changes and evaluates compliance continuously.
9. B: The CloudWatch agent collects OS-level metrics like memory and disk usage from EC2 instances.
10. B: Composite alarms combine multiple child alarms using AND/OR logic for complex alerting.
11. B: X-Ray service maps visualize service dependencies and performance bottlenecks.
12. C: VPC Flow Logs capture IP traffic information for network security analysis.
13. B: CloudTrail management events are retained for 90 days by default in the event history.
14. B: CloudWatch cross-account observability provides a unified monitoring view across accounts.
15. B: Subscription filters forward log entries matching a pattern to destinations in real time.
16. C: X-Ray traces end-to-end requests across microservices and AWS services.
17. A: CloudWatch Metrics are numerical time-series data; Logs contain event records.
18. B: CloudWatch Logs Insights uses a SQL-like query language to analyze log data.
19. C: CloudTrail Insights uses ML to detect unusual API activity and generate insights events.
20. A: Sampling rules define how much trace data to record, balancing detail with cost.
21. B: Amazon Inspector continuously scans EC2 instances and container images for vulnerabilities.
22. B: A CloudWatch Alarm monitors a metric and triggers an action when a threshold is breached.
23. A: Container Insights collects CPU, memory, network, and disk metrics from ECS and EKS.
24. B: OpenSearch Service provides a managed Elasticsearch-compatible engine for log analytics.
25. A: CloudWatch cross-account dashboards aggregate metrics from multiple accounts in one view.