📊 Monitoring & Observability

Learning Objectives

Monitor infrastructure with CloudWatch metrics, logs, and alarms
Audit API activity with CloudTrail
Trace requests across services with X-Ray
Centralize and query logs with CloudWatch Logs and Logs Insights

1. Amazon CloudWatch

1.1 CloudWatch Metrics

CloudWatch monitors AWS resources and applications with metrics — time-series data points.

Built-in Metrics (AWS Services):

EC2: CPUUtilization, NetworkIn, NetworkOut, StatusCheckFailed
RDS: DatabaseConnections, ReadLatency, WriteLatency
ALB: RequestCount, TargetResponseTime, HealthyHostCount
Lambda: Invocations, Duration, Errors, Throttles
S3: BucketSizeBytes, NumberOfObjects

Custom Metrics:

# Put custom metric (memory usage, disk space, etc.)
aws cloudwatch put-metric-data \
  --namespace "Custom/AppMetrics" \
  --metric-data '[
    {"MetricName": "ActiveUsers", "Value": 542, "Unit": "Count"},
    {"MetricName": "ResponseTime", "StatisticValues": { "SampleCount": 100, "Sum": 2500, "Minimum": 10, "Maximum": 85 }}
  ]'
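
The `StatisticValues` entry above sends a pre-aggregated summary instead of individual data points. As a rough illustration only, the following sketch shows how such a summary could be built from raw samples and how an average falls out of it. The function names and sample values are hypothetical and not part of any AWS SDK.

```python
def to_statistic_set(samples):
    """Summarize raw samples into the four fields of a StatisticValues entry."""
    if not samples:
        raise ValueError("at least one sample is required")
    return {
        "SampleCount": len(samples),
        "Sum": sum(samples),
        "Minimum": min(samples),
        "Maximum": max(samples),
    }


def average(stat_set):
    """An average can be derived from a statistic set as Sum / SampleCount."""
    return stat_set["Sum"] / stat_set["SampleCount"]


# Hypothetical response times in milliseconds
response_times_ms = [10, 20, 25, 85]
print(to_statistic_set(response_times_ms))

# The statistic set from the CLI example: 2500 / 100
print(average({"SampleCount": 100, "Sum": 2500, "Minimum": 10, "Maximum": 85}))
```

Sending one statistic set per interval rather than every sample keeps the number of API calls low, at the cost of losing percentile information.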

1.2 CloudWatch Alarms

Alarm States: OK | ALARM | INSUFFICIENT_DATA

Metric → Alarm → Action: SNS topic (email, SMS, Lambda), Auto Scaling policy, or EC2 action

# Create alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "high-cpu-alarm" \
  --alarm-description "EC2 average CPU > 80% for 2 consecutive 5-minute periods" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:...:ops-team \
  --dimensions Name=InstanceId,Value=i-abc123
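
With `--period 300` and `--evaluation-periods 2`, the alarm fires only after two consecutive 5-minute averages breach the threshold. The sketch below models that rule in a deliberately simplified form: it assumes a GreaterThanThreshold comparison and ignores options such as datapoints-to-alarm and missing-data treatment, so it is not a reproduction of CloudWatch's actual evaluation logic.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods):
    """Return a simplified alarm state for a GreaterThanThreshold alarm.

    datapoints: per-period statistic values, oldest first; None marks a
    period in which no data was reported.
    """
    window = datapoints[-evaluation_periods:]
    # Simplification: any gap in the window counts as insufficient data.
    if len(window) < evaluation_periods or any(v is None for v in window):
        return "INSUFFICIENT_DATA"
    if all(v > threshold for v in window):
        return "ALARM"
    return "OK"


# Hypothetical 5-minute CPU averages, oldest first
print(evaluate_alarm([70, 85, 90], threshold=80, evaluation_periods=2))
print(evaluate_alarm([85, 75], threshold=80, evaluation_periods=2))
print(evaluate_alarm([90], threshold=80, evaluation_periods=2))
```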

Composite Alarms — Combine multiple alarms with AND/OR logic:

# Child alarms must already exist; "high-memory" is assumed to have been created separately
aws cloudwatch put-composite-alarm \
  --alarm-name "high-cpu-or-memory" \
  --alarm-rule "ALARM(high-cpu-alarm) OR ALARM(high-memory)" \
  --alarm-actions arn:aws:sns:...:ops-team
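
To make the AND/OR logic concrete, here is a small local evaluator for rules of this shape. It is a sketch under stated assumptions: it handles only `ALARM()`, `OK()`, `INSUFFICIENT_DATA()`, `AND`, `OR`, `NOT`, and parentheses, and the alarm names and states are made up for illustration. CloudWatch evaluates composite rules itself; nothing like this is needed in practice.

```python
import re


def evaluate_rule(rule, states):
    """Evaluate a composite-alarm style rule against child alarm states.

    rule:   for example 'ALARM(a) OR ALARM(b)'
    states: mapping of child alarm name -> current state string
    """
    def state_check(match):
        wanted, name = match.group(1), match.group(2).strip().strip('"')
        return str(states[name] == wanted)

    # Replace each state function with a literal True/False.
    expr = re.sub(r"\b(ALARM|OK|INSUFFICIENT_DATA)\(([^()]+)\)", state_check, rule)
    # Map the rule's boolean operators onto Python's.
    expr = re.sub(r"\b(AND|OR|NOT)\b", lambda m: m.group(1).lower(), expr)
    # Only boolean literals, operators, and parentheses may remain.
    if not re.fullmatch(r"[\s()]*(?:(?:True|False|and|or|not)[\s()]*)+", expr):
        raise ValueError(f"unsupported rule: {rule!r}")
    return eval(expr, {"__builtins__": {}})


# Hypothetical child alarm states
states = {"high-cpu": "ALARM", "high-memory": "OK"}
print(evaluate_rule("ALARM(high-cpu) OR ALARM(high-memory)", states))
print(evaluate_rule("ALARM(high-cpu) AND ALARM(high-memory)", states))
```

An OR rule pages on either symptom, while an AND rule is a common way to cut noise by requiring two symptoms at once.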

1.3 CloudWatch Logs

Log Groups → Log Streams → Log Events

# Create log group
aws logs create-log-group --log-group-name /app/prod/web

# Set retention
aws logs put-retention-policy \
  --log-group-name /app/prod/web \
  --retention-in-days 90

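# Subscribe log group to Lambda for real-time processing
# (the function must first allow logs.amazonaws.com to invoke it, e.g. with `aws lambda add-permission`)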
aws logs put-subscription-filter \
  --log-group-name /app/prod/web \
  --filter-name error-filter \
  --filter-pattern "ERROR" \
  --destination-arn arn:aws:lambda:us-east-1:...:function:error-handler
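
A Lambda function subscribed this way receives each batch as a base64-encoded, gzip-compressed JSON document under `awslogs.data`. The handler below is a minimal sketch of decoding that payload. The handler body and the hand-built sample event are illustrative assumptions, not the actual `error-handler` function.

```python
import base64
import gzip
import json


def decode_subscription_event(event):
    """Decode the payload CloudWatch Logs sends to a subscribed Lambda function."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def handler(event, context):
    """Return the messages of all log events in the delivered batch."""
    payload = decode_subscription_event(event)
    messages = [e["message"] for e in payload["logEvents"]]
    print(f"{payload['logGroup']}: {len(messages)} matching event(s)")
    return messages


# Local demonstration with a hand-built sample batch (hypothetical values)
sample = {
    "messageType": "DATA_MESSAGE",
    "logGroup": "/app/prod/web",
    "logStream": "web-1",
    "logEvents": [{"id": "1", "timestamp": 1705329000000, "message": "ERROR timeout"}],
}
sample_event = {
    "awslogs": {"data": base64.b64encode(gzip.compress(json.dumps(sample).encode())).decode()}
}
print(handler(sample_event, None))
```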

CloudWatch Logs Insights: query logs with a purpose-built, pipe-delimited query language (often described as SQL-like):

fields @timestamp, @message
| filter @message like /ERROR|CRITICAL/
| stats count(*) as errorCount by @logStream
| sort errorCount desc
| limit 20
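
For readers new to the pipeline style, the same kind of filter, per-stream count, and top-N query can be mimicked locally. This is only a sketch over hypothetical in-memory events; the function name and event shape are assumptions, and real queries run inside CloudWatch.

```python
import re
from collections import Counter


def top_error_streams(events, pattern=r"ERROR|CRITICAL", limit=20):
    """Count matching messages per log stream, highest count first."""
    matcher = re.compile(pattern)
    counts = Counter(e["logStream"] for e in events if matcher.search(e["message"]))
    return counts.most_common(limit)


# Hypothetical log events
events = [
    {"logStream": "web-1", "message": "ERROR db timeout"},
    {"logStream": "web-2", "message": "INFO request ok"},
    {"logStream": "web-1", "message": "CRITICAL out of memory"},
    {"logStream": "web-2", "message": "ERROR upstream 502"},
    {"logStream": "web-1", "message": "ERROR db timeout"},
]
print(top_error_streams(events))  # [('web-1', 3), ('web-2', 1)]
```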

1.4 CloudWatch Dashboards

Create custom dashboards with metrics from multiple services:

aws cloudwatch put-dashboard \
  --dashboard-name "Production-Overview" \
  --dashboard-body '{
    "widgets": [{
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/EC2", "CPUUtilization", {"stat": "Average"}],
          ["AWS/RDS", "DatabaseConnections", {"stat": "Sum"}]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1",
        "title": "Production Metrics Overview"
      }
    }]
  }'

1.5 CloudWatch Container Insights

Collect, aggregate, and summarize metrics from ECS, EKS, and Kubernetes:

CPU, memory, network, disk metrics
Performance log patterns
Service-level dashboards

2. AWS CloudTrail

Purpose: Record and audit API activity in your AWS account, including who made each call, when, and from which IP address.

Feature	Description
Management Events	CRUD on AWS resources (S3 create, EC2 launch)
Data Events	S3 object-level, Lambda function invocation
Insights Events	Unusual API activity detected by ML
Multi-Region Trail	Logs all regions to a single S3 bucket

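# Create multi-region organization trail (run from the Organizations management account
# or a delegated administrator; the S3 bucket policy must allow CloudTrail to write)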
aws cloudtrail create-trail \
  --name "organization-trail" \
  --s3-bucket-name my-company-cloudtrail-logs \
  --is-multi-region-trail \
  --is-organization-trail \
  --enable-log-file-validation

# Start logging
aws cloudtrail start-logging --name "organization-trail"

CloudTrail Log Example:

{
  "eventVersion": "1.08",
  "userIdentity": {
    "type": "IAMUser",
    "arn": "arn:aws:iam::123456789012:user/admin",
    "accountId": "123456789012"
  },
  "eventTime": "2024-01-15T14:30:00Z",
  "eventSource": "ec2.amazonaws.com",
  "eventName": "RunInstances",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.42",
  "userAgent": "console.amazonaws.com",
  "requestParameters": {"instanceType": "t3.medium", "imageId": "ami-0abcdef1234567890"},
  "responseElements": {
    "instancesSet": {"items": [{"instanceId": "i-abc123"}]}
  }
}
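
When investigating an incident, the fields that usually matter first are who, what, when, and where. The sketch below pulls those out of a record shaped like the example above. The `summarize_event` helper and its output keys are hypothetical conveniences, and real log files wrap many such records in a `Records` array.

```python
import json


def summarize_event(record):
    """Reduce a CloudTrail record to a who/what/when/where summary."""
    identity = record.get("userIdentity", {})
    return {
        "who": identity.get("arn", identity.get("type", "unknown")),
        "what": f'{record["eventSource"]}:{record["eventName"]}',
        "when": record["eventTime"],
        "where": record["awsRegion"],
        "from_ip": record.get("sourceIPAddress"),
    }


# Trimmed copy of the example event
raw = """
{
  "userIdentity": {"type": "IAMUser", "arn": "arn:aws:iam::123456789012:user/admin"},
  "eventTime": "2024-01-15T14:30:00Z",
  "eventSource": "ec2.amazonaws.com",
  "eventName": "RunInstances",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.42"
}
"""
summary = summarize_event(json.loads(raw))
print(summary)
```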

3. AWS X-Ray

Troubleshoot performance and errors by tracing requests across services:

User Request → API Gateway → Lambda → DynamoDB
                  │             │         │
                  └─────────────┴─────────┘
                          X-Ray Trace

Key Concepts:

Trace — End-to-end path of a request
Segment — Work done by a single service
Subsegment — Work done within a service (e.g., DB query)
Service Map — Visual representation of all services
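
The relationship between these concepts can be pictured as a small data model. The classes below are an illustrative sketch only; the class names, field names, and timings are assumptions and do not mirror the X-Ray segment document format.

```python
from dataclasses import dataclass, field


@dataclass
class Subsegment:
    name: str
    start: float  # seconds since the request began
    end: float


@dataclass
class Segment:
    service: str
    start: float
    end: float
    subsegments: list = field(default_factory=list)


@dataclass
class Trace:
    trace_id: str
    segments: list = field(default_factory=list)

    def duration(self):
        """End-to-end time from the earliest start to the latest end."""
        return max(s.end for s in self.segments) - min(s.start for s in self.segments)

    def slowest_segment(self):
        """The segment (service) that spent the most time on the request."""
        return max(self.segments, key=lambda s: s.end - s.start)


# Hypothetical request: API Gateway calls Lambda, which queries DynamoDB
trace = Trace("example-trace", [
    Segment("API Gateway", 0.0, 0.5),
    Segment("Lambda", 0.0625, 0.4375, [Subsegment("DynamoDB query", 0.125, 0.375)]),
])
print(trace.duration(), trace.slowest_segment().service)
```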

# Enable X-Ray on Lambda
aws lambda update-function-configuration \
  --function-name process-order \
  --tracing-config Mode=Active

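# X-Ray sampling rule: trace the first 10 matching requests each second (reservoir),
# then 10% of the remaining requests (fixed rate)
aws xray put-sampling-rule \
  --sampling-rule '{"RuleName": "production-sampling",
    "ResourceARN": "*",
    "Priority": 1000,
    "ReservoirSize": 10,
    "FixedRate": 0.1,
    "Host": "*", "HTTPMethod": "*", "URLPath": "*", "ServiceName": "*", "ServiceType": "*",
    "Version": 1 }'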

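A sampling rule combines a reservoir (requests traced per second before sampling applies) with a fixed rate applied to everything beyond it. As a back-of-the-envelope sketch, the expected trace volume can be estimated as follows. The function name is hypothetical, and the calculation assumes steady traffic rather than modelling how the reservoir is shared across instances.

```python
def expected_traces_per_second(requests_per_second, reservoir_size, fixed_rate):
    """Estimate traces recorded per second under a reservoir + fixed-rate rule."""
    reservoir_hits = min(requests_per_second, reservoir_size)
    remainder = max(requests_per_second - reservoir_size, 0)
    return reservoir_hits + remainder * fixed_rate


# ReservoirSize 10 and FixedRate 0.1 at 110 requests/second:
# 10 from the reservoir plus 10% of the remaining 100
print(expected_traces_per_second(110, reservoir_size=10, fixed_rate=0.1))

# Below the reservoir size, every request is traced
print(expected_traces_per_second(5, reservoir_size=10, fixed_rate=0.1))
```

Raising the reservoir protects visibility for low-traffic services, while the fixed rate is the main lever on cost at high traffic.
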
4. AWS Config

Track resource configuration changes:

# Enable Config recorder
aws configservice put-configuration-recorder \
  --configuration-recorder name=default,roleARN=arn:aws:iam::...:role/aws-config-role

# Enable delivery channel
aws configservice put-delivery-channel \
  --delivery-channel name=default,s3BucketName=my-config-bucket

# Start recording
aws configservice start-configuration-recorder --configuration-recorder-name=default
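
Conceptually, tracking configuration changes means comparing a resource's state before and after a change. The sketch below shows that idea with two plain dictionaries. The attribute names are hypothetical and the output is not the AWS Config configuration item format.

```python
def diff_configuration(before, after):
    """Return the attributes whose values differ between two snapshots."""
    changes = {}
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = {"previous": old, "current": new}
    return changes


# Hypothetical snapshots of one EC2 instance
before = {"instanceType": "t3.medium", "securityGroups": ["sg-1"], "monitoring": "disabled"}
after = {"instanceType": "t3.large", "securityGroups": ["sg-1"], "tags": {"env": "prod"}}
for attribute, change in diff_configuration(before, after).items():
    print(attribute, change)
```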

5. Monitoring Comparison

Service	Purpose	Data Source
CloudWatch	Performance metrics, logs, alarms	AWS services, custom apps
CloudTrail	API audit trail	AWS API calls
X-Ray	Request tracing, performance	Application traces
Config	Resource configuration changes	AWS resource states
VPC Flow Logs	Network traffic logs	VPC network interfaces

6. Real-World Use Cases

Use Case 1: Full Observability Stack for a Microservices Application

Scenario: A company runs a microservices app with 20+ services on ECS Fargate. They need to debug slow API responses, trace requests across services, and get alerted on anomalies.

Solution: CloudWatch + X-Ray + Synthetics

graph TD
    subgraph App["Application Layer"]
        ALB["ALB"]:::aws
        SVC1["User Service"]:::app
        SVC2["Order Service"]:::app
        SVC3["Payment Service"]:::app
        RDS["RDS"]:::aws
        SQS["SQS"]:::aws
    end
    
    subgraph Observability["Observability Layer"]
        CW["CloudWatch\nMetrics + Logs + Alarms"]:::cw
        XRAY["X-Ray\nDistributed Tracing"]:::xray
        SYNTH["CloudWatch Synthetics\nCanary Monitoring"]:::cw
        DASH["CloudWatch Dashboard\nService Overview"]:::cw
    end
    
    subgraph Alerting["Alerting Layer"]
        SNS["SNS Topic"]:::aws
        PAGER["PagerDuty\nOpsGenie"]:::tool
        EMAIL["Email + Slack"]:::tool
    end
    
    ALB --> SVC1
    SVC1 --> SVC2
    SVC2 --> SVC3
    SVC1 --> RDS
    SVC2 --> SQS
    
    ALB -.->|Metrics| CW
    SVC1 -.->|Trace| XRAY
    SVC2 -.->|Trace| XRAY
    SVC3 -.->|Trace| XRAY
    XRAY -.->|Service Map| DASH
    CW -.->|Alarm| SNS
    SYNTH -.->|Synthetic checks| CW
    SNS --> PAGER
    SNS --> EMAIL
    
    classDef aws fill:#ff9900,color:#fff
    classDef app fill:#232f3e,color:#fff
    classDef cw fill:#527fff,color:#fff
    classDef xray fill:#00a4c7,color:#fff
    classDef tool fill:#666,color:#fff

Implementation steps:

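# Step 1: Enable X-Ray tracing on the ECS service
# Run the X-Ray daemon as a sidecar container in the task definition. On Fargate
# (awsvpc networking) containers in a task share localhost, so the SDK's default
# daemon address 127.0.0.1:2000 works. To set it explicitly:
# "environment": [{"name": "AWS_XRAY_DAEMON_ADDRESS", "value": "127.0.0.1:2000"}]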

# Step 2: Create CloudWatch dashboard for service overview
# (add dimensions such as LoadBalancer, ClusterName, and ServiceName to target specific resources)
aws cloudwatch put-dashboard --dashboard-name "Microservices-Overview" --dashboard-body '{
  "widgets": [{
    "type": "metric",
    "properties": {
      "metrics": [
        ["AWS/ApplicationELB", "TargetResponseTime", {"stat": "p95"}],
        ["ECS/ContainerInsights", "CpuUtilized", {"stat": "Average"}],
        ["ECS/ContainerInsights", "MemoryUtilized", {"stat": "Average"}]
      ],
      "period": 300,
      "region": "us-east-1",
      "title": "Service Health Overview"
    }
  }]
}'

# Step 3: Create composite alarm for service health
aws cloudwatch put-composite-alarm \
  --alarm-name "order-service-unhealthy" \
  --alarm-rule "ALARM(order-svc-high-latency) OR ALARM(order-svc-high-error)" \
  --alarm-actions arn:aws:sns:...:ops-team

Use Case 2: Centralized Logging for Multi-Account Environment

Scenario: A company has 10 AWS accounts (dev, staging, prod, security, etc.). They need a centralized logging solution where the security team can search across all accounts.

Solution: Centralized Logging with Kinesis + OpenSearch

graph TD
    subgraph Accounts["Source Accounts"]
        PROD["Production\nAccount"]
        STAGING["Staging\nAccount"]
        DEV["Development\nAccount"]
        SEC["Security\nAccount"]
    end
    
    subgraph Ingestion["Central Logging Account"]
        CF["CloudFront Logs"]
        CT["CloudTrail\n(Organization Trail)"]
        CW_LOG["CloudWatch Logs\nCross-account Subscription\nFilter"]
        KDF["Kinesis Data\nFirehose"]
        LAMBDA["Lambda\nTransform/Enrich"]
    end
    
    subgraph Storage["Storage & Query"]
        S3_RAW["S3 - Raw Logs\n(Parquet)"]
        OS["OpenSearch Service\nInteractive Search"]
        ATHENA["Athena\nAd-hoc SQL Queries"]
    end
    
    subgraph Visualization["Visualization"]
        GRAFANA["Grafana\nDashboards"]
        QUICKSIGHT["QuickSight\nSecurity Reports"]
    end
    
    PROD --> CW_LOG
    STAGING --> CW_LOG
    DEV --> CW_LOG
    SEC --> CT
    CT --> KDF
    CW_LOG --> KDF
    KDF --> LAMBDA
    LAMBDA --> S3_RAW
    LAMBDA --> OS
    S3_RAW --> ATHENA
    OS --> GRAFANA
    ATHENA --> QUICKSIGHT

Cross-account log subscription setup:

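# In source account: create subscription filter pointing to the central account's destination
# (the destination must already exist there, created with put-destination, and its access
# policy, set with put-destination-policy, must allow this source account)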
aws logs put-subscription-filter \
  --log-group-name "/aws/lambda/prod-app" \
  --filter-name "central-logging" \
  --filter-pattern "" \
  --destination-arn "arn:aws:logs:us-east-1:CENTRAL_ACCOUNT:destination:all-logs"

# Organization CloudTrail (single trail for all accounts)
aws cloudtrail create-trail \
  --name "org-trail" \
  --s3-bucket-name "central-account-logs-bucket" \
  --is-organization-trail \
  --is-multi-region-trail

Use Case 3: Cost Anomaly Detection with CloudWatch

Scenario: The finance team wants to detect unusual spikes in AWS spending before they become a surprise bill.

Solution: CloudWatch Anomaly Detection + Budget Alarms

graph LR
    subgraph Training["ML Model Training"]
        CW_METRICS["CloudWatch Metrics\nEstimatedCharges history (up to 2 weeks)"]
        ANOMALY["CloudWatch Anomaly Detection\nBand: Expected +/- 2 std dev"]
    end
    
    subgraph Monitoring["Real-time Monitoring"]
        CURRENT["Current day spend metric"]
        DETECT["Detect deviation\noutside expected band"]
    end
    
    subgraph Alert["Alerting"]
        ALARM["CloudWatch Alarm\nANOMALY_DETECTION_BAND"]
        SNS["SNS → Email/Slack"]
        LAMBDA["Lambda\nAuto-suspend resources"]
    end
    
    CW_METRICS --> ANOMALY
    CURRENT --> DETECT
    ANOMALY --> DETECT
    DETECT --> ALARM
    ALARM --> SNS
    ALARM --> LAMBDA

# Create anomaly detection alarm for estimated charges
# (billing metrics are published only in us-east-1 and require billing alerts to be enabled;
# EstimatedCharges is a cumulative month-to-date total, not a per-day figure)
aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name "cost-anomaly-alarm" \
  --alarm-description "Alert if estimated charges rise above the expected band" \
  --comparison-operator GreaterThanUpperThreshold \
  --evaluation-periods 1 \
  --threshold-metric-id "ad1" \
  --metrics '[
    {"Id": "m1", "MetricStat": { "Metric": { "Namespace": "AWS/Billing", "MetricName": "EstimatedCharges",
          "Dimensions": [{"Name": "Currency", "Value": "USD"}] },
        "Period": 86400,
        "Stat": "Maximum"
      },
      "ReturnData": true
    },
    {"Id": "ad1", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)", "Label": "Expected spend (2 std dev)", "ReturnData": true }
  ]'

Takeaway: CloudWatch Anomaly Detection uses ML to automatically establish a baseline and alert on deviations. Combined with AWS Budgets (forecasted alerts at 80%, 100%), you get proactive cost control.
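
The idea of an "expected band" can be illustrated with a plain mean and standard deviation. This is only a rough sketch: CloudWatch's model also accounts for trend and seasonality, and the spend figures and function names below are hypothetical.

```python
import statistics


def expected_band(history, num_std_dev=2):
    """Return (lower, upper) bounds as mean +/- num_std_dev standard deviations."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return mean - num_std_dev * std, mean + num_std_dev * std


def is_anomalous(value, history, num_std_dev=2):
    """True if the value falls outside the expected band."""
    lower, upper = expected_band(history, num_std_dev)
    return value < lower or value > upper


# Hypothetical daily spend in USD
daily_spend = [100, 102, 98, 101, 99, 100, 103, 97]
lower, upper = expected_band(daily_spend)
print(f"expected band: {lower:.2f} to {upper:.2f}")
print(is_anomalous(150, daily_spend), is_anomalous(101, daily_spend))
```

A wider band (a larger second argument to `ANOMALY_DETECTION_BAND`) trades sensitivity for fewer false alarms.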

✅ Chapter Quiz

1. Which CloudWatch alarm state means insufficient data is available?
- A) OK
- B) ALARM
- C) INSUFFICIENT_DATA
- D) UNKNOWN
2. Which service records who deleted an S3 bucket?
- A) CloudWatch
- B) CloudTrail
- C) Config
- D) X-Ray
3. What is the purpose of X-Ray service maps?
- A) Track resource configuration changes
- B) Visual representation of service dependencies
- C) Monitor EC2 CPU usage
- D) Store log files
4. Which CloudWatch feature allows SQL-like querying of logs?
- A) Log Groups
- B) Log Streams
- C) Logs Insights
- D) Metric Filters
5. How long can you retain CloudWatch Logs?
- A) 30 days
- B) 90 days
- C) 365 days
- D) Indefinitely by default (retention is configurable from 1 day to 10 years)
6. Which CloudWatch feature allows you to run code to simulate user behavior and monitor application health?
- A) CloudWatch Logs
- B) CloudWatch Synthetics
- C) CloudWatch Metrics
- D) CloudWatch Alarms
7. What is the purpose of an X-Ray segment?
- A) To store log data
- B) To represent the work done by a single service in a trace
- C) To monitor EC2 CPU usage
- D) To create dashboards
8. Which AWS service provides a centralized view of resource configuration compliance?
- A) CloudTrail
- B) Config
- C) CloudWatch
- D) Service Catalog
9. A company wants to collect operating system-level metrics from EC2 instances, including memory usage and disk space. How should they collect this data?
- A) CloudWatch default metrics
- B) CloudWatch agent
- C) CloudTrail
- D) VPC Flow Logs
10. What does a CloudWatch composite alarm allow you to do?
- A) Create alarms across multiple regions
- B) Combine multiple alarms using AND/OR logic
- C) Monitor billing metrics
- D) Create dashboards
11. Which X-Ray feature provides a visual representation of service dependencies?
- A) Traces
- B) Service maps
- C) Segments
- D) Sampling rules
12. A company needs to monitor network traffic to and from EC2 instances for security analysis. Which service should they use?
- A) CloudWatch
- B) CloudTrail
- C) VPC Flow Logs
- D) Config
13. What is the default retention period for CloudTrail management events?
- A) 30 days
- B) 90 days
- C) 365 days
- D) 7 years
14. Which service would you use to get a unified view of health and performance across all AWS accounts in an organization?
- A) CloudWatch Dashboard
- B) CloudWatch cross-account observability
- C) AWS Organizations
- D) Service Catalog
15. What is the purpose of a CloudWatch Logs subscription filter?
- A) To filter logs in the console
- B) To forward log entries matching a pattern to another service in real time
- C) To set retention policies
- D) To export logs to S3
16. A company wants to trace a user request as it travels through an API Gateway, Lambda function, and DynamoDB. Which service provides this end-to-end tracing?
- A) CloudWatch
- B) CloudTrail
- C) X-Ray
- D) Config
17. What is the difference between a CloudWatch metric and a CloudWatch log?
- A) Metrics are time-series data, logs are event records with timestamps
- B) Logs are numerical, metrics are textual
- C) Metrics are free, logs are always paid
- D) There is no difference
18. Which CloudWatch Logs feature allows you to run SQL-like queries across multiple log groups?
- A) Metric filters
- B) Logs Insights
- C) Subscription filters
- D) Contributor Insights
19. A company needs to detect and investigate unusual API activity in their AWS account. Which CloudTrail feature should they enable?
- A) Management events
- B) Data events
- C) Insights
- D) Multi-region trail
20. What is the purpose of X-Ray sampling rules?
- A) To reduce the number of traces recorded to control costs
- B) To sample logs for analysis
- C) To select which API calls to record
- D) To filter CloudWatch metrics
21. Which AWS service provides automated vulnerability management for EC2 instances and container images?
- A) GuardDuty
- B) Inspector
- C) Security Hub
- D) Config
22. A company wants to be notified if an EC2 instance’s CPU utilization exceeds 90% for 5 consecutive minutes. What should they create?
- A) CloudWatch Logs
- B) CloudWatch Alarm
- C) CloudTrail
- D) Config rule
23. What is the purpose of CloudWatch Container Insights?
- A) To monitor ECS and EKS container metrics
- B) To scan container images for vulnerabilities
- C) To manage container deployments
- D) To store container logs
24. Which AWS service provides a managed Elasticsearch-compatible search and analytics engine for log analysis?
- A) CloudWatch Logs
- B) OpenSearch Service
- C) Athena
- D) Kinesis Data Analytics
25. A company wants to create a dashboard showing real-time operational metrics from multiple AWS accounts in a single view. Which CloudWatch feature supports this?
- A) CloudWatch cross-account dashboards
- B) CloudWatch Logs Insights
- C) CloudWatch Synthetics
- D) CloudWatch Contributor Insights

📝 Answer Key

1. C: INSUFFICIENT_DATA means there’s not enough data to determine state.
2. B: CloudTrail records management API calls such as S3 DeleteBucket, along with the identity of the caller.
3. B: X-Ray service maps show service dependencies and performance.
4. C: CloudWatch Logs Insights uses a SQL-like query language.
5. D: By default, CloudWatch Logs never expire; retention can also be set from 1 day to 10 years.
6. B: CloudWatch Synthetics creates canaries that run configurable scripts to simulate user flows.
7. B: An X-Ray segment records the work done by a single service within a trace.
8. B: AWS Config tracks resource configuration changes and evaluates compliance continuously.
9. B: The CloudWatch agent collects OS-level metrics like memory and disk usage from EC2 instances.
10. B: Composite alarms combine multiple child alarms using AND/OR logic for complex alerting.
11. B: X-Ray service maps visualize service dependencies and performance bottlenecks.
12. C: VPC Flow Logs capture IP traffic information for network security analysis.
13. B: CloudTrail management events are retained for 90 days by default in the event history.
14. B: CloudWatch cross-account observability provides a unified monitoring view across accounts.
15. B: Subscription filters forward log entries matching a pattern to destinations in real time.
16. C: X-Ray traces end-to-end requests across microservices and AWS services.
17. A: CloudWatch Metrics are numerical time-series data; Logs contain event records.
18. B: CloudWatch Logs Insights uses a SQL-like query language to analyze log data.
19. C: CloudTrail Insights uses ML to detect unusual API activity and generate insights events.
20. A: Sampling rules define how much trace data to record, balancing detail with cost.
21. B: Amazon Inspector continuously scans EC2 instances and container images for vulnerabilities.
22. B: A CloudWatch Alarm monitors a metric and triggers an action when a threshold is breached.
23. A: Container Insights collects CPU, memory, network, and disk metrics from ECS and EKS.
24. B: OpenSearch Service provides a managed Elasticsearch-compatible engine for log analytics.
25. A: CloudWatch cross-account dashboards aggregate metrics from multiple accounts in one view.
