📊 Monitoring & Observability

Learning Objectives

  • Monitor infrastructure with CloudWatch metrics, logs, and alarms
  • Audit API activity with CloudTrail
  • Trace requests across services with X-Ray
  • Centralize logs with CloudWatch Logs Insights

1. Amazon CloudWatch

1.1 CloudWatch Metrics

CloudWatch monitors AWS resources and applications with metrics — time-series data points.

Built-in Metrics (AWS Services):

  • EC2: CPUUtilization, NetworkIn, NetworkOut, StatusCheckFailed
  • RDS: DatabaseConnections, ReadLatency, WriteLatency
  • ALB: RequestCount, TargetResponseTime, HealthyHostCount
  • Lambda: Invocations, Duration, Errors, Throttles
  • S3: BucketSizeBytes, NumberOfObjects

Custom Metrics:

# Put custom metric (memory usage, disk space, etc.)
aws cloudwatch put-metric-data \
  --namespace "Custom/AppMetrics" \
  --metric-data '[
    {"MetricName": "ActiveUsers", "Value": 542, "Unit": "Count"},
    {"MetricName": "ResponseTime", "StatisticValues": { "SampleCount": 100, "Sum": 2500, "Minimum": 10, "Maximum": 85 }}
  ]'
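
The second entry above publishes a pre-aggregated statistic set instead of a single value; its implied average is 2500 / 100 = 25. As a minimal sketch of how such a set might be derived from raw samples before publishing, the Python below summarizes a list of measurements. The function name `to_statistic_set` and the sample latencies are hypothetical and not part of any AWS SDK.

```python
def to_statistic_set(samples):
    """Summarize raw samples into the four fields of a StatisticValues entry."""
    if not samples:
        raise ValueError("at least one sample is required")
    return {
        "SampleCount": len(samples),
        "Sum": sum(samples),
        "Minimum": min(samples),
        "Maximum": max(samples),
    }


latencies_ms = [10, 25, 40, 85]  # made-up values for illustration
stats = to_statistic_set(latencies_ms)
average = stats["Sum"] / stats["SampleCount"]  # the average is derived from Sum and SampleCount
```

Sending one statistic set per interval rather than one call per sample keeps the number of API requests down when an application produces many measurements.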

1.2 CloudWatch Alarms

Alarm States: OK | ALARM | INSUFFICIENT_DATA

Metric → Alarm → SNS Topic → Email, SMS, Lambda, Auto Scaling

# Create alarm: average CPU above 80% for two consecutive 5-minute periods
aws cloudwatch put-metric-alarm \
  --alarm-name "high-cpu-alarm" \
  --alarm-description "EC2 CPU > 80% for 10 minutes (2 x 5-minute periods)" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:...:ops-team \
  --dimensions Name=InstanceId,Value=i-abc123
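
How the period, evaluation periods, and threshold combine into one of the three alarm states can be sketched in a few lines of Python. This is a simplified model, assuming every datapoint in the window must breach the threshold and that any missing datapoint yields INSUFFICIENT_DATA. The real service also offers "M out of N" evaluation and configurable missing-data handling, which are not modeled here. The function name `evaluate_alarm` is hypothetical.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods):
    """Return the alarm state for the most recent evaluation window.

    Simplified model of GreaterThanThreshold: every one of the last
    `evaluation_periods` datapoints must exceed the threshold.
    A missing datapoint is represented as None.
    """
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods or any(p is None for p in window):
        return "INSUFFICIENT_DATA"
    if all(p > threshold for p in window):
        return "ALARM"
    return "OK"


# One average CPU value per 300-second period (made-up numbers)
state_breaching = evaluate_alarm([70, 85, 91], threshold=80, evaluation_periods=2)
state_recovered = evaluate_alarm([85, 79], threshold=80, evaluation_periods=2)
state_missing = evaluate_alarm([85, None], threshold=80, evaluation_periods=2)
```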

Composite Alarms — Combine multiple alarms with AND/OR logic:

# Child alarms must already exist; "high-memory" is assumed to be defined elsewhere
aws cloudwatch put-composite-alarm \
  --alarm-name "high-cpu-or-memory" \
  --alarm-rule "ALARM(high-cpu-alarm) OR ALARM(high-memory)" \
  --alarm-actions arn:aws:sns:...:ops-team
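
The AND/OR logic of a composite rule can be pictured as a boolean expression over the current states of its child alarms. The sketch below is an illustration only: the helper names `composite_or` and `composite_and` and the child alarm names are hypothetical, and real alarm rules also support NOT and the OK() and INSUFFICIENT_DATA() functions, which are left out.

```python
def composite_or(child_states, names):
    """ALARM if any named child alarm is in ALARM, otherwise OK."""
    return "ALARM" if any(child_states.get(n) == "ALARM" for n in names) else "OK"


def composite_and(child_states, names):
    """ALARM only if every named child alarm is in ALARM, otherwise OK."""
    return "ALARM" if all(child_states.get(n) == "ALARM" for n in names) else "OK"


# Current states of two hypothetical child alarms
states = {"cpu": "ALARM", "memory": "OK"}
or_result = composite_or(states, ["cpu", "memory"])
and_result = composite_and(states, ["cpu", "memory"])
```

An OR rule pages as soon as either symptom appears, while an AND rule reduces noise by requiring both at once.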

1.3 CloudWatch Logs

Log Groups → Log Streams → Log Events

# Create log group
aws logs create-log-group --log-group-name /app/prod/web

# Set retention
aws logs put-retention-policy \
  --log-group-name /app/prod/web \
  --retention-in-days 90

# Subscribe log group to Lambda for real-time processing
aws logs put-subscription-filter \
  --log-group-name /app/prod/web \
  --filter-name error-filter \
  --filter-pattern "ERROR" \
  --destination-arn arn:aws:lambda:us-east-1:...:function:error-handler

CloudWatch Logs Insights — Query logs with SQL-like syntax:

fields @timestamp, @message
| filter @message like /ERROR|CRITICAL/
| stats count(*) as matches by @logStream
| sort matches desc
| limit 20
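
To make the filter-and-count part of the query above concrete, here is a rough Python equivalent that runs over an in-memory list of events. It only mirrors the logic and does not call CloudWatch. The function name `count_matches_by_stream` and the sample events are hypothetical.

```python
import re
from collections import Counter

PATTERN = re.compile(r"ERROR|CRITICAL")


def count_matches_by_stream(events, limit=20):
    """Count messages matching PATTERN per log stream, highest count first.

    `events` is an iterable of (log_stream, message) tuples.
    """
    counts = Counter(stream for stream, message in events if PATTERN.search(message))
    return counts.most_common(limit)


sample_events = [
    ("web-1", "ERROR db timeout"),
    ("web-1", "INFO request ok"),
    ("web-2", "CRITICAL disk full"),
    ("web-1", "ERROR retry failed"),
]
top_streams = count_matches_by_stream(sample_events)
```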

1.4 CloudWatch Dashboards

Create custom dashboards with metrics from multiple services:

aws cloudwatch put-dashboard \
  --dashboard-name "Production-Overview" \
  --dashboard-body '{
    "widgets": [{
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/EC2", "CPUUtilization", {"stat": "Average"}],
          ["AWS/RDS", "DatabaseConnections", {"stat": "Sum"}]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1",
        "title": "Production Metrics Overview"
      }
    }]
  }'
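
Hand-written JSON inside a shell string is easy to get wrong, so one option is to build the dashboard body as a data structure and serialize it. The sketch below does this with Python's standard `json` module. The helper `metric_widget` is hypothetical, and the result is only a string that could then be passed as the `--dashboard-body` value.

```python
import json


def metric_widget(title, metrics, region="us-east-1", period=300, stat="Average"):
    """Build one metric widget in the same shape as the example above."""
    return {
        "type": "metric",
        "properties": {
            "metrics": metrics,
            "period": period,
            "stat": stat,
            "region": region,
            "title": title,
        },
    }


dashboard = {
    "widgets": [
        metric_widget(
            "Production Metrics Overview",
            [
                ["AWS/EC2", "CPUUtilization", {"stat": "Average"}],
                ["AWS/RDS", "DatabaseConnections", {"stat": "Sum"}],
            ],
        )
    ]
}
dashboard_body = json.dumps(dashboard)  # serialized JSON string
```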

1.5 CloudWatch Container Insights

Collect, aggregate, and summarize metrics from ECS, EKS, and Kubernetes:

  • CPU, memory, network, disk metrics
  • Performance log patterns
  • Service-level dashboards

2. AWS CloudTrail

Purpose: Audit all API calls made in your AWS account.

| Feature | Description |
|---|---|
| Management Events | CRUD on AWS resources (S3 create, EC2 launch) |
| Data Events | S3 object-level, Lambda function invocation |
| Insights Events | Unusual API activity detected by ML |
| Multi-Region Trail | Logs all regions to a single S3 bucket |

# Create multi-region trail
aws cloudtrail create-trail \
  --name "organization-trail" \
  --s3-bucket-name my-company-cloudtrail-logs \
  --is-multi-region-trail \
  --is-organization-trail \
  --enable-log-file-validation

# Start logging
aws cloudtrail start-logging --name "organization-trail"

CloudTrail Log Example:

{
  "eventVersion": "1.08",
  "userIdentity": {
    "type": "IAMUser",
    "arn": "arn:aws:iam::123456789012:user/admin",
    "accountId": "123456789012"
  },
  "eventTime": "2024-01-15T14:30:00Z",
  "eventSource": "ec2.amazonaws.com",
  "eventName": "RunInstances",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.42",
  "userAgent": "console.amazonaws.com",
  "requestParameters": {"instanceType": "t3.medium", "imageId": "ami-0abcdef1234567890"},
  "responseElements": {
    "instancesSet": {"items": [{"instanceId": "i-abc123"}]}
  }
}
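
When auditing, the fields that usually matter first are who made the call, what they called, when, and from where. As a small sketch, the Python below extracts those fields from a record shaped like the example above. The function name `summarize_event` is hypothetical, and the embedded record is a trimmed copy of that example.

```python
import json


def summarize_event(record):
    """Extract the who/what/when/where fields from one CloudTrail record."""
    identity = record.get("userIdentity", {})
    return {
        "who": identity.get("arn", "unknown"),
        "what": record.get("eventName"),
        "service": record.get("eventSource"),
        "when": record.get("eventTime"),
        "from_ip": record.get("sourceIPAddress"),
    }


raw_record = """
{
  "userIdentity": {"type": "IAMUser", "arn": "arn:aws:iam::123456789012:user/admin"},
  "eventTime": "2024-01-15T14:30:00Z",
  "eventSource": "ec2.amazonaws.com",
  "eventName": "RunInstances",
  "sourceIPAddress": "203.0.113.42"
}
"""
summary = summarize_event(json.loads(raw_record))
```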

3. AWS X-Ray

Troubleshoot performance and errors by tracing requests across services:

User Request → API Gateway → Lambda → DynamoDB
                  │             │         │
                  └─────────────┴─────────┘
                          X-Ray Trace

Key Concepts:

  • Trace — End-to-end path of a request
  • Segment — Work done by a single service
  • Subsegment — Work done within a service (e.g., DB query)
  • Service Map — Visual representation of all services

# Enable X-Ray on Lambda
aws lambda update-function-configuration \
  --function-name process-order \
  --tracing-config Mode=Active

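# X-Ray sampling rule: trace up to 10 requests per second, then 10% of the rest
# (ResourceARN and Version are required fields of a sampling rule)
aws xray put-sampling-rule \
  --sampling-rule '{
    "RuleName": "production-sampling",
    "ResourceARN": "*",
    "Priority": 1000,
    "ReservoirSize": 10,
    "FixedRate": 0.1,
    "Host": "*", "HTTPMethod": "*", "URLPath": "*",
    "ServiceName": "*", "ServiceType": "*",
    "Version": 1
  }'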
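
The interaction between the reservoir size and the fixed rate is easier to see with numbers. The sketch below estimates how many requests per second would be traced under a single rule, assuming the reservoir is filled first and the fixed rate applies to the remaining requests. It is a back-of-the-envelope model rather than the service's exact behavior, and the function name `expected_traces_per_second` is hypothetical.

```python
def expected_traces_per_second(requests_per_second, reservoir_size, fixed_rate):
    """Estimate traced requests per second for one sampling rule.

    The first `reservoir_size` requests each second are traced; beyond that,
    a `fixed_rate` fraction of the remaining requests is traced.
    """
    from_reservoir = min(requests_per_second, reservoir_size)
    beyond_reservoir = max(requests_per_second - reservoir_size, 0)
    return from_reservoir + beyond_reservoir * fixed_rate


# With ReservoirSize=10 and FixedRate=0.1 as in the rule above
low_traffic = expected_traces_per_second(5, 10, 0.1)
high_traffic = expected_traces_per_second(110, 10, 0.1)
```

At 110 requests per second this model traces about 20 of them (10 from the reservoir plus 10% of the other 100), which shows how sampling keeps tracing cost roughly bounded as traffic grows.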

4. AWS Config

Track resource configuration changes:

# Enable Config recorder
aws configservice put-configuration-recorder \
  --configuration-recorder name=default,roleARN=arn:aws:iam::...:role/aws-config-role

# Enable delivery channel
aws configservice put-delivery-channel \
  --delivery-channel name=default,s3BucketName=my-config-bucket

# Start recording
aws configservice start-configuration-recorder --configuration-recorder-name=default

5. Monitoring Comparison

| Service | Purpose | Data Source |
|---|---|---|
| CloudWatch | Performance metrics, logs, alarms | AWS services, custom apps |
| CloudTrail | API audit trail | AWS API calls |
| X-Ray | Request tracing, performance | Application traces |
| Config | Resource configuration changes | AWS resource states |
| VPC Flow Logs | Network traffic logs | VPC network interfaces |

6. Real-World Use Cases

Use Case 1: Full Observability Stack for a Microservices Application

Scenario: A company runs a microservices app with 20+ services on ECS Fargate. They need to debug slow API responses, trace requests across services, and get alerted on anomalies.

Solution: CloudWatch + X-Ray + Synthetics

graph TD
    subgraph App["Application Layer"]
        ALB["ALB"]:::aws
        SVC1["User Service"]:::app
        SVC2["Order Service"]:::app
        SVC3["Payment Service"]:::app
        RDS["RDS"]:::aws
        SQS["SQS"]:::aws
    end
    
    subgraph Observability["Observability Layer"]
        CW["CloudWatch\nMetrics + Logs + Alarms"]:::cw
        XRAY["X-Ray\nDistributed Tracing"]:::xray
        SYNTH["CloudWatch Synthetics\nCanary Monitoring"]:::cw
        DASH["CloudWatch Dashboard\nService Overview"]:::cw
    end
    
    subgraph Alerting["Alerting Layer"]
        SNS["SNS Topic"]:::aws
        PAGER["PagerDuty\nOpsGenie"]:::tool
        EMAIL["Email + Slack"]:::tool
    end
    
    ALB --> SVC1
    SVC1 --> SVC2
    SVC2 --> SVC3
    SVC1 --> RDS
    SVC2 --> SQS
    
    ALB -.->|Metrics| CW
    SVC1 -.->|Trace| XRAY
    SVC2 -.->|Trace| XRAY
    SVC3 -.->|Trace| XRAY
    XRAY -.->|Service Map| DASH
    CW -.->|Alarm| SNS
    SYNTH -.->|Synthetic checks| CW
    SNS --> PAGER
    SNS --> EMAIL
    
    classDef aws fill:#ff9900,color:#fff
    classDef app fill:#232f3e,color:#fff
    classDef cw fill:#527fff,color:#fff
    classDef xray fill:#00a4c7,color:#fff
    classDef tool fill:#666,color:#fff

Implementation steps:

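# Step 1: Enable X-Ray tracing on ECS service
# Run the X-Ray daemon as a sidecar container in the task, then add to the
# application container definition (on Fargate, containers in a task share localhost):
# "environment": [{"name": "AWS_XRAY_DAEMON_ADDRESS", "value": "127.0.0.1:2000"}]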

# Step 2: Create CloudWatch dashboard for service overview
# (Container Insights publishes CpuUtilized and MemoryUtilized; the region is assumed to be us-east-1)
aws cloudwatch put-dashboard --dashboard-name "Microservices-Overview" --dashboard-body '{
  "widgets": [{
    "type": "metric",
    "properties": {
      "metrics": [
        ["AWS/ApplicationELB", "TargetResponseTime", {"stat": "p95"}],
        ["ECS/ContainerInsights", "CpuUtilized", {"stat": "Average"}],
        ["ECS/ContainerInsights", "MemoryUtilized", {"stat": "Average"}]
      ],
      "period": 300,
      "region": "us-east-1",
      "title": "Service Health Overview"
    }
  }]
}'

# Step 3: Create composite alarm for service health
aws cloudwatch put-composite-alarm \
  --alarm-name "order-service-unhealthy" \
  --alarm-rule "ALARM(order-svc-high-latency) OR ALARM(order-svc-high-error)" \
  --alarm-actions arn:aws:sns:...:ops-team

Use Case 2: Centralized Logging for Multi-Account Environment

Scenario: A company has 10 AWS accounts (dev, staging, prod, security, etc.). They need a centralized logging solution where the security team can search across all accounts.

Solution: Centralized Logging with Kinesis + OpenSearch

graph TD
    subgraph Accounts["Source Accounts"]
        PROD["Production\nAccount"]
        STAGING["Staging\nAccount"]
        DEV["Development\nAccount"]
        SEC["Security\nAccount"]
    end
    
    subgraph Ingestion["Central Logging Account"]
        CF["CloudFront Logs"]
        CT["CloudTrail\n(Organization Trail)"]
        CW_LOG["CloudWatch Logs\nCross-account Subscription\nFilter"]
        KDF["Kinesis Data\nFirehose"]
        LAMBDA["Lambda\nTransform/Enrich"]
    end
    
    subgraph Storage["Storage & Query"]
        S3_RAW["S3 - Raw Logs\n(Parquet)"]
        OS["OpenSearch Service\nInteractive Search"]
        ATHENA["Athena\nAd-hoc SQL Queries"]
    end
    
    subgraph Visualization["Visualization"]
        GRAFANA["Grafana\nDashboards"]
        QUICKSIGHT["QuickSight\nSecurity Reports"]
    end
    
    PROD --> CW_LOG
    STAGING --> CW_LOG
    DEV --> CW_LOG
    SEC --> CT
    CT --> KDF
    CW_LOG --> KDF
    KDF --> LAMBDA
    LAMBDA --> S3_RAW
    LAMBDA --> OS
    S3_RAW --> ATHENA
    OS --> GRAFANA
    ATHENA --> QUICKSIGHT

Cross-account log subscription setup:

# In source account: create subscription filter pointing to central account
aws logs put-subscription-filter \
  --log-group-name "/aws/lambda/prod-app" \
  --filter-name "central-logging" \
  --filter-pattern "" \
  --destination-arn "arn:aws:logs:us-east-1:CENTRAL_ACCOUNT:destination:all-logs"

# Organization CloudTrail (single trail for all accounts)
aws cloudtrail create-trail \
  --name "org-trail" \
  --s3-bucket-name "central-account-logs-bucket" \
  --is-organization-trail \
  --is-multi-region-trail

Use Case 3: Cost Anomaly Detection with CloudWatch

Scenario: The finance team wants to detect unusual spikes in AWS spending before they become a surprise bill.

Solution: CloudWatch Anomaly Detection + Budget Alarms

graph LR
    subgraph Training["ML Model Training"]
        CW_METRICS["CloudWatch Metrics\nDaily spend data (90 days)"]
        ANOMALY["CloudWatch Anomaly Detection\nBand: Expected +/- 2 std dev"]
    end
    
    subgraph Monitoring["Real-time Monitoring"]
        CURRENT["Current day spend metric"]
        DETECT["Detect deviation\noutside expected band"]
    end
    
    subgraph Alert["Alerting"]
        ALARM["CloudWatch Alarm\nANOMALY_DETECTION_BAND"]
        SNS["SNS → Email/Slack"]
        LAMBDA["Lambda\nAuto-suspend resources"]
    end
    
    CW_METRICS --> ANOMALY
    CURRENT --> DETECT
    ANOMALY --> DETECT
    DETECT --> ALARM
    ALARM --> SNS
    ALARM --> LAMBDA

# Create anomaly detection alarm for estimated charges.
# With --metrics, the metric is defined inside the JSON, so --metric-name,
# --namespace, --statistic, and --period are not passed separately.
# Billing metrics live in us-east-1; the Currency=USD dimension is an assumption.
aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name "cost-anomaly-alarm" \
  --alarm-description "Alert if daily spend is outside expected range" \
  --comparison-operator LessThanLowerOrGreaterThanUpperThreshold \
  --evaluation-periods 1 \
  --threshold-metric-id "ad1" \
  --metrics '[
    {
      "Id": "m1",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Billing",
          "MetricName": "EstimatedCharges",
          "Dimensions": [{"Name": "Currency", "Value": "USD"}]
        },
        "Period": 86400,
        "Stat": "Maximum"
      },
      "ReturnData": true
    },
    {
      "Id": "ad1",
      "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
      "Label": "Expected spend (2 std dev)",
      "ReturnData": true
    }
  ]'

Takeaway: CloudWatch Anomaly Detection uses ML to automatically establish a baseline and alert on deviations. Combined with AWS Budgets (forecasted alerts at 80%, 100%), you get proactive cost control.
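
The idea of an expected band with a width of 2 standard deviations can be illustrated with plain statistics. The sketch below is a deliberately naive stand-in: it uses the mean and sample standard deviation of recent values, whereas the CloudWatch model is trained on historical data and also accounts for trends and seasonality. The function names and the spend figures are hypothetical.

```python
from statistics import mean, stdev


def expected_band(history, width=2.0):
    """Return (lower, upper) bounds as mean +/- width * sample standard deviation."""
    center = mean(history)
    spread = stdev(history)
    return center - width * spread, center + width * spread


def is_anomalous(value, history, width=2.0):
    """True if `value` falls outside the expected band derived from `history`."""
    lower, upper = expected_band(history, width)
    return value < lower or value > upper


daily_spend = [100, 102, 98, 101, 99]  # made-up daily spend in USD
spike_detected = is_anomalous(150, daily_spend)
normal_day_flagged = is_anomalous(101, daily_spend)
```

A wider band (a larger second argument to ANOMALY_DETECTION_BAND, or `width` here) tolerates more variation and produces fewer alerts.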


✅ Chapter Quiz

  1. Which CloudWatch alarm state means insufficient data is available?

    • A) OK
    • B) ALARM
    • C) INSUFFICIENT_DATA
    • D) UNKNOWN
  2. Which service records who deleted an S3 bucket?

    • A) CloudWatch
    • B) CloudTrail
    • C) Config
    • D) X-Ray
  3. What is the purpose of X-Ray service maps?

    • A) Track resource configuration changes
    • B) Visual representation of service dependencies
    • C) Monitor EC2 CPU usage
    • D) Store log files
  4. Which CloudWatch feature allows SQL-like querying of logs?

    • A) Log Groups
    • B) Log Streams
    • C) Logs Insights
    • D) Metric Filters
  5. How long can you retain CloudWatch Logs?

    • A) 30 days
    • B) 90 days
    • C) 365 days
    • D) Indefinitely by default (retention is configurable from 1 day to 10 years)
  6. Which CloudWatch feature allows you to run code to simulate user behavior and monitor application health?

    • A) CloudWatch Logs
    • B) CloudWatch Synthetics
    • C) CloudWatch Metrics
    • D) CloudWatch Alarms
  7. What is the purpose of an X-Ray segment?

    • A) To store log data
    • B) To represent the work done by a single service in a trace
    • C) To monitor EC2 CPU usage
    • D) To create dashboards
  8. Which AWS service provides a centralized view of resource configuration compliance?

    • A) CloudTrail
    • B) Config
    • C) CloudWatch
    • D) Service Catalog
  9. A company wants to collect operating system-level metrics from EC2 instances, including memory usage and disk space. How should they collect this data?

    • A) CloudWatch default metrics
    • B) CloudWatch agent
    • C) CloudTrail
    • D) VPC Flow Logs
  10. What does a CloudWatch composite alarm allow you to do?

    • A) Create alarms across multiple regions
    • B) Combine multiple alarms using AND/OR logic
    • C) Monitor billing metrics
    • D) Create dashboards
  11. Which X-Ray feature provides a visual representation of service dependencies?

    • A) Traces
    • B) Service maps
    • C) Segments
    • D) Sampling rules
  12. A company needs to monitor network traffic to and from EC2 instances for security analysis. Which service should they use?

    • A) CloudWatch
    • B) CloudTrail
    • C) VPC Flow Logs
    • D) Config
  13. What is the default retention period for CloudTrail management events?

    • A) 30 days
    • B) 90 days
    • C) 365 days
    • D) 7 years
  14. Which service would you use to get a unified view of health and performance across all AWS accounts in an organization?

    • A) CloudWatch Dashboard
    • B) CloudWatch cross-account observability
    • C) AWS Organizations
    • D) Service Catalog
  15. What is the purpose of CloudWatch Logs subscription filter?

    • A) To filter logs in the console
    • B) To forward log entries matching a pattern to another service in real time
    • C) To set retention policies
    • D) To export logs to S3
  16. A company wants to trace a user request as it travels through an API Gateway, Lambda function, and DynamoDB. Which service provides this end-to-end tracing?

    • A) CloudWatch
    • B) CloudTrail
    • C) X-Ray
    • D) Config
  17. What is the difference between a CloudWatch metric and a CloudWatch log?

    • A) Metrics are time-series data, logs are event records with timestamps
    • B) Logs are numerical, metrics are textual
    • C) Metrics are free, logs are always paid
    • D) There is no difference
  18. Which CloudWatch Logs feature allows you to run SQL-like queries across multiple log groups?

    • A) Metric filters
    • B) Logs Insights
    • C) Subscription filters
    • D) Contributor Insights
  19. A company needs to detect and investigate unusual API activity in their AWS account. Which CloudTrail feature should they enable?

    • A) Management events
    • B) Data events
    • C) Insights
    • D) Multi-region trail
  20. What is the purpose of X-Ray sampling rules?

    • A) To reduce the number of traces recorded to control costs
    • B) To sample logs for analysis
    • C) To select which API calls to record
    • D) To filter CloudWatch metrics
  21. Which AWS service provides automated vulnerability management for EC2 instances and container images?

    • A) GuardDuty
    • B) Inspector
    • C) Security Hub
    • D) Config
  22. A company wants to be notified if an EC2 instance’s CPU utilization exceeds 90% for 5 consecutive minutes. What should they create?

    • A) CloudWatch Logs
    • B) CloudWatch Alarm
    • C) CloudTrail
    • D) Config rule
  23. What is the purpose of CloudWatch Container Insights?

    • A) To monitor ECS and EKS container metrics
    • B) To scan container images for vulnerabilities
    • C) To manage container deployments
    • D) To store container logs
  24. Which AWS service provides a managed Elasticsearch-compatible search and analytics engine for log analysis?

    • A) CloudWatch Logs
    • B) OpenSearch Service
    • C) Athena
    • D) Kinesis Data Analytics
  25. A company wants to create a dashboard showing real-time operational metrics from multiple AWS accounts in a single view. Which CloudWatch feature supports this?

    • A) CloudWatch cross-account dashboards
    • B) CloudWatch Logs Insights
    • C) CloudWatch Synthetics
    • D) CloudWatch Contributor Insights

📝 Answer Key
  1. C — INSUFFICIENT_DATA means there’s not enough data to determine state.
  2. B — CloudTrail records all API calls including S3 DeleteBucket.
  3. B — X-Ray service maps show service dependencies and performance.
  4. C — CloudWatch Logs Insights uses a SQL-like query language.
  5. D — By default CloudWatch Logs never expire; retention can also be set from 1 day to 10 years.
  6. B — CloudWatch Synthetics creates canaries that run configurable scripts to simulate user flows.
  7. B — An X-Ray segment records the work done by a single service within a trace.
  8. B — AWS Config tracks resource configuration changes and evaluates compliance continuously.
  9. B — The CloudWatch agent collects OS-level metrics like memory and disk usage from EC2 instances.
  10. B — Composite alarms combine multiple child alarms using AND/OR logic for complex alerting.
  11. B — X-Ray service maps visualize service dependencies and performance bottlenecks.
  12. C — VPC Flow Logs capture IP traffic information for network security analysis.
  13. B — CloudTrail management events are retained for 90 days by default in the event history.
  14. B — CloudWatch cross-account observability provides a unified monitoring view across accounts.
  15. B — Subscription filters forward log entries matching a pattern to destinations in real time.
  16. C — X-Ray traces end-to-end requests across microservices and AWS services.
  17. A — CloudWatch Metrics are numerical time-series data; Logs contain event records.
  18. B — CloudWatch Logs Insights uses a SQL-like query language to analyze log data.
  19. C — CloudTrail Insights uses ML to detect unusual API activity and generate insights events.
  20. A — Sampling rules define how much trace data to record, balancing detail with cost.
  21. B — Amazon Inspector continuously scans EC2 instances and container images for vulnerabilities.
  22. B — A CloudWatch Alarm monitors a metric and triggers an action when a threshold is breached.
  23. A — Container Insights collects CPU, memory, network, and disk metrics from ECS and EKS.
  24. B — OpenSearch Service provides a managed Elasticsearch-compatible engine for log analytics.
  25. A — CloudWatch cross-account dashboards aggregate metrics from multiple accounts in one view.
