Data Lake Analytics#

A scalable data lake for analytics using S3, Glue, Athena, Kinesis, and QuickSight — ingest, catalog, query, and visualize data at petabyte scale with minimal operational overhead.

Architecture Overview#

                     ┌─────────────────────────────────────┐
                     │         Data Sources                 │
                     │  ┌─────┐ ┌─────┐ ┌──────┐ ┌─────┐  │
                     │  │ App │ │ IoT │ │ DB   │ │ 3rd │  │
                     │  │ Logs│ │Data │ │ CDC  │ │Party│  │
                     │  └──┬──┘ └──┬──┘ └──┬───┘ └──┬──┘  │
                     └─────┼───────┼───────┼────────┼──────┘
                           │       │       │        │
              ┌────────────┼───────┼───────┼────────┼──────┐
              │     ┌──────▼──┐ ┌──▼───┐ ┌▼──────┐ ┌▼───┐ │
              │     │Kinesis  │ │Kinesis│ │DMS    │ │App │ │
              │     │Streams  │ │Firehose│ │(CDC)  │ │Sync│ │
              │     └──────┬──┘ └──┬───┘ └──┬────┘ └──┬──┘ │
              │            │       │        │         │     │
              │     ┌──────▼───────▼────────▼─────────▼──┐ │
              │     │        S3 Data Lake (Raw)          │ │
              │     │  s3://company-lake/raw/            │ │
              │     └────────────────┬───────────────────┘ │
              │                      │                     │
              │     ┌────────────────▼───────────────────┐ │
              │     │     AWS Glue (Crawlers + ETL)      │ │
              │     │  • Crawl S3 to populate catalog    │ │
              │     │  • ETL: clean, transform, enrich   │ │
              │     │  • Partition data by date/region   │ │
              │     └────────────────┬───────────────────┘ │
              │                      │                     │
              │     ┌────────────────▼───────────────────┐ │
              │     │        S3 Data Lake (Curated)       │ │
              │     │  s3://company-lake/curated/         │ │
              │     │  Parquet format, partitioned        │ │
              │     └──┬────────────┬────────────┬────────┘ │
              │        │            │            │          │
              │   ┌────▼───┐  ┌────▼────┐ ┌────▼──────┐   │
              │   │ Athena │  │Redshift │ │QuickSight │   │
              │   │  (SQL) │  │Spectrum │ │(BI/Vis)   │   │
              │   └────────┘  └─────────┘ └───────────┘   │
              └───────────────────────────────────────────┘

Services Used#

Service Purpose Configuration
Kinesis Data Streams Real-time data ingestion 10 shards, 24-hour retention, server-side encryption
Kinesis Data Firehose Batch data delivery to S3 Buffer 60s/10MB, Parquet conversion, compression
DMS Database CDC ingestion Ongoing replication from RDS to S3
S3 Data lake storage Lifecycle: Standard -> Glacier after 90d, partition by year/month/day
Glue Data catalog & ETL Crawlers for schema discovery, ETL jobs in PySpark
Lake Formation Data lake governance Column-level access control, row filtering
Athena Ad-hoc SQL queries Partition projection, columnar format (Parquet), compression
Redshift Spectrum Complex analytics External tables over S3 data, joins with Redshift local tables
QuickSight Dashboards & visualization SPICE engine for fast queries, ML Insights

Key Design Decisions#

Decision Rationale
S3 as the single source of truth Durable (11 9s), cheap ($23/TB/mo), scales infinitely
Parquet format + compression 50-80% less storage vs CSV, 5-10x faster queries in Athena
Partitioning by date Most queries filter by date – partition pruning reduces scan costs
Raw vs Curated zones Raw preserves original data; Curated is cleaned for analytics
Glue over custom ETL Serverless, schema discovery, job scheduling built-in
Lake Formation for governance Centralized permissions across Glue, Athena, QuickSight

Real-World Use Case#

Scenario: A media company ingests 5TB/day of user activity logs, ad impressions, and content metadata.

Data flow:

  1. Ingestion: App servers stream events to Kinesis (5K events/sec)
  2. Storage: Firehose batches data to S3 every 60 seconds in Parquet format
  3. Catalog: Glue crawlers run every 6 hours to update the Data Catalog
  4. Transform: Glue ETL jobs clean raw data, join with reference tables, write curated zones
  5. Query: Analysts query via Athena ($5/TB scanned) for ad-hoc exploration
  6. Visualize: QuickSight dashboards show real-time metrics to the executive team

Glue ETL Job Example#

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Read raw data from S3
raw_df = glueContext.create_dynamic_frame.from_catalog(
    database="company_lake", table_name="raw_user_activity")

# Clean and transform
cleaned = raw_df.drop_fields(["internal_metadata"]).filter(
    lambda row: row["event_type"].isNotNull())

# Write curated Parquet back to S3 with partitioning
glueContext.write_dynamic_frame.from_catalog(
    frame=cleaned, database="company_lake",
    table_name="curated_user_activity",
    additional_options={"partitionKeys": ["year", "month", "day"]})

Athena Query Example#

-- Daily active users by content category
SELECT date_trunc('day', event_time) as day,
       content_category,
       count(distinct user_id) as dau
FROM curated_user_activity
WHERE event_type = 'page_view'
  AND year = '2024' AND month = '12'
GROUP BY 1, 2
ORDER BY 1, 3 DESC;

-- Cost: scanned ~120GB, ~$0.60 per query

Cost Estimate (Monthly)#

Service Estimated Cost
S3 Storage (150TB) ~$3,450
Kinesis Data Streams ~$720
Kinesis Firehose ~$300
Glue ETL (200 hours) ~$880
Athena queries (50TB scanned) ~$250
QuickSight (10 authors) ~$240
Total ~$5,840/month

Key Exam Takeaways#

  • S3 is the foundation of any data lake – cheap, durable, infinitely scalable
  • Parquet + compression + partitioning = Athena cost optimization trifecta
  • Glue Crawlers populate the Data Catalog automatically
  • Lake Formation provides column/row-level security for data lakes
  • Kinesis Firehose is the easiest way to get streaming data into S3
  • Redshift Spectrum queries S3 directly without loading data
  • Athena is serverless – pay per query, no cluster to manage

References#