Data Lake Analytics#

A scalable data lake for analytics using S3, Glue, Athena, Kinesis, and QuickSight — ingest, catalog, query, and visualize data at petabyte scale with minimal operational overhead.

Architecture Overview#

                     ┌─────────────────────────────────────┐
                     │         Data Sources                 │
                     │  ┌─────┐ ┌─────┐ ┌──────┐ ┌─────┐  │
                     │  │ App │ │ IoT │ │ DB   │ │ 3rd │  │
                     │  │ Logs│ │Data │ │ CDC  │ │Party│  │
                     │  └──┬──┘ └──┬──┘ └──┬───┘ └──┬──┘  │
                     └─────┼───────┼───────┼────────┼──────┘
                           │       │       │        │
              ┌────────────┼───────┼───────┼────────┼──────┐
              │     ┌──────▼──┐ ┌──▼───┐ ┌▼──────┐ ┌▼───┐ │
              │     │Kinesis  │ │Kinesis│ │DMS    │ │App │ │
              │     │Streams  │ │Firehose│ │(CDC)  │ │Sync│ │
              │     └──────┬──┘ └──┬───┘ └──┬────┘ └──┬──┘ │
              │            │       │        │         │     │
              │     ┌──────▼───────▼────────▼─────────▼──┐ │
              │     │        S3 Data Lake (Raw)          │ │
              │     │  s3://company-lake/raw/            │ │
              │     └────────────────┬───────────────────┘ │
              │                      │                     │
              │     ┌────────────────▼───────────────────┐ │
              │     │     AWS Glue (Crawlers + ETL)      │ │
              │     │  • Crawl S3 to populate catalog    │ │
              │     │  • ETL: clean, transform, enrich   │ │
              │     │  • Partition data by date/region   │ │
              │     └────────────────┬───────────────────┘ │
              │                      │                     │
              │     ┌────────────────▼───────────────────┐ │
              │     │        S3 Data Lake (Curated)       │ │
              │     │  s3://company-lake/curated/         │ │
              │     │  Parquet format, partitioned        │ │
              │     └──┬────────────┬────────────┬────────┘ │
              │        │            │            │          │
              │   ┌────▼───┐  ┌────▼────┐ ┌────▼──────┐   │
              │   │ Athena │  │Redshift │ │QuickSight │   │
              │   │  (SQL) │  │Spectrum │ │(BI/Vis)   │   │
              │   └────────┘  └─────────┘ └───────────┘   │
              └───────────────────────────────────────────┘

Services Used#

Service	Purpose	Configuration
Kinesis Data Streams	Real-time data ingestion	10 shards, 24-hour retention, server-side encryption
Kinesis Data Firehose	Batch data delivery to S3	Buffer 60s/10MB, Parquet conversion, compression
DMS	Database CDC ingestion	Ongoing replication from RDS to S3
S3	Data lake storage	Lifecycle: Standard -> Glacier after 90d, partition by year/month/day
Glue	Data catalog & ETL	Crawlers for schema discovery, ETL jobs in PySpark
Lake Formation	Data lake governance	Column-level access control, row filtering
Athena	Ad-hoc SQL queries	Partition projection, columnar format (Parquet), compression
Redshift Spectrum	Complex analytics	External tables over S3 data, joins with Redshift local tables
QuickSight	Dashboards & visualization	SPICE engine for fast queries, ML Insights

Key Design Decisions#

Decision	Rationale
S3 as the single source of truth	Durable (11 9s), cheap ($23/TB/mo), scales infinitely
Parquet format + compression	50-80% less storage vs CSV, 5-10x faster queries in Athena
Partitioning by date	Most queries filter by date – partition pruning reduces scan costs
Raw vs Curated zones	Raw preserves original data; Curated is cleaned for analytics
Glue over custom ETL	Serverless, schema discovery, job scheduling built-in
Lake Formation for governance	Centralized permissions across Glue, Athena, QuickSight

Real-World Use Case#

Scenario: A media company ingests 5TB/day of user activity logs, ad impressions, and content metadata.

Data flow:

Ingestion: App servers stream events to Kinesis (5K events/sec)
Storage: Firehose batches data to S3 every 60 seconds in Parquet format
Catalog: Glue crawlers run every 6 hours to update the Data Catalog
Transform: Glue ETL jobs clean raw data, join with reference tables, write curated zones
Query: Analysts query via Athena ($5/TB scanned) for ad-hoc exploration
Visualize: QuickSight dashboards show real-time metrics to the executive team

Glue ETL Job Example#

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Read raw data from S3
raw_df = glueContext.create_dynamic_frame.from_catalog(
    database="company_lake", table_name="raw_user_activity")

# Clean and transform
cleaned = raw_df.drop_fields(["internal_metadata"]).filter(
    lambda row: row["event_type"].isNotNull())

# Write curated Parquet back to S3 with partitioning
glueContext.write_dynamic_frame.from_catalog(
    frame=cleaned, database="company_lake",
    table_name="curated_user_activity",
    additional_options={"partitionKeys": ["year", "month", "day"]})

Athena Query Example#

-- Daily active users by content category
SELECT date_trunc('day', event_time) as day,
       content_category,
       count(distinct user_id) as dau
FROM curated_user_activity
WHERE event_type = 'page_view'
  AND year = '2024' AND month = '12'
GROUP BY 1, 2
ORDER BY 1, 3 DESC;

-- Cost: scanned ~120GB, ~$0.60 per query

Cost Estimate (Monthly)#

Service	Estimated Cost
S3 Storage (150TB)	~$3,450
Kinesis Data Streams	~$720
Kinesis Firehose	~$300
Glue ETL (200 hours)	~$880
Athena queries (50TB scanned)	~$250
QuickSight (10 authors)	~$240
Total	~$5,840/month

Key Exam Takeaways#

S3 is the foundation of any data lake – cheap, durable, infinitely scalable
Parquet + compression + partitioning = Athena cost optimization trifecta
Glue Crawlers populate the Data Catalog automatically
Lake Formation provides column/row-level security for data lakes
Kinesis Firehose is the easiest way to get streaming data into S3
Redshift Spectrum queries S3 directly without loading data
Athena is serverless – pay per query, no cluster to manage

Data Lake Analytics#

Architecture Overview#

Services Used#

Key Design Decisions#

Real-World Use Case#

Glue ETL Job Example#

Athena Query Example#

Cost Estimate (Monthly)#

Key Exam Takeaways#

References#