Data Lake Analytics#
A scalable data lake for analytics using S3, Glue, Athena, Kinesis, and QuickSight — ingest, catalog, query, and visualize data at petabyte scale with minimal operational overhead.
Architecture Overview#
┌─────────────────────────────────────┐
│ Data Sources │
│ ┌─────┐ ┌─────┐ ┌──────┐ ┌─────┐ │
│ │ App │ │ IoT │ │ DB │ │ 3rd │ │
│ │ Logs│ │Data │ │ CDC │ │Party│ │
│ └──┬──┘ └──┬──┘ └──┬───┘ └──┬──┘ │
└─────┼───────┼───────┼────────┼──────┘
│ │ │ │
┌────────────┼───────┼───────┼────────┼──────┐
│ ┌──────▼──┐ ┌──▼───┐ ┌▼──────┐ ┌▼───┐ │
│ │Kinesis │ │Kinesis│ │DMS │ │App │ │
│ │Streams │ │Firehose│ │(CDC) │ │Sync│ │
│ └──────┬──┘ └──┬───┘ └──┬────┘ └──┬──┘ │
│ │ │ │ │ │
│ ┌──────▼───────▼────────▼─────────▼──┐ │
│ │ S3 Data Lake (Raw) │ │
│ │ s3://company-lake/raw/ │ │
│ └────────────────┬───────────────────┘ │
│ │ │
│ ┌────────────────▼───────────────────┐ │
│ │ AWS Glue (Crawlers + ETL) │ │
│ │ • Crawl S3 to populate catalog │ │
│ │ • ETL: clean, transform, enrich │ │
│ │ • Partition data by date/region │ │
│ └────────────────┬───────────────────┘ │
│ │ │
│ ┌────────────────▼───────────────────┐ │
│ │ S3 Data Lake (Curated) │ │
│ │ s3://company-lake/curated/ │ │
│ │ Parquet format, partitioned │ │
│ └──┬────────────┬────────────┬────────┘ │
│ │ │ │ │
│ ┌────▼───┐ ┌────▼────┐ ┌────▼──────┐ │
│ │ Athena │ │Redshift │ │QuickSight │ │
│ │ (SQL) │ │Spectrum │ │(BI/Vis) │ │
│ └────────┘ └─────────┘ └───────────┘ │
└───────────────────────────────────────────┘Services Used#
| Service | Purpose | Configuration |
|---|---|---|
| Kinesis Data Streams | Real-time data ingestion | 10 shards, 24-hour retention, server-side encryption |
| Kinesis Data Firehose | Batch data delivery to S3 | Buffer 60s/10MB, Parquet conversion, compression |
| DMS | Database CDC ingestion | Ongoing replication from RDS to S3 |
| S3 | Data lake storage | Lifecycle: Standard -> Glacier after 90d, partition by year/month/day |
| Glue | Data catalog & ETL | Crawlers for schema discovery, ETL jobs in PySpark |
| Lake Formation | Data lake governance | Column-level access control, row filtering |
| Athena | Ad-hoc SQL queries | Partition projection, columnar format (Parquet), compression |
| Redshift Spectrum | Complex analytics | External tables over S3 data, joins with Redshift local tables |
| QuickSight | Dashboards & visualization | SPICE engine for fast queries, ML Insights |
Key Design Decisions#
| Decision | Rationale |
|---|---|
| S3 as the single source of truth | Durable (11 9s), cheap ($23/TB/mo), scales infinitely |
| Parquet format + compression | 50-80% less storage vs CSV, 5-10x faster queries in Athena |
| Partitioning by date | Most queries filter by date – partition pruning reduces scan costs |
| Raw vs Curated zones | Raw preserves original data; Curated is cleaned for analytics |
| Glue over custom ETL | Serverless, schema discovery, job scheduling built-in |
| Lake Formation for governance | Centralized permissions across Glue, Athena, QuickSight |
Real-World Use Case#
Scenario: A media company ingests 5TB/day of user activity logs, ad impressions, and content metadata.
Data flow:
- Ingestion: App servers stream events to Kinesis (5K events/sec)
- Storage: Firehose batches data to S3 every 60 seconds in Parquet format
- Catalog: Glue crawlers run every 6 hours to update the Data Catalog
- Transform: Glue ETL jobs clean raw data, join with reference tables, write curated zones
- Query: Analysts query via Athena ($5/TB scanned) for ad-hoc exploration
- Visualize: QuickSight dashboards show real-time metrics to the executive team
Glue ETL Job Example#
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
# Read raw data from S3
raw_df = glueContext.create_dynamic_frame.from_catalog(
database="company_lake", table_name="raw_user_activity")
# Clean and transform
cleaned = raw_df.drop_fields(["internal_metadata"]).filter(
lambda row: row["event_type"].isNotNull())
# Write curated Parquet back to S3 with partitioning
glueContext.write_dynamic_frame.from_catalog(
frame=cleaned, database="company_lake",
table_name="curated_user_activity",
additional_options={"partitionKeys": ["year", "month", "day"]})Athena Query Example#
-- Daily active users by content category
SELECT date_trunc('day', event_time) as day,
content_category,
count(distinct user_id) as dau
FROM curated_user_activity
WHERE event_type = 'page_view'
AND year = '2024' AND month = '12'
GROUP BY 1, 2
ORDER BY 1, 3 DESC;
-- Cost: scanned ~120GB, ~$0.60 per queryCost Estimate (Monthly)#
| Service | Estimated Cost |
|---|---|
| S3 Storage (150TB) | ~$3,450 |
| Kinesis Data Streams | ~$720 |
| Kinesis Firehose | ~$300 |
| Glue ETL (200 hours) | ~$880 |
| Athena queries (50TB scanned) | ~$250 |
| QuickSight (10 authors) | ~$240 |
| Total | ~$5,840/month |
Key Exam Takeaways#
- S3 is the foundation of any data lake – cheap, durable, infinitely scalable
- Parquet + compression + partitioning = Athena cost optimization trifecta
- Glue Crawlers populate the Data Catalog automatically
- Lake Formation provides column/row-level security for data lakes
- Kinesis Firehose is the easiest way to get streaming data into S3
- Redshift Spectrum queries S3 directly without loading data
- Athena is serverless – pay per query, no cluster to manage