AWS Analytics

Analytics on AWS spans real-time streaming, batch warehousing, and ad-hoc SQL queries. Each service solves a different problem, so knowing where each one fits is key to building a cost-effective data architecture.

Service Map

Real-time Streaming
├── Kinesis Data Streams     — real-time ingestion, shard-based
├── Kinesis Data Firehose    — near-real-time delivery to storage
└── Kinesis Data Analytics   — SQL-based stream processing

Ad-hoc Query
├── Athena                   — SQL queries directly on S3 data
└── OpenSearch               — search and log analytics

Data Warehouse
└── Redshift                 — petabyte-scale, columnar, SQL

ETL / Data Processing
├── Glue                     — managed ETL, crawlers, data catalog
└── EMR                      — managed Hadoop/Spark clusters

Data Lake
└── Lake Formation           — unified data lake governance
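
Because Kinesis Data Streams is shard-based, capacity planning for a provisioned stream largely comes down to counting shards. The following is a minimal sketch, assuming the documented per-shard write limits of 1 MB/s and 1,000 records/s. The function name `estimate_shards` and the `headroom` parameter are hypothetical, and 1 MB is approximated as 10^6 bytes.

```python
import math

# Per-shard write limits for a provisioned stream.
# Assumption: 1 MB is approximated as 1,000,000 bytes.
SHARD_WRITE_BYTES_PER_SEC = 1_000_000
SHARD_WRITE_RECORDS_PER_SEC = 1_000


def estimate_shards(records_per_sec: float,
                    avg_record_bytes: float,
                    headroom: float = 0.0) -> int:
    """Return the smallest shard count that satisfies both write limits.

    headroom is a hypothetical safety margin, e.g. 0.5 adds 50% capacity.
    """
    if records_per_sec < 0 or avg_record_bytes < 0 or headroom < 0:
        raise ValueError("inputs must be non-negative")

    # A shard is constrained by record count and by byte volume;
    # whichever limit is hit first determines the shard count.
    by_records = records_per_sec / SHARD_WRITE_RECORDS_PER_SEC
    by_bytes = records_per_sec * avg_record_bytes / SHARD_WRITE_BYTES_PER_SEC

    needed = max(by_records, by_bytes) * (1 + headroom)
    return max(1, math.ceil(needed))  # a stream always has at least one shard
```

Many small records are bound by the record-count limit, while fewer large records are bound by the byte limit, so the estimate takes the larger of the two. On-demand streams scale automatically and may not need this calculation at all.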

Real-time vs Batch Decision

Workload                                                        Service
Ingest millions of events/sec, process in real time             Kinesis Data Streams + Kinesis Data Analytics
Deliver data to S3/Redshift/OpenSearch with minimal processing  Kinesis Data Firehose
Ad-hoc SQL on log files or a data lake                          Athena
Dashboarding and BI on structured data                          Redshift
ETL between databases and data lakes                            Glue
Large-scale distributed processing (Spark, Hadoop)              EMR
Full-text search on large datasets                              OpenSearch
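
The workload-to-service mapping above can also be written as a simple lookup, which is sometimes handy in architecture checklists or review tooling. This is only an illustrative sketch: the `Workload` enum, `SERVICE_FOR`, and `recommend` are hypothetical names, and the mapping just mirrors the rows above rather than encoding any official guidance.

```python
from enum import Enum
from typing import Dict, Tuple


class Workload(Enum):
    """Hypothetical labels for the workload rows listed above."""
    STREAM_PROCESSING = "ingest and process events in real time"
    STREAM_DELIVERY = "deliver data to storage with minimal processing"
    ADHOC_SQL = "ad-hoc SQL on log files or a data lake"
    BI_WAREHOUSE = "dashboarding and BI on structured data"
    ETL = "ETL between databases and data lakes"
    DISTRIBUTED_PROCESSING = "large-scale Spark/Hadoop processing"
    FULL_TEXT_SEARCH = "full-text search on large datasets"


# One entry per workload; a tuple because some workloads pair two services.
SERVICE_FOR: Dict[Workload, Tuple[str, ...]] = {
    Workload.STREAM_PROCESSING: ("Kinesis Data Streams", "Kinesis Data Analytics"),
    Workload.STREAM_DELIVERY: ("Kinesis Data Firehose",),
    Workload.ADHOC_SQL: ("Athena",),
    Workload.BI_WAREHOUSE: ("Redshift",),
    Workload.ETL: ("Glue",),
    Workload.DISTRIBUTED_PROCESSING: ("EMR",),
    Workload.FULL_TEXT_SEARCH: ("OpenSearch",),
}


def recommend(workload: Workload) -> Tuple[str, ...]:
    """Return the service(s) the mapping associates with a workload."""
    return SERVICE_FOR[workload]
```

Real decisions usually weigh cost, latency, and team skills as well, so a lookup like this is a starting point rather than a rule.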

Data Flow Patterns

Lambda Architecture (classic)

Real-time layer: Kinesis → Kinesis Data Analytics (SQL) → DynamoDB/OpenSearch
Batch layer: S3 → Glue → Redshift
Serving layer: Merge real-time + batch results for queries
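
The serving-layer merge can be sketched as combining a precomputed batch view with the increments the real-time layer has seen since the last batch run. The example below is a simplified illustration using in-memory dictionaries in place of Redshift and DynamoDB. The function name `merge_views` and the page-view counts are hypothetical, and it assumes the real-time view only holds events newer than the batch cutoff so nothing is counted twice.

```python
from collections import Counter
from typing import Dict, Mapping


def merge_views(batch_view: Mapping[str, int],
                realtime_view: Mapping[str, int]) -> Dict[str, int]:
    """Combine batch totals with counts accumulated since the last batch run.

    Assumption: realtime_view covers only events after the batch cutoff.
    """
    merged = Counter(batch_view)
    merged.update(realtime_view)  # Counter.update adds counts key by key
    return dict(merged)


# Hypothetical page-view counts per page.
batch = {"home": 100, "cart": 40}        # e.g. loaded from the warehouse
realtime = {"home": 3, "checkout": 1}    # e.g. read from the low-latency store
combined = merge_views(batch, realtime)
```

When the next batch run completes, the real-time view is reset to the new cutoff, which is what keeps the two layers from overlapping.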

Kappa Architecture (simplified)

Kinesis → Kinesis Data Analytics (continuous SQL) → serving layer
S3 as the long-term immutable log for replay (no separate batch layer)
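
The core idea of Kappa is that one processing path handles both live events and reprocessing: to rebuild a view, the same code is replayed over the stored log. A minimal sketch under simplifying assumptions follows. The event shape `(key, amount)` and the names `apply_event` and `build_state` are hypothetical, and a Python list stands in for the log.

```python
from typing import Dict, Iterable, Tuple

Event = Tuple[str, int]  # hypothetical event shape: (key, amount)


def apply_event(state: Dict[str, int], event: Event) -> None:
    """The single processing step, used for live events and replays alike."""
    key, amount = event
    state[key] = state.get(key, 0) + amount


def build_state(log: Iterable[Event]) -> Dict[str, int]:
    """Rebuild a view from scratch by replaying the immutable log."""
    state: Dict[str, int] = {}
    for event in log:
        apply_event(state, event)
    return state


log = [("a", 1), ("b", 2), ("a", 3)]

# Live path: state is updated as each event arrives.
live_state = build_state(log[:2])
apply_event(live_state, log[2])

# Replay path: the same code run over the full log yields the same view.
replayed_state = build_state(log)
```

Because both paths share `apply_event`, a logic change only has to be made once and then replayed, which is the simplification Kappa offers over maintaining separate batch and real-time code.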

Modern Data Stack

Ingestion: DMS, Firehose, Kafka Connect
Storage: S3 (raw) + S3 (processed)
Catalog: Glue Data Catalog
Processing: Glue ETL, Athena, Redshift Spectrum
Visualization: QuickSight, Tableau, Grafana
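
One common way to organize the raw and processed S3 zones so that the Glue Data Catalog and Athena can prune data by date is Hive-style partitioning, where each prefix segment is written as key=value. The sketch below builds such object keys. The zone names, the dataset name, and the function `partitioned_key` are hypothetical, and the year/month/day scheme is just one possible layout.

```python
from datetime import datetime, timezone


def partitioned_key(zone: str, dataset: str,
                    event_time: datetime, filename: str) -> str:
    """Build a Hive-style partitioned S3 object key (key=value segments)."""
    if event_time.tzinfo is None:
        # Refuse naive timestamps so partitions are not silently
        # computed in the machine's local time zone.
        raise ValueError("event_time must be timezone-aware")

    t = event_time.astimezone(timezone.utc)  # partition on UTC dates
    return (f"{zone}/{dataset}/"
            f"year={t:%Y}/month={t:%m}/day={t:%d}/{filename}")


ts = datetime(2024, 1, 15, 23, 30, tzinfo=timezone.utc)
raw_key = partitioned_key("raw", "clickstream", ts, "part-0000.json")
processed_key = partitioned_key("processed", "clickstream", ts, "part-0000.parquet")
```

Queries that filter on the partition columns can then skip unrelated prefixes, which matters for Athena because it bills by the amount of data scanned.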