Amazon Kinesis

Kinesis is AWS’s real-time data streaming platform. Three services cover different parts of the streaming pipeline:

Data Sources (IoT, Logs, Events, Clicks)
    ↓
Kinesis Data Streams     ← Ingest and process real-time
    ↓
Kinesis Data Firehose   ← Deliver to S3/Redshift/OpenSearch (managed, no custom code)
    ↓
Kinesis Data Analytics  ← SQL/Flink queries on streams

Services at a Glance

ServiceUse WhenKey Benefit
Data StreamsYou need real-time consumers, replay, custom processingFull control, exactly-once, consumer groups
Data FirehoseYou just need data delivered to S3/RedshiftFully managed, no consumer apps needed
Data AnalyticsYou want to query streams with SQLManaged SQL/Flink, no cluster to manage

Data Streams vs Firehose

Kinesis Data Streams:
  Producer → Stream (shards) → Consumer (KCL/KPL) → Processing
  You manage: shard count, consumer apps, scaling

Kinesis Data Firehose:
  Producer → Firehose → Buffer → Transform (optional) → S3/Redshift/OpenSearch
  You manage: nothing — AWS handles buffering, delivery, retries

Common pattern: Data Streams for real-time processing + Firehose for durable S3 archival. Use Data Streams when you need to consume and process data in real-time. Use Firehose when you just need to land data in storage with minimal latency.

Kinesis Data Streams

  • Shards: Throughput unit. 1MB/s ingress, 2MB/s egress, 1,000 records/s per shard.
  • Producers: KPL (Producer Library) for high-throughput, batching, aggregation.
  • Consumers: KCL (Consumer Library) for automatic shard distribution, checkpointing, lease management.
  • Enhanced fan-out: Dedicated 2MB/s per consumer (vs shared 2MB/s in standard).
  • Scaling: Split or merge shards, or use on-demand mode for auto-scaling.
  • Retention: 24 hours default, up to 365 days with extended retention.
  • Max record size: 1MB.

Kinesis Data Firehose

  • Buffer: Configurable size (1-128MB) and interval (60-900s). Delivery triggers when either is reached.
  • Destinations: S3, Redshift (via S3 staging), Elasticsearch, Splunk, HTTP endpoint.
  • Transformation: Lambda invoked on each batch for transformation before delivery.
  • Formats: JSON, CSV, Parquet, ORC. Parquet recommended for analytical queries.
  • No replay: Firehose is append-only, no way to replay from a timestamp.

Kinesis Data Analytics

  • SQL applications: Write standard SQL against input streams. Tumbling, sliding, session windows.
  • Apache Flink: Full Flink API for complex event processing, custom state management.
  • Reference data: Enrich stream data with S3-hosted lookup tables (user profiles, product catalog).
  • Windows: STEP() for tumbling, FLOOR(timestamp TO HOUR) for sliding, SESSION(gap => INTERVAL) for session.
  • Output: Kinesis Streams, Firehose, Lambda.
  • Checkpointing: Automatic checkpoint to S3 for fault tolerance and exactly-once semantics.
  • Exactly-once: Guaranteed with Kinesis Streams destination. At-least-once with Firehose/Lambda.

Partition Key Strategy

Partition key determines which shard a record goes to:

Low cardinality (e.g., "all-events") → one shard → bottleneck
High cardinality (e.g., user_id, device_id) → even distribution → scales

For time-series data, include a time component in the partition key if you need ordering within a time window, or use a UUID if ordering doesn’t matter.

Cost Optimization

  • Aggregation (KPL): Combine multiple application records into one Kinesis record → more records per second per shard.
  • On-demand mode: Pay per stream-hour and per MB, auto-scales. Good for variable workloads.
  • Enhanced fan-out: Additional cost per shard-hour and per GB delivered. Only use when multiple consumers need dedicated throughput.
  • Firehose buffer tuning: Larger buffers (128MB, 900s) reduce S3 write frequency and cost.

Common Architecture Patterns

Lambda Architecture (classic)

Kinesis Streams → Kinesis Analytics (real-time) → DynamoDB/Kinesis Firehose
S3 → Glue → Redshift (batch layer)
Merge at query time (Athena, Redshift)

Kappa Architecture (simplified)

Kinesis Streams → Kinesis Analytics → Kinesis Firehose → S3
S3 as immutable log — no separate batch layer
Re-process by seeking to beginning of stream

Event Sourcing

User actions → Kinesis Streams → multiple consumers
  ├── Fraud detection (Lambda)
  ├── Personalization engine (Lambda)
  ├── Audit log archival (Firehose → S3)
  └── Real-time dashboards (Kinesis Analytics)