Scalability

Scalability is the ability of a system to handle increased load by adding resources. The key question isn’t “how fast is it now” but “how does performance change as load increases” — and “what does it cost to double capacity.”

Vertical vs Horizontal Scaling

Vertical Scaling (Scale Up)

Add more power to the existing machine: more CPU, RAM, disk IOPS.

Before: 4 cores, 16GB RAM, 10K IOPS  →  handles1,000 RPS
After:  64 cores, 256GB RAM, 100K IOPS → handles ~10,000 RPS

Pros: No code changes, shared memory (no distributed state complexity), lower latency. Cons: Hardware ceiling, single point of failure, maintenance requires downtime.

Horizontal Scaling (Scale Out)

Add more machines to the pool.

Before: 2 app servers → handles 1,000 RPS
After:  20 app servers → handles10,000 RPS

Pros: Near-unlimited scale, fault tolerance (lose one node, pool absorbs it). Cons: Requires stateless design, session affinity considerations, distributed state management.

The Architecture Decision

Factor	Go Vertical	Go Horizontal
Team size	Small team	Large team with platform/SRE
Growth curve	Predictable, slow	Unpredictable or fast
Complexity tolerance	Low	High
Failure tolerance	Single node OK	Need resilience
Latency sensitivity	Very high (shared memory)	Moderate (stateless is fast enough)
Data layer	Single-writer DB	Sharded or distributed DB

Most architectures start vertical, then shift to horizontal when they hit hardware limits or need fault tolerance. A hybrid approach (vertical for data layer, horizontal for app layer) is common.

The Scalability Formula

For any system, identify the bottleneck — the component that hits its limit first:

Capacity = min(
    CPU_capacity / CPU_per_request,
    Memory_capacity / Memory_per_request,
    Network_capacity / Network_per_request,
    Disk_IOPS_capacity / Disk_IOPS_per_request,
    DB_connections_capacity / DB_connections_per_request
)

The smallest ratio is your actual capacity. Optimizing the wrong ratio wastes money.

Stateless Architecture

Horizontal scaling requires stateless application servers. Every request must contain all information needed to process it — no server-local session state.

# Stateless: session data lives in external store
GET /api/users/123/session
  → Redis lookup (session store) → return user data
  → Any app server can handle this request

# Stateful: session data lives on the server
GET /api/users/123/session
  → Local memory lookup → return user data
  → Only server-123 can handle this request (sticky session)

Rule: If you need state, push it to an external store (Redis, Memcached, DB). App servers are cattle, not pets.

Database Scalability Patterns

The database is almost always the scalability bottleneck. Solutions:

Read Replicas

Primary (writes) → Replica 1 (reads) → Replica 2 (reads) → ...

Route read queries to replicas, writes to primary. Works for read-heavy workloads (90/10 read/write is common). Replication lag means replica reads may be slightly stale — acceptable for most use cases, not for financial transactions.

Sharding

Split data across multiple database instances by shard key:

User ID % 4 == 0 → Shard 0
User ID % 4 == 1 → Shard 1
User ID % 4 == 2 → Shard 2
User ID % 4 == 3 → Shard 3

Shard key selection is critical. A bad shard key creates hot spots (one shard gets all writes for a power user). Common shard keys: user ID, geographic region, time-based (with caution).

CQRS (Command Query Responsibility Segregation)

Separate read and write models. Write to a optimized write store, project to a optimized read store:

Write path: API → Write DB (normalized, write-optimized)
Read path:  API → Read DB (denormalized, read-optimized)

Event: UserUpdated
  → Projection writes to read store
  → Report table updated asynchronously

High complexity, high payoff for write-heavy workloads with complex read patterns.

Caching for Scale

Caching reduces database load by serving repeated reads from memory:

Request → Cache hit? → return cached
 ↓ miss
                 DB query → cache result → return

Cache hit rate is the most important metric:

Hit rate = (cache hits) / (cache hits + cache misses)

90% hit rate → 10x reduction in DB load
99% hit rate → 100x reduction in DB load

Each additional 9 in hit rate is harder to achieve than the last. Diminishing returns kick in fast.

Auto-Scaling

Dynamically add/remove capacity based on demand:

# Kubernetes HPA example
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Scaling thresholds must account for:

Cold start time — new instances take 30-120 seconds to be ready
Cooldown period — prevent thrashing (scale up, then immediately scale down)
Predictable spikes — auto-scale reacts, scheduled scale pre-empts (daily traffic patterns, product launches)

Scaling Laws

Universal Scalability Law (USL) — as you add nodes, throughput increases, but not linearly. There’s overhead from coordination (locks, network messages, data partitioning). Eventually, adding nodes actually decreases throughput.

理想: Throughput ∝ N_nodes (linear)
现实: Throughput = N_nodes / (1 + α(N_nodes-1) + β(N_nodes-1)(N_nodes-2))
                                    ↑    ↑
 contention coordination

Key insight: Amdahl’s law says the fraction of the system that can’t be parallelized caps your maximum speedup. If10% of your code is serial, maximum speedup is 10x regardless of how many nodes you add.

When to Scale

Metrics that signal you need to scale:

CPU utilization > 70-80% sustained — headroom for spikes disappears
P99 latency increasing — system is approaching its limit
Queue depth growing — message queue backing up means producers outpace consumers
Error rate > 0.1% — system is in overload, dropping requests

Don’t wait for failures. Scale when utilization crosses60-70%, not when it hits 100%.

Common Scalability Mistakes

Premature sharding — adds massive complexity before you need it
Ignoring the bottleneck — scaling the wrong component (more app servers when DB is the bottleneck)
No connection pooling — DB connections are finite and slow to establish
Synchronous cache invalidation — invalidation on every write adds latency
No caching strategy — every request hits the database
Shard key hotspot — time-based shard keys cause hot shards during peak periods

Performance — latency and throughput fundamentals
Back-of-the-Envelope Calculations — quick capacity estimates
Performance Testing — load testing methodology
Caching — cache patterns and hit rates

cloudnative wiki

Explorer

Scalability

Scalability

Vertical vs Horizontal Scaling

Vertical Scaling (Scale Up)

Horizontal Scaling (Scale Out)

The Architecture Decision

The Scalability Formula

Stateless Architecture

Database Scalability Patterns

Read Replicas

Sharding

CQRS (Command Query Responsibility Segregation)

Caching for Scale

Auto-Scaling

Scaling Laws

When to Scale

Common Scalability Mistakes

Graph View

Table of Contents

Backlinks

cloudnative wiki

Explorer

Scalability

Scalability

Vertical vs Horizontal Scaling

Vertical Scaling (Scale Up)

Horizontal Scaling (Scale Out)

The Architecture Decision

The Scalability Formula

Stateless Architecture

Database Scalability Patterns

Read Replicas

Sharding

CQRS (Command Query Responsibility Segregation)

Caching for Scale

Auto-Scaling

Scaling Laws

When to Scale

Common Scalability Mistakes

Related

Graph View

Table of Contents

Backlinks