Memory Leaks

A memory leak is memory that stays allocated after the program no longer needs it. In garbage-collected languages, the collector can’t free it because something still holds a reference; with manual memory management, the code never releases it.

In long-running processes (servers, agents, batch jobs), leaks compound until the process runs out of memory and crashes.

ELI5

You ask the kitchen for a plate → they give you one → you eat, but NEVER return the plate.
Eventually the kitchen has no plates left → can't serve anyone → crash.

In code: you malloc() but never free(). Or in GC’d languages, you hold references to objects you no longer need.

Common Causes

1. Unbounded Caches

# ❌ Leaky: cache grows forever
cache = {}
 
def get_user(user_id):
    if user_id not in cache:
        cache[user_id] = db.fetch_user(user_id)
    return cache[user_id]

Fix: Bound the cache, e.g. functools.lru_cache with a maxsize, or a TTL-based cache that also evicts stale entries.

# ✅ Fixed: bounded LRU cache
from functools import lru_cache
 
@lru_cache(maxsize=1000)
def get_user(user_id):
    return db.fetch_user(user_id)
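
The TTL option can be sketched with the standard library alone. This is a minimal, single-threaded sketch, not a drop-in library: the class name BoundedTTLCache and its parameters are hypothetical (third-party packages such as cachetools ship a maintained TTLCache). Size-based LRU eviction keeps memory bounded even if expired entries are never read again.

```python
import time
from collections import OrderedDict


class BoundedTTLCache:
    """Hypothetical helper: LRU eviction by size plus per-entry expiry."""

    def __init__(self, maxsize=1000, ttl_seconds=300.0, clock=time.monotonic):
        self.maxsize = maxsize
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value), least recently used first

    def get(self, key, loader):
        now = self.clock()
        entry = self._data.get(key)
        if entry is not None and entry[0] > now:
            self._data.move_to_end(key)  # fresh hit: mark as most recently used
            return entry[1]
        value = loader(key)  # miss or expired: reload from the source
        self._data[key] = (now + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict the least recently used entry
        return value

    def __len__(self):
        return len(self._data)
```

Usage, with the same placeholder db as above: create cache = BoundedTTLCache(maxsize=1000, ttl_seconds=60), and have get_user(user_id) return cache.get(user_id, db.fetch_user).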

2. Event Listener Accumulation

// ❌ Leaky: new listener added on every request
app.get('/subscribe', (req, res) => {
  eventEmitter.on('update', () => {
    res.send('notification');
  });
});

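Fix: Remove the listener when the request ends, or use a one-time listener. once() alone still leaves a listener behind if the client disconnects before any 'update' event fires, so pair it with cleanup when the response closes.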

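// ✅ Fixed: one-time listener, also removed if the client disconnects first
app.get('/subscribe', (req, res) => {
  const onUpdate = () => res.send('notification');
  eventEmitter.once('update', onUpdate);
  res.on('close', () => eventEmitter.off('update', onUpdate));
});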
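
The same add/remove discipline applies to in-process event systems in other languages. A minimal Python sketch, assuming a hypothetical Emitter class (not a real library) whose on() returns an unsubscribe function; the subscribed() context manager guarantees removal even when the block exits through an exception.

```python
from contextlib import contextmanager


class Emitter:
    """Hypothetical minimal event emitter, for illustration only."""

    def __init__(self):
        self._listeners = {}  # event name -> list of callbacks

    def on(self, event, callback):
        self._listeners.setdefault(event, []).append(callback)

        def unsubscribe():
            callbacks = self._listeners.get(event, [])
            if callback in callbacks:
                callbacks.remove(callback)

        return unsubscribe

    def emit(self, event, *args):
        # Iterate over a copy so callbacks may unsubscribe while being called.
        for callback in list(self._listeners.get(event, [])):
            callback(*args)

    def listener_count(self, event):
        return len(self._listeners.get(event, []))


@contextmanager
def subscribed(emitter, event, callback):
    """Keep callback registered only for the duration of the with block."""
    unsubscribe = emitter.on(event, callback)
    try:
        yield
    finally:
        unsubscribe()  # runs on normal exit and on exceptions
```

Because removal sits in a finally clause, a request handler that fails halfway cannot leave its listener behind.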

3. Global State Accumulators

# ❌ Leaky: list grows unbounded
connected_users = []
 
def on_user_connect(user):
    connected_users.append(user)  # never removed

Fix: Remove entries when their lifecycle ends (e.g. on disconnect), or use a bounded structure.
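
A minimal sketch of both options, assuming hypothetical on_user_connect/on_user_disconnect hooks and a User object with a user_id: an explicit dict that is cleaned up on disconnect, and a weakref.WeakSet that forgets users once nothing else references them.

```python
import weakref
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False keeps identity hashing, which WeakSet needs
class User:  # hypothetical user object
    user_id: int


# Option 1: explicit lifecycle, keyed by id so reconnects don't add duplicates.
connected_users = {}


def on_user_connect(user):
    connected_users[user.user_id] = user


def on_user_disconnect(user):
    connected_users.pop(user.user_id, None)  # no KeyError if already removed


# Option 2: entries vanish automatically once the last strong reference is gone.
active_sessions = weakref.WeakSet()
```

A WeakSet only helps when the set is the sole remaining owner; if another structure still references a user, that user stays in memory.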

4. Closures Holding References

# ❌ Leaky: closure captures large object permanently
def create_handler(large_dataframe):
    def handler(request):
        return process(large_dataframe)  # large_dataframe lives as long as handler
    return handler

Fix: Don’t capture large objects in closures that outlive their use; capture only the values the closure actually needs, or pass the data in per call.
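
To check whether a closure pins a large object, take a weak reference to it, drop your own reference, and see whether it survives garbage collection. A minimal sketch; LargeDataset is a hypothetical stand-in for a DataFrame and the handler bodies are illustrative.

```python
import gc
import weakref


class LargeDataset:
    """Hypothetical stand-in for a big object such as a DataFrame."""

    def __init__(self, rows):
        self.rows = rows


def create_leaky_handler(large_dataset):
    def handler(request):
        # References large_dataset, so the closure keeps the whole object alive.
        return f"{request}: {len(large_dataset.rows)} rows"
    return handler


def create_handler(large_dataset):
    # Derive what the handler needs up front; only row_count is captured.
    row_count = len(large_dataset.rows)

    def handler(request):
        return f"{request}: {row_count} rows"
    return handler


def dataset_is_freed(factory):
    """Build a handler, drop our own reference, report whether the dataset was collected."""
    data = LargeDataset(list(range(100_000)))
    ref = weakref.ref(data)
    handler = factory(data)
    del data
    gc.collect()
    return ref() is None and handler("GET /stats") == "GET /stats: 100000 rows"


print(dataset_is_freed(create_leaky_handler))  # False: the closure pins the dataset
print(dataset_is_freed(create_handler))        # True: only row_count survives
```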

5. Database Connections Not Closed

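# ❌ Leaky: connection opened, never closed
import psycopg2

def get_data():
    conn = psycopg2.connect(DATABASE_URL)
    cur = conn.cursor()  # psycopg2 connections have no execute(); queries go through a cursor
    cur.execute("SELECT * FROM events")
    return cur.fetchall()  # conn.close() is never called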

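Fix: Close the connection in a finally block or with a context manager that really closes it. Caveat: in psycopg2, "with psycopg2.connect(...) as conn" only commits or rolls back the transaction on exit; it does not close the connection. For services, prefer a connection pool so connections are reused rather than opened per call.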

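# ✅ Fixed: closing() guarantees conn.close(); psycopg2's own "with conn" would not
import psycopg2
from contextlib import closing

def get_data():
    with closing(psycopg2.connect(DATABASE_URL)) as conn, conn.cursor() as cur:
        cur.execute("SELECT * FROM events")
        return cur.fetchall()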
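
In both psycopg2 and the standard library's sqlite3 module, using the connection itself as a context manager only commits or rolls back the transaction; the connection stays open. sqlite3 makes this easy to check locally without a database server. A minimal sketch using in-memory databases:

```python
import sqlite3
from contextlib import closing


def is_open(conn):
    """Return True if the connection still accepts queries."""
    try:
        conn.execute("SELECT 1")
        return True
    except sqlite3.ProgrammingError:  # raised when operating on a closed connection
        return False


# "with conn" manages the transaction only; the connection survives the block.
with sqlite3.connect(":memory:") as conn:
    conn.execute("CREATE TABLE events (id INTEGER)")
still_open_after_with = is_open(conn)

# closing() calls conn2.close() when the block exits, even on error.
with closing(sqlite3.connect(":memory:")) as conn2:
    conn2.execute("CREATE TABLE events (id INTEGER)")
still_open_after_closing = is_open(conn2)

conn.close()  # release the first connection explicitly
```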

Detection

Python

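# tracemalloc (stdlib): trace allocations from startup, keeping 25 frames each
python -X tracemalloc=25 app.py   # app.py = your service's entry point

# objgraph (third-party): must run inside the suspect process, e.g. a debug endpoint or REPL
pip install objgraph
# then, in that process:
import objgraph
objgraph.show_most_common_types(limit=20)
objgraph.show_growth(limit=10)  # types whose counts grew since the last call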

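Within a running process, tracemalloc can diff two snapshots to show which source lines gained memory in between. A minimal sketch; leaky_work is a deliberately leaking stand-in for real workload code, and the 25-frame depth is an arbitrary choice.

```python
import tracemalloc

_leak = []  # deliberately unbounded so the diff has something to find


def leaky_work(n):
    """Simulated leak: appends n new strings to a module-level list."""
    for i in range(n):
        _leak.append("x" * 100 + str(i))


def top_growth(workload, limit=5):
    """Run workload between two snapshots; return the largest per-line changes."""
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start(25)  # keep up to 25 frames per allocation
    before = tracemalloc.take_snapshot()
    workload()
    after = tracemalloc.take_snapshot()
    if started_here:
        tracemalloc.stop()  # snapshots already taken remain usable
    # compare_to sorts by absolute size difference, biggest first
    return after.compare_to(before, "lineno")[:limit]


for stat in top_growth(lambda: leaky_work(10_000)):
    print(stat)  # file:line with size and count deltas
```

In a service, take the two snapshots some minutes apart under steady load; lines whose size keeps growing across several diffs are leak suspects.
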
Go

# pprof heap profile; the service must import net/http/pprof and serve HTTP on :6060
go tool pprof http://localhost:6060/debug/pprof/heap

Process-level (Linux)

# Watch RSS of one process over time (pidstat ships with sysstat)
PID=$(pgrep -o -f myservice)  # -o: oldest match, so exactly one PID
pidstat -r -p "$PID" 1

# Or, without sysstat
while true; do
  echo "$(date): $(ps -o rss= -p "$PID") KB"
  sleep 10
done
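
The same RSS reading is available from Python by parsing VmRSS from /proc, which lets a service or test harness flag sustained growth itself. A minimal, Linux-only sketch; the "rose on each of the last 5 intervals" rule is an illustrative heuristic, not a standard, and RSS can legitimately climb during warm-up.

```python
def rss_kb(pid="self"):
    """Return resident set size in kB from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # line looks like "VmRSS:   12345 kB"
    raise RuntimeError(f"VmRSS not found for pid {pid}")


def looks_like_leak(samples_kb, window=5):
    """Heuristic: RSS rose on every one of the last `window` intervals."""
    recent = samples_kb[-(window + 1):]
    if len(recent) < window + 1:
        return False  # not enough samples yet
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

Sampled every 10 seconds like the shell loop above, looks_like_leak(samples) returns True only after five consecutive increases.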

Prevention Checklist

□ Bounded caches (LRU with maxsize, or TTL eviction)
□ Event listeners removed when no longer needed
□ Global state has explicit lifecycle management
□ Closures don't capture large/heavy objects
□ DB connections closed deterministically (closing()/finally, or returned to a pool)
□ Background jobs / agents have max lifetime + restart policy
□ Health checks include memory metrics
□ Crash-only design: OOM kills process, orchestrator restarts
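
The "max lifetime + restart policy" item can be as simple as a worker that exits on its own after a fixed number of jobs or seconds and lets the orchestrator start a fresh process; Gunicorn's --max-requests and Celery's worker_max_tasks_per_child apply the same idea. A minimal sketch; get_job, handle_job, the None-means-done convention, and the limits are all hypothetical.

```python
import time


def run_worker(get_job, handle_job, max_jobs=10_000, max_seconds=3600.0,
               clock=time.monotonic):
    """Process jobs until a lifetime limit is hit, then return so the process can exit.

    A slow leak is capped: the process never lives long enough to exhaust memory,
    and the orchestrator restarts it with a clean heap.
    """
    started = clock()
    done = 0
    while done < max_jobs and clock() - started < max_seconds:
        job = get_job()
        if job is None:  # hypothetical convention: None means no more work
            break
        handle_job(job)
        done += 1
    return done
```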

Architecture Impact

For solution architects, memory leaks in data-plane components (sidecar proxies, agents, middleware) are more severe than leaks in batch workers: every request passes through them, so a leak degrades all the traffic they carry and can cascade into failures across services.

                    ┌─────────────┐
Service A ─────────▶│   Envoy     │ ◀── leak here = all services affected
                    │  (sidecar)  │
                    └─────────────┘

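K8s: Set memory limits. A container that exceeds its limit is OOMKilled and restarted (per the pod's restartPolicy) instead of leaking indefinitely and starving other workloads on the node.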

resources:
  limits:
    memory: 256Mi  # over the limit: container is OOMKilled and restarted, other pods keep their memory
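
A process can also read the limit it runs under, which is handy for the "health checks include memory metrics" item: report the usage ratio and alert well before an OOMKill. A minimal sketch assuming the cgroup v2 unified hierarchy mounted at /sys/fs/cgroup, where memory.max holds the limit (or the word max) and memory.current the usage in bytes; on cgroup v1 the file names differ and the function simply returns None.

```python
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")  # assumption: cgroup v2 unified hierarchy


def parse_ratio(current_text, limit_text):
    """Turn memory.current / memory.max contents into a usage ratio (None if unlimited)."""
    limit = limit_text.strip()
    if limit == "max":
        return None  # no memory limit configured
    return int(current_text.strip()) / int(limit)


def memory_usage_ratio(root=CGROUP_ROOT):
    """Return usage/limit for the cgroup at root, or None if unknown or unlimited."""
    try:
        current = (root / "memory.current").read_text()
        limit = (root / "memory.max").read_text()
        return parse_ratio(current, limit)
    except (OSError, ValueError):
        return None  # files missing (e.g. cgroup v1, host root cgroup) or unparsable
```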

cloudnative wiki
