Memory Leaks
A memory leak is memory that a program has allocated and no longer needs but never releases. In garbage-collected languages, the collector can't free it because something still holds a reference; in manually managed languages, the code forgot to free it.
In long-running processes (servers, agents, batch jobs), leaks compound until the process runs out of memory and crashes.
ELI5
You ask the kitchen for a plate → they give you one → you eat, but
NEVER return the plate.
Eventually the kitchen has no plates left → can't serve anyone → crash.
In code: you malloc() but never free(). Or in GC’d languages, you hold references to objects you no longer need.
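To make the GC case concrete, here is a minimal sketch (CPython assumed) that watches an object through a weak reference. The `Plate` class and the `registry` list are hypothetical names used only for illustration.

```python
import gc
import weakref


class Plate:
    """Stand-in for any object the program allocates."""


registry = []                 # long-lived container that never lets go

plate = Plate()
probe = weakref.ref(plate)    # observes the object without keeping it alive
registry.append(plate)

del plate                     # the local name is gone...
gc.collect()
still_alive = probe() is not None   # ...but registry still references the object

registry.clear()              # drop the last strong reference
gc.collect()
freed = probe() is None       # now the collector can reclaim it
```

The object is only reclaimed once the last strong reference disappears, which is why forgotten containers are the usual culprit.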
Common Causes
1. Unbounded Caches
# ❌ Leaky: cache grows forever
cache = {}

def get_user(user_id):
    if user_id not in cache:
        cache[user_id] = db.fetch_user(user_id)
    return cache[user_id]

Fix: Use functools.lru_cache with max size, or TTL-based cache.
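For the TTL option, a minimal sketch of a cache that expires entries after a fixed time and also caps its size. `TTLCache`, its parameters, and the injectable `clock` are assumptions made for illustration, not an existing library API.

```python
import time


class TTLCache:
    """Small TTL cache sketch: entries expire, and total size is capped."""

    def __init__(self, ttl_seconds, max_entries, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._max = max_entries
        self._clock = clock
        self._data = {}  # key -> (expires_at, value); dicts keep insertion order

    def get(self, key, loader):
        now = self._clock()
        entry = self._data.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                 # fresh hit
        value = loader(key)                 # miss or expired: reload
        self._data.pop(key, None)           # re-insert so the key moves to the end
        self._data[key] = (now + self._ttl, value)
        self._evict(now)
        return value

    def _evict(self, now):
        # Drop expired entries first, then the oldest ones if still over the cap.
        for k in [k for k, (exp, _) in self._data.items() if exp <= now]:
            del self._data[k]
        while len(self._data) > self._max:
            del self._data[next(iter(self._data))]

    def __len__(self):
        return len(self._data)
```

Usage would look like `user_cache.get(user_id, db.fetch_user)`, where `user_cache` is a `TTLCache` instance. Memory stays bounded by `max_entries` regardless of how many distinct keys are requested.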
# ✅ Fixed: bounded LRU cache
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_user(user_id):
    return db.fetch_user(user_id)

2. Event Listener Accumulation
// ❌ Leaky: new listener added on every request
app.get('/subscribe', (req, res) => {
  eventEmitter.on('update', () => {
    res.send('notification');
  });
});

Fix: Remove listener when done, or use a once-off pattern.
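// ✅ Fixed: one-time listener, also removed if the client disconnects first
app.get('/subscribe', (req, res) => {
  const onUpdate = () => res.send('notification');
  eventEmitter.once('update', onUpdate);
  res.on('close', () => eventEmitter.off('update', onUpdate));
});

3. Global State Accumulators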
# ❌ Leaky: list grows unbounded
connected_users = []

def on_user_connect(user):
    connected_users.append(user)  # never removed

Fix: Use a bounded structure or explicitly manage lifecycle.
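Both options can be sketched together. This assumes user objects are hashable; `on_user_disconnect` and `recent_events` are hypothetical additions, not part of the original example.

```python
from collections import deque

connected_users = set()              # only users that are connected right now
recent_events = deque(maxlen=1000)   # bounded: oldest entries drop off automatically


def on_user_connect(user):
    connected_users.add(user)
    recent_events.append(("connect", user))


def on_user_disconnect(user):
    # discard() is a no-op if the user was never registered
    connected_users.discard(user)
    recent_events.append(("disconnect", user))
```

The set tracks lifecycle explicitly (every connect has a matching disconnect), while the deque shows the bounded-structure approach for data that only needs a recent window.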
4. Closures Holding References
# ❌ Leaky: closure captures large object permanently
def create_handler(large_dataframe):
    def handler(request):
        return process(large_dataframe)  # large_dataframe lives as long as handler
    return handler

Fix: Don’t capture large objects in closures if the closure outlives the use case.
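One way to apply this is to compute the small value the handler needs up front and capture only that. A minimal sketch, assuming CPython; `LargeTable` is a hypothetical stand-in for a large DataFrame.

```python
import gc
import weakref


class LargeTable:
    """Hypothetical stand-in for a large DataFrame."""

    def __init__(self, values):
        self.values = list(values)


def create_handler(table):
    # Derive the small value now; the closure captures `mean`, not `table`.
    mean = sum(table.values) / len(table.values)

    def handler(request):
        return f"{request}: mean={mean:.1f}"

    return handler


table = LargeTable(range(100_000))
probe = weakref.ref(table)      # lets us check whether the table was freed
handler = create_handler(table)

del table
gc.collect()
table_freed = probe() is None   # the handler no longer pins the large object
```

If the handler truly needs the full object, passing it in per call or holding it through a `weakref` are alternative designs. Which one fits depends on who owns the data.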
5. Connections Not Closed
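# ❌ Leaky: connection opened, never closed
def get_data():
    conn = psycopg2.connect(DATABASE_URL)
    cur = conn.cursor()  # psycopg2 runs queries through a cursor
    cur.execute("SELECT * FROM events")
    return cur.fetchall()
    # conn.close() never called

Fix: Use a context manager or a finally block that closes the connection.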
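# ✅ Fixed: closing() guarantees conn.close(), even on error
# (in psycopg2, a bare "with conn:" only ends the transaction; it does not close)
from contextlib import closing

def get_data():
    with closing(psycopg2.connect(DATABASE_URL)) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM events")
            return cur.fetchall()

Detection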
Python
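Inside a running process, the standard-library tracemalloc module can compare two snapshots to show which lines allocated the most new memory. A minimal sketch, where the `leaky` list simulates a leak and the frame depth of 25 is an arbitrary choice:

```python
import tracemalloc

tracemalloc.start(25)                  # keep up to 25 frames per allocation traceback
before = tracemalloc.take_snapshot()

# Simulated leak: roughly 10 MB that stays referenced.
leaky = [bytes(1000) for _ in range(10_000)]

after = tracemalloc.take_snapshot()

# Differences grouped by source line, largest growth first.
top = after.compare_to(before, "lineno")
for stat in top[:5]:
    print(stat)
```

In a real service the two snapshots would be taken minutes or hours apart, for example from a debug endpoint or signal handler.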
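# tracemalloc: trace allocations from interpreter startup (25 frames per traceback);
# snapshots are then taken in code with tracemalloc.take_snapshot()
python -X tracemalloc=25 your_script.py
# Or in prod: objgraph (call it from inside the running service;
# a fresh interpreter only reports its own objects)
pip install objgraph
python -c "
import objgraph
objgraph.show_most_common_types(limit=20)
"

Go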
# pprof: heap profiling
go tool pprof http://localhost:6060/debug/pprof/heap

Process-level (Linux)
# Watch RSS of a process over time
pidstat -r -p $(pgrep -f myservice) 1
# Or
while true; do
  echo "$(date): $(ps -o rss= -p $(pgrep -f myservice)) KB"
  sleep 10
done

Prevention Checklist
□ Bounded caches (LRU with maxsize, or TTL eviction)
□ Event listeners removed when no longer needed
□ Global state has explicit lifecycle management
□ Closures don't capture large/heavy objects
□ DB connections use context managers (with block)
□ Background jobs / agents have max lifetime + restart policy
□ Health checks include memory metrics
□ Crash-only design: OOM kills process, orchestrator restarts
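As one illustration of the "health checks include memory metrics" item, a minimal sketch that reports resident memory against a limit. It assumes Linux (`/proc`), and the function names and the 0.8 warning ratio are hypothetical choices.

```python
import os


def rss_bytes():
    """Current resident set size of this process (Linux only, read from /proc)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def memory_health(limit_bytes, warn_ratio=0.8, read_rss=rss_bytes):
    """Return a small report a health endpoint could serialize."""
    used = read_rss()
    return {
        "rss_bytes": used,
        "limit_bytes": limit_bytes,
        "healthy": used < limit_bytes * warn_ratio,
    }
```

A health endpoint could return this report and fail the check when `healthy` is false, so the orchestrator restarts the process before the kernel OOM-kills it.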
Architecture Impact
For solution architects: memory leaks in data-plane components (sidecar proxies, agents, middleware) are higher severity than leaks in batch workers, because every request passes through them and a single crashing component can cause cascading failures.
                    ┌─────────────┐
Service A ─────────▶│    Envoy    │ ◀── leak here = all services affected
                    │  (sidecar)  │
                    └─────────────┘
K8s: Set resource limits. Let the kernel OOM-kill the container (status OOMKilled) so Kubernetes restarts it, rather than letting it leak indefinitely.
resources:
  limits:
    memory: 256Mi  # container is OOM-killed and restarted on leak, doesn't starve others