Custom Controllers
“https://kubernetes.io/docs/concepts/architecture/controller/”
A custom controller is a control loop that watches the state of the cluster and makes changes to move the actual state toward the desired state. Most built-in behavior in k8s is implemented by controllers: the controllers behind Deployments, ReplicaSets, StatefulSets, DaemonSets, and Jobs, plus the cloud-controller-manager. Custom controllers are how you add your own.
The control loop in one sentence
Observe → Diff → Act → Repeat
The controller:
- Observes the current state of the cluster (via the apiserver’s watch API)
- Diffs it against the desired state (in a Spec somewhere — a Deployment, a CR, etc.)
- Acts to make the actual state match the desired (create / update / delete resources)
- Repeats, triggered by the next change event or by a periodic requeue
This is a reconciliation loop. The controller “reconciles” the actual state toward the desired state.
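The Observe → Diff → Act steps can be sketched with plain in-memory data. This is a minimal illustration, not how any real controller stores state: the names `plan` and `act` and the `web-N` object names are hypothetical, and the "cluster" is just a slice of strings.

```go
package main

import (
	"fmt"
	"sort"
)

// plan is the "Diff" step: it compares desired and actual object names and
// returns what must be created and what must be deleted.
func plan(desired, actual []string) (toCreate, toDelete []string) {
	have := make(map[string]bool, len(actual))
	for _, name := range actual {
		have[name] = true
	}
	want := make(map[string]bool, len(desired))
	for _, name := range desired {
		want[name] = true
		if !have[name] {
			toCreate = append(toCreate, name)
		}
	}
	for _, name := range actual {
		if !want[name] {
			toDelete = append(toDelete, name)
		}
	}
	sort.Strings(toCreate)
	sort.Strings(toDelete)
	return toCreate, toDelete
}

// act is the "Act" step: it applies the plan to the in-memory actual state.
func act(actual, toCreate, toDelete []string) []string {
	remove := make(map[string]bool, len(toDelete))
	for _, name := range toDelete {
		remove[name] = true
	}
	var next []string
	for _, name := range actual {
		if !remove[name] {
			next = append(next, name)
		}
	}
	next = append(next, toCreate...)
	sort.Strings(next)
	return next
}

func main() {
	desired := []string{"web-0", "web-1", "web-2"}
	actual := []string{"web-0", "web-9"} // the "Observe" step, hard-coded here
	create, del := plan(desired, actual)
	actual = act(actual, create, del)
	fmt.Println(create, del, actual) // [web-1 web-2] [web-9] [web-0 web-1 web-2]
}
```

Running `plan` again on the result yields two empty lists, which is the "Repeat" step finding nothing left to do.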
The pattern, step by step
┌──────────────────────────────────────────────────────┐
│ desired state: Deployment with replicas=3            │
└────────────────────────┬─────────────────────────────┘
                         │
                         ▼
              ┌──────────────────────┐
              │  Watch Deployments   │ ← controller subscribes
              │  via the apiserver   │
              └──────────┬───────────┘
                         │ event: deployment "web" updated
                         ▼
              ┌──────────────────────┐
              │  Enqueue work item   │
              │  (workqueue)         │
              └──────────┬───────────┘
                         │
                         ▼
              ┌──────────────────────┐
              │  Worker picks up     │
              │  the work item       │
              └──────────┬───────────┘
                         │
                         ▼
              ┌──────────────────────┐
              │  Reconcile()         │
              │                      │
              │  1. Get current      │
              │  2. Diff vs desired  │
              │  3. Create/update/   │
              │     delete           │
              │  4. Update status    │
              │  5. Requeue          │
              └──────────┬───────────┘
                         │
                         ▼
                     (repeat)
The Reconcile function is the heart. It should be idempotent — calling it multiple times with the same desired state should produce the same result.
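Idempotency is easiest to see on a toy example. The sketch below assumes a hypothetical `cluster` type that only tracks a replica count; `reconcile` looks at current and desired state only, so calling it a second time changes nothing.

```go
package main

import "fmt"

// cluster is a stand-in for actual state: the number of running replicas.
type cluster struct{ replicas int }

// reconcile drives c toward desired and returns how many changes it made.
// It is level-based: it compares current and desired state and never asks
// "which event happened", so repeating it is harmless.
func reconcile(c *cluster, desired int) (changes int) {
	for c.replicas < desired {
		c.replicas++ // stand-in for "create a Pod"
		changes++
	}
	for c.replicas > desired {
		c.replicas-- // stand-in for "delete a Pod"
		changes++
	}
	return changes
}

func main() {
	c := &cluster{replicas: 1}
	fmt.Println(reconcile(c, 3)) // 2: the first call does the work
	fmt.Println(reconcile(c, 3)) // 0: the second call finds nothing to do
}
```

A handler written as "on update event, add one replica" would not have this property: a duplicated or replayed event would overshoot.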
What every custom controller has
Whether you use kubebuilder, operator-sdk, or write one from scratch, the pieces are:
1. Informer
An informer watches a resource type and maintains a local cache of the current state. It also emits events (add, update, delete) for changes.
// Sketch: factory is a client-go SharedInformerFactory; enqueue is a helper
// (not shown) that adds the object's "namespace/name" key to the workqueue.
informer := factory.Apps().V1().Deployments().Informer()
informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    func(obj interface{}) { enqueue(obj) },
	UpdateFunc: func(old, new interface{}) { enqueue(new) },
	DeleteFunc: func(obj interface{}) { enqueue(obj) },
})
The informer’s local cache means the controller doesn’t hammer the apiserver for every reconcile. It reads from the cache; writes go through the apiserver.
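To make the cache-plus-events idea concrete, here is a stdlib-only sketch of what an informer does conceptually. It is not the client-go implementation: the `event` and `informer` types are hypothetical, and the object body is reduced to a string.

```go
package main

import "fmt"

// event is a simplified watch event. Real watch events carry full objects.
type event struct {
	kind string // "add", "update", or "delete"
	key  string // "namespace/name"
	spec string // stand-in for the object body
}

// informer keeps a local cache in sync with events and notifies a handler.
type informer struct {
	cache   map[string]string
	handler func(key string)
}

func newInformer(handler func(key string)) *informer {
	return &informer{cache: map[string]string{}, handler: handler}
}

// apply updates the cache first and notifies second, so a handler that later
// reads the cache sees the new state.
func (i *informer) apply(e event) {
	if e.kind == "delete" {
		delete(i.cache, e.key)
	} else {
		i.cache[e.key] = e.spec
	}
	i.handler(e.key)
}

// get reads from the local cache, not from the apiserver.
func (i *informer) get(key string) (string, bool) {
	spec, ok := i.cache[key]
	return spec, ok
}

func main() {
	var queue []string // stand-in for the workqueue
	inf := newInformer(func(key string) { queue = append(queue, key) })
	inf.apply(event{"add", "default/web", "replicas=3"})
	inf.apply(event{"update", "default/web", "replicas=5"})
	spec, _ := inf.get("default/web")
	fmt.Println(queue, spec) // [default/web default/web] replicas=5
}
```

Note that the handler only receives a key. The reconciler reads the latest state from the cache when it runs, which is why several events for the same object can safely collapse into one reconcile.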
2. Workqueue
A rate-limited FIFO queue of work items (object keys like “namespace/name”) to be reconciled. The controller has one or more workers pulling items off the queue.
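queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
// workers; process is a helper (not shown) that reconciles one key and returns an error
for i := 0; i < workers; i++ {
	go func() {
		for {
			key, shutdown := queue.Get()
			if shutdown {
				return
			}
			if err := process(key); err != nil {
				queue.AddRateLimited(key) // retry later, with backoff
			} else {
				queue.Forget(key) // success: reset this key's backoff
			}
			queue.Done(key)
		}
	}()
}
If a work item fails, it’s requeued with exponential backoff. This prevents a flapping resource from drowning the controller.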
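Per-item exponential backoff can be sketched in a few lines. The `backoff` type below is hypothetical and stdlib-only; the 5ms base and 1000s cap are assumptions chosen to resemble the per-item limiter that client-go's default controller rate limiter is commonly described as using, so check the library source before relying on them.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 5 * time.Millisecond // assumed base delay
	maxDelay  = 1000 * time.Second   // assumed cap
)

// backoff tracks consecutive failures per key.
type backoff struct{ failures map[string]int }

func newBackoff() *backoff { return &backoff{failures: map[string]int{}} }

// when records a failure for key and returns how long to wait before the
// retry. The delay doubles with each consecutive failure, capped at maxDelay.
func (b *backoff) when(key string) time.Duration {
	n := b.failures[key]
	b.failures[key] = n + 1
	delay := baseDelay
	for i := 0; i < n; i++ {
		delay *= 2
		if delay >= maxDelay {
			return maxDelay
		}
	}
	return delay
}

// forget resets the failure count after a successful reconcile.
func (b *backoff) forget(key string) { delete(b.failures, key) }

func main() {
	b := newBackoff()
	for i := 0; i < 4; i++ {
		fmt.Println(b.when("default/web")) // 5ms, 10ms, 20ms, 40ms
	}
	b.forget("default/web")
	fmt.Println(b.when("default/web")) // 5ms again after a success
}
```

Because the counter is per key, one flapping object backs off on its own while healthy objects keep reconciling at full speed.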
3. Reconciler
The Reconcile(ctx, req) function. Given a request (an object key), it does the work.
func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch the object
	obj := &myv1.MyResource{}
	if err := r.Get(ctx, req.NamespacedName, obj); err != nil {
		if apierrors.IsNotFound(err) {
			return ctrl.Result{}, nil // gone, nothing to do
		}
		return ctrl.Result{}, err
	}
	// 2. Reconcile owned resources
	if err := r.reconcileDeployment(ctx, obj); err != nil {
		return ctrl.Result{}, err
	}
	if err := r.reconcileService(ctx, obj); err != nil {
		return ctrl.Result{}, err
	}
	// 3. Update status
	obj.Status.Ready = true
	if err := r.Status().Update(ctx, obj); err != nil {
		return ctrl.Result{}, err
	}
	// 4. Requeue if periodic reconciliation is needed
	return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}
The return values are important:
- ctrl.Result{}: done, don’t requeue
- ctrl.Result{Requeue: true}: done, requeue through the rate limiter (so not necessarily immediately)
- ctrl.Result{RequeueAfter: 30 * time.Second}: done, requeue in 30s
- (result, err) with err != nil: failed, requeue with backoff; the result is ignored
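The way a worker might interpret these return values can be written as one switch. This is a sketch under assumptions: `Result` is a local stand-in for `ctrl.Result` with only the two fields discussed here, and `nextAction` is a hypothetical helper that returns a description instead of touching a real queue.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Result is a local stand-in for ctrl.Result with the two fields used above.
type Result struct {
	Requeue      bool
	RequeueAfter time.Duration
}

// Example values used in main.
var (
	errBoom = errors.New("boom")
	example = Result{RequeueAfter: 30 * time.Second}
)

// nextAction describes what a worker would do with Reconcile's return values.
// The error is checked first, so a Result returned alongside an error is ignored.
func nextAction(res Result, err error) string {
	switch {
	case err != nil:
		return "retry with backoff"
	case res.RequeueAfter > 0:
		return "requeue after " + res.RequeueAfter.String()
	case res.Requeue:
		return "requeue via rate limiter"
	default:
		return "done"
	}
}

func main() {
	fmt.Println(nextAction(Result{}, nil))              // done
	fmt.Println(nextAction(Result{Requeue: true}, nil)) // requeue via rate limiter
	fmt.Println(nextAction(example, nil))               // requeue after 30s
	fmt.Println(nextAction(example, errBoom))           // retry with backoff
}
```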
4. Owner references
When the controller creates sub-resources (Deployments, Services, etc.), it sets metadata.ownerReferences to the parent CR. This makes garbage collection work:
- When the parent CR is deleted, the sub-resources are deleted automatically
- The controller’s finalizer logic runs before deletion (see below)
dep := &appsv1.Deployment{...}
if err := ctrl.SetControllerReference(parent, dep, r.Scheme); err != nil {
	return err
}
if err := r.Create(ctx, dep); err != nil {
	return err
}
5. Finalizers
When a CR that carries a finalizer is deleted, the apiserver doesn’t remove it immediately; it sets deletionTimestamp and waits for the controller to do cleanup. Finalizers are the mechanism.
const myFinalizer = "mycontroller.example.com/finalizer"
// controllerutil below is sigs.k8s.io/controller-runtime/pkg/controller/controllerutil
func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	obj := &myv1.MyResource{}
	if err := r.Get(ctx, req.NamespacedName, obj); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// If being deleted, run finalizer logic
	if !obj.DeletionTimestamp.IsZero() {
		if controllerutil.ContainsFinalizer(obj, myFinalizer) {
			// Run cleanup (keep it idempotent: it runs again if the Update below fails)
			if err := r.cleanup(ctx, obj); err != nil {
				return ctrl.Result{}, err
			}
			// Remove our finalizer
			controllerutil.RemoveFinalizer(obj, myFinalizer)
			if err := r.Update(ctx, obj); err != nil {
				return ctrl.Result{}, err
			}
		}
		return ctrl.Result{}, nil
	}
	// Add finalizer if not present
	if !controllerutil.ContainsFinalizer(obj, myFinalizer) {
		controllerutil.AddFinalizer(obj, myFinalizer)
		if err := r.Update(ctx, obj); err != nil {
			return ctrl.Result{}, err
		}
		// requeue to do the actual reconciliation
		return ctrl.Result{Requeue: true}, nil
	}
	// ... normal reconciliation
}
The flow:
- User deletes the CR
- The apiserver sets deletionTimestamp but, because a finalizer is present, does not remove the object
- The CR stays in Terminating state
- The controller sees the deletion, runs cleanup
- The controller removes the finalizer
- The apiserver deletes the CR
Without finalizers, deletion happens immediately and the controller’s cleanup never runs. Always add a finalizer if you have external resources to clean up.
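The deletion flow can be simulated without a cluster. In this sketch the `store` type plays the apiserver for a single object and `deleting` stands in for a non-zero deletionTimestamp; all names are hypothetical and the model ignores concurrency.

```go
package main

import "fmt"

const finalizer = "mycontroller.example.com/finalizer"

// object is a stand-in for a CR's metadata.
type object struct {
	finalizers []string
	deleting   bool // stand-in for a non-zero deletionTimestamp
}

// store is a stand-in for the apiserver holding one object.
type store struct{ obj *object }

// requestDelete marks the object as deleting. It is only removed once no
// finalizers remain.
func (s *store) requestDelete() {
	if s.obj != nil {
		s.obj.deleting = true
		s.gc()
	}
}

func (s *store) gc() {
	if s.obj != nil && s.obj.deleting && len(s.obj.finalizers) == 0 {
		s.obj = nil
	}
}

// reconcile adds the finalizer to live objects and, for deleting objects,
// runs cleanup before removing the finalizer.
func reconcile(s *store, cleanup func()) {
	o := s.obj
	if o == nil {
		return
	}
	if o.deleting {
		if has(o.finalizers, finalizer) {
			cleanup()
			o.finalizers = remove(o.finalizers, finalizer)
			s.gc()
		}
		return
	}
	if !has(o.finalizers, finalizer) {
		o.finalizers = append(o.finalizers, finalizer)
	}
}

func has(list []string, s string) bool {
	for _, item := range list {
		if item == s {
			return true
		}
	}
	return false
}

func remove(list []string, s string) []string {
	var out []string
	for _, item := range list {
		if item != s {
			out = append(out, item)
		}
	}
	return out
}

func main() {
	s := &store{obj: &object{}}
	reconcile(s, nil)          // adds the finalizer
	s.requestDelete()          // user deletes the CR
	fmt.Println(s.obj != nil)  // true: held back by the finalizer
	reconcile(s, func() { fmt.Println("cleanup ran") })
	fmt.Println(s.obj != nil)  // false: finalizer removed, object gone
}
```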
6. Leader election
A controller can run as multiple replicas for HA, but only one is active at a time. Leader election is the mechanism.
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
	LeaderElection:          true,
	LeaderElectionID:        "my-controller-leader",
	LeaderElectionNamespace: "kube-system",
})
If the active leader dies, another replica takes over once the lease expires. Expect a gap of roughly one lease duration (15 seconds by default in controller-runtime) rather than an instant handover.
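Lease-based election boils down to "whoever holds an unexpired lease is the leader". The sketch below is a simplified model with hypothetical names (`lease`, `tryAcquire`, `newLease`, `at`); a real implementation stores the lease in a coordination.k8s.io Lease object and relies on the apiserver's optimistic concurrency, which this in-memory version does not model.

```go
package main

import (
	"fmt"
	"time"
)

// lease is a stand-in for a coordination.k8s.io Lease object.
type lease struct {
	holder    string
	renewedAt time.Time
	duration  time.Duration
}

func newLease(seconds int) *lease {
	return &lease{duration: time.Duration(seconds) * time.Second}
}

// at returns a fixed point in time, so the example is deterministic.
func at(seconds int) time.Time { return time.Unix(int64(seconds), 0) }

// tryAcquire lets candidate take or renew the lease at time now. It returns
// true if candidate is the leader afterwards.
func (l *lease) tryAcquire(candidate string, now time.Time) bool {
	expired := now.Sub(l.renewedAt) > l.duration
	if l.holder == "" || l.holder == candidate || expired {
		l.holder = candidate
		l.renewedAt = now
		return true
	}
	return false
}

func main() {
	l := newLease(15)
	fmt.Println(l.tryAcquire("replica-a", at(0)))  // true: lease was free
	fmt.Println(l.tryAcquire("replica-b", at(5)))  // false: a holds a fresh lease
	fmt.Println(l.tryAcquire("replica-a", at(10))) // true: a renews
	// replica-a dies here and stops renewing
	fmt.Println(l.tryAcquire("replica-b", at(20))) // false: only 10s since renewal
	fmt.Println(l.tryAcquire("replica-b", at(26))) // true: lease expired, b takes over
}
```

The last two calls show why failover is not instant: the standby has to wait until the dead leader's lease runs out.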
7. RBAC
The controller needs permission to read the CRs it watches and create / update the resources it manages. You declare this in a ClusterRole and bind it to the controller’s ServiceAccount.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata: { name: my-controller }
rules:
  - apiGroups: [mygroup.example.com]
    resources: [myresources, myresources/status]
    verbs: [get, list, watch, update, patch]
  - apiGroups: [apps]
    resources: [deployments]
    verbs: [get, list, watch, create, update, patch, delete]
  - apiGroups: [""]
    resources: [services, configmaps, secrets]
    verbs: [get, list, watch, create, update, patch, delete]
  - apiGroups: [coordination.k8s.io]
    resources: [leases]
    verbs: [get, list, watch, create, update, patch, delete]  # for leader election
The blast radius of a controller is its RBAC. Keep it minimal.
The simplest possible controller (in Python with kopf)
import kopf
from kubernetes import client, config
@kopf.on.startup()
def configure(**kwargs):
    # Assumes the operator runs in-cluster; use config.load_kube_config() when running locally
    config.load_incluster_config()
@kopf.on.create('example.com', 'v1', 'foos')
def create_fn(spec, name, namespace, **kwargs):
    # create a Deployment for this Foo
    body = {
        'metadata': {'name': f'{name}-deploy'},
        'spec': {
            'replicas': spec.get('replicas', 1),
            'selector': {'matchLabels': {'app': name}},
            'template': {
                'metadata': {'labels': {'app': name}},
                'spec': {
                    'containers': [{
                        'name': 'foo',
                        'image': spec['image'],
                    }],
                },
            },
        },
    }
    # set the owner reference so the Deployment is garbage-collected with the Foo
    kopf.adopt(body)
    deploy = client.AppsV1Api().create_namespaced_deployment(
        namespace=namespace,
        body=body,
    )
    return {'deployment': deploy.metadata.name}
@kopf.on.update('example.com', 'v1', 'foos')
def update_fn(spec, status, name, namespace, **kwargs):
    # update the Deployment
    patch = {'spec': {'replicas': spec.get('replicas', 1)}}
    client.AppsV1Api().patch_namespaced_deployment(
        name=f'{name}-deploy', namespace=namespace, body=patch,
    )
@kopf.on.delete('example.com', 'v1', 'foos')
def delete_fn(spec, name, namespace, **kwargs):
    # cleanup (if needed)
    pass
kopf (Kubernetes Operator Pythonic Framework) is a low-boilerplate entry point for Python controllers.
The simplest possible controller (in Go with kubebuilder)
Kubebuilder generates the boilerplate. You write the Reconcile:
func (r *FooReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	foo := &examplev1.Foo{}
	if err := r.Get(ctx, req.NamespacedName, foo); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// your logic here
	return ctrl.Result{}, nil
}
func (r *FooReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&examplev1.Foo{}).
		Owns(&appsv1.Deployment{}).
		Complete(r)
}
For is the primary watch; Owns is a secondary watch for resources we own. A change to an owned Deployment enqueues the Foo that owns it, not the Deployment itself.
Status subresource
A CRD can expose .status through a separate /status endpoint (the “status subresource”):
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
spec:
  versions:
    - name: v1
      subresources:
        status: {}   # <-- this enables the subresource
With this:
- Users PUT to the main resource to change .spec (the desired state); any .status in that request is ignored
- The controller PUTs to /status to report the actual state; any .spec change in that request is ignored
- They’re separate operations, so a user updating .spec doesn’t accidentally clobber .status
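The separation between the two endpoints can be modelled as two update functions that each apply only their own half of the object. This is an illustrative sketch with hypothetical names (`foo`, `updateMain`, `updateStatus`), not apiserver code.

```go
package main

import "fmt"

// foo is a stand-in for a custom resource with both halves reduced to strings.
type foo struct {
	spec   string
	status string
}

// updateMain models a write to the main resource endpoint when the status
// subresource is enabled: only spec is applied, incoming status is ignored.
func updateMain(stored, incoming foo) foo {
	stored.spec = incoming.spec
	return stored
}

// updateStatus models a write to the /status endpoint: only status is applied.
func updateStatus(stored, incoming foo) foo {
	stored.status = incoming.status
	return stored
}

func main() {
	stored := foo{spec: "replicas=3", status: "ready=3"}
	// A user submits a stale copy that has an empty status.
	stored = updateMain(stored, foo{spec: "replicas=5", status: ""})
	fmt.Println(stored) // {replicas=5 ready=3}: status survived
	// The controller reports progress without touching spec.
	stored = updateStatus(stored, foo{spec: "ignored", status: "ready=5"})
	fmt.Println(stored) // {replicas=5 ready=5}
}
```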
Generation and observed generation
The apiserver maintains a metadata.generation field that increments on every spec change. The controller records the generation it last reconciled in .status.observedGeneration; comparing the two tells it whether the spec has changed since.
if foo.Status.ObservedGeneration != foo.Generation {
	// spec has changed since we last reconciled
	// re-reconcile, then set foo.Status.ObservedGeneration = foo.Generation
}
This is a simple way to detect “the user changed something, I need to re-process”.
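The generation handshake has two writers: the apiserver bumps `generation`, the controller writes `observedGeneration`. A small simulation makes the roles explicit; the `foo` type and the `updateSpec`, `needsReconcile`, and `reconcile` helpers are hypothetical.

```go
package main

import "fmt"

// foo is a stand-in for a custom resource.
type foo struct {
	generation         int64  // metadata.generation, bumped by the apiserver
	spec               string // the desired state
	observedGeneration int64  // status.observedGeneration, written by the controller
}

// updateSpec models the apiserver: a real spec change bumps generation.
func updateSpec(f *foo, spec string) {
	if f.spec != spec {
		f.spec = spec
		f.generation++
	}
}

// needsReconcile reports whether the controller has not yet seen this spec.
func needsReconcile(f *foo) bool {
	return f.observedGeneration != f.generation
}

// reconcile models the controller: act on the spec, then record the
// generation it acted on. It returns whether any work was done.
func reconcile(f *foo) bool {
	if !needsReconcile(f) {
		return false
	}
	// ... act on f.spec here ...
	f.observedGeneration = f.generation
	return true
}

func main() {
	f := &foo{}
	updateSpec(f, "replicas=3")
	fmt.Println(reconcile(f)) // true: new spec
	fmt.Println(reconcile(f)) // false: already observed
	updateSpec(f, "replicas=5")
	fmt.Println(reconcile(f)) // true: spec changed again
}
```

Users can read the same pair of fields: if observedGeneration lags behind generation, the status they see describes an older spec.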
Conditions
A status.conditions array is the standard way to report per-aspect health:
foo.Status.Conditions = []metav1.Condition{
	{
		Type:               "Ready",
		Status:             metav1.ConditionTrue,
		Reason:             "AllReplicasReady",
		Message:            "3/3 replicas are ready",
		LastTransitionTime: metav1.Now(),
	},
	{
		Type:               "Progressing",
		Status:             metav1.ConditionTrue,
		Reason:             "Reconciling",
		Message:            "Scaling up to 5 replicas",
		LastTransitionTime: metav1.Now(),
	},
}
Tools like kubectl get foo foo-1 -o yaml show this. kubectl wait --for=condition=Ready can wait on it.
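One detail worth getting right is that LastTransitionTime should change only when a condition's Status actually flips, not on every reconcile. The sketch below shows that rule with a hypothetical local `condition` type and `setCondition` helper; in a real controller, `meta.SetStatusCondition` from k8s.io/apimachinery/pkg/api/meta applies the same rule to `metav1.Condition`.

```go
package main

import (
	"fmt"
	"time"
)

// condition is a local stand-in for metav1.Condition.
type condition struct {
	Type, Status, Reason, Message string
	LastTransitionTime            time.Time
}

// at returns a fixed point in time, so the example is deterministic.
func at(seconds int64) time.Time { return time.Unix(seconds, 0) }

// setCondition inserts or updates a condition by Type. LastTransitionTime is
// only touched when Status changes; Reason and Message are always refreshed.
func setCondition(conds []condition, c condition, now time.Time) []condition {
	for i := range conds {
		if conds[i].Type != c.Type {
			continue
		}
		if conds[i].Status != c.Status {
			conds[i].Status = c.Status
			conds[i].LastTransitionTime = now
		}
		conds[i].Reason = c.Reason
		conds[i].Message = c.Message
		return conds
	}
	c.LastTransitionTime = now
	return append(conds, c)
}

func main() {
	var conds []condition
	conds = setCondition(conds, condition{Type: "Ready", Status: "False", Reason: "Starting"}, at(1))
	conds = setCondition(conds, condition{Type: "Ready", Status: "False", Reason: "StillStarting"}, at(2))
	fmt.Println(conds[0].LastTransitionTime.Equal(at(1))) // true: status did not flip
	conds = setCondition(conds, condition{Type: "Ready", Status: "True", Reason: "AllReplicasReady"}, at(3))
	fmt.Println(conds[0].LastTransitionTime.Equal(at(3))) // true: status flipped
}
```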
Event recording
Controllers should record Events for important things (“scaled up from 3 to 5”, “failed to create Service”):
r.Recorder.Event(foo, corev1.EventTypeNormal, "Scaled", "Scaled up from 3 to 5 replicas")
r.Recorder.Eventf(foo, corev1.EventTypeWarning, "Failed", "Could not create Service: %v", err)
kubectl describe foo foo-1 shows the events. Events are regular API objects that the apiserver keeps only for a limited time (one hour by default), so export them to your log pipeline if you need history.
Metrics
Controllers should expose Prometheus metrics. The most common (controller-runtime registers these for you):
- controller_runtime_reconcile_total: total reconciles by result (for example success or error)
- controller_runtime_reconcile_errors_total: errors
- controller_runtime_reconcile_time_seconds: histogram of reconcile duration
- workqueue_depth: current workqueue depth
- workqueue_queue_duration_seconds: time an item waits in the workqueue
These are the basics; a custom controller can add more (e.g. foo_replicas_desired, foo_replicas_ready).
The full controller stack
A real production controller has all of this:
┌─────────────────────────────────────────────────┐
│ Deployment (controller-manager)                 │
│ ├─ pod: my-controller-abc                       │
│ │                                               │
│ │   my-controller process                       │
│ │   ├─ Manager (controller-runtime)             │
│ │   │   ├─ Leader election                      │
│ │   │   ├─ Metrics server (Prometheus)          │
│ │   │   ├─ Health probe                         │
│ │   │   ├─ Cache (informers)                    │
│ │   │   └─ Reconciler(s)                        │
│ │   │       ├─ Informer for primary CR          │
│ │   │       ├─ Informer for owned resources     │
│ │   │       ├─ Workqueue                        │
│ │   │       ├─ Reconcile loop                   │
│ │   │       ├─ Event recorder                   │
│ │   │       └─ Status updater                   │
│ │   │                                           │
│ │   ├─ /metrics endpoint                        │
│ │   └─ /healthz endpoint                        │
│ │                                               │
│ ServiceAccount: my-controller                   │
│ ClusterRole + ClusterRoleBinding:               │
│   - read CRs                                    │
│   - create / update owned resources             │
│   - leases (for leader election)                │
│   - events (for recording)                      │
│                                                 │
└─────────────────────────────────────────────────┘
Common controller mistakes
1. Not setting owner references
// WRONG: the sub-resource has no owner
r.Create(ctx, deployment)
// RIGHT: set the owner so GC works, and check both errors
if err := ctrl.SetControllerReference(parent, deployment, r.Scheme); err != nil {
	return err
}
if err := r.Create(ctx, deployment); err != nil {
	return err
}
2. Not using finalizers
If the controller creates external resources (e.g. a cloud DB instance), and doesn’t add a finalizer, deleting the CR leaves the cloud resource orphaned.
3. Reconciling the wrong object
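Reconcile the object you care about (the CR). When a dependent such as an owned Deployment changes, enqueue the owning CR (this is what Owns does) instead of treating every dependent as its own unit of work; otherwise the controller does redundant work on every update.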
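Mapping a dependent back to its owner is a lookup over the dependent's owner references. The sketch below uses hypothetical local types (`ownerRef`, `object`) rather than the real `metav1.OwnerReference`, and assumes the owner lives in the same namespace as the dependent, which is what namespaced owner references require.

```go
package main

import "fmt"

// ownerRef is a reduced stand-in for an entry in metadata.ownerReferences.
type ownerRef struct {
	kind       string
	name       string
	controller bool // true for the single managing owner
}

// object is a reduced stand-in for a dependent resource.
type object struct {
	namespace string
	name      string
	owners    []ownerRef
}

// ownerKey returns the "namespace/name" key of the controlling owner of the
// given kind, or false if this object is not managed by such an owner.
func ownerKey(o object, ownerKind string) (string, bool) {
	for _, ref := range o.owners {
		if ref.controller && ref.kind == ownerKind {
			return o.namespace + "/" + ref.name, true
		}
	}
	return "", false
}

func main() {
	dep := object{
		namespace: "default",
		name:      "web-deploy",
		owners:    []ownerRef{{kind: "Foo", name: "web", controller: true}},
	}
	key, ok := ownerKey(dep, "Foo")
	fmt.Println(key, ok) // default/web true: enqueue the Foo, not the Deployment

	stray := object{namespace: "default", name: "other"}
	_, ok = ownerKey(stray, "Foo")
	fmt.Println(ok) // false: not ours, ignore the event
}
```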
4. Forgetting RequeueAfter
If your controller needs periodic reconciliation (e.g. to check health), return RequeueAfter. Otherwise it only runs when something changes.
5. Long-running reconciliations
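A Reconcile that takes 10 minutes ties up a worker for that long, and controller-runtime runs one worker per controller by default, so every other object waits. Start long operations elsewhere (a separate goroutine or a Job), return, and check on progress in a later reconcile.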
6. Reading directly from the apiserver
Use the local cache (informers). Reading from the apiserver in every reconcile is slow and adds load.
7. Tight requeue loops
If you requeue with no delay, the controller loops fast. Use RequeueAfter to slow down.
8. No status updates
If .status never updates, users can’t tell if the controller is working. Update it.
When to write a controller
- You have a custom resource that needs to do something — write a controller
- You’re managing an application’s lifecycle — write a controller
- You want to extend k8s with new behavior — write a controller
- You have automation that runs in CI — it can be a controller, but it might not need to be
When NOT to write a controller
- You can do it with a CronJob + kubectl — don’t write a controller
- You only need to validate or mutate objects at request time: use an admission webhook (admission is synchronous, controllers are asynchronous)
- The upstream project provides one — use it
- You don’t have Go / Python / Java skills on the team — find an operator that does what you need
See also
- Operators — when controllers encode operational knowledge
- CRDs — the API extension
- Finalizers — for cleanup
- Garbage Collection — owner references