agentic-ai AI Kubernetes Wednesday, June 3, 2026

Agentic AI Is Reshaping AI Infrastructure: Why CPU Demand Is Surging on Kubernetes

CloudChef
thecloudchef.io
Agentic AI infrastructure diagram showing GPU inference nodes and CPU orchestration workers on Kubernetes

Everyone talks about GPUs when AI infrastructure comes up, and that made sense when the job was mostly one prompt in, one answer out.

Agentic AI changed the kitchen. Instead of a single chef plating one dish, you now have a head chef (planner), line cooks (tools), runners (API calls), and quality checkers (validators) working in loops until the ticket is done. That coordination burns CPU, and lots of it, even when the “creative” step still runs on a GPU.

👉 If you run Kubernetes, platform engineering, or FinOps for AI workloads, the capacity plan you wrote for “inference only” is already out of date.


😀 The Problem

Picture a production cluster tuned for LLM serving:

  • GPU node pool sized for model weights and batch inference
  • CPU requests kept tiny on the “API wrapper” Deployment
  • Autoscaling tied to GPU utilization or token throughput

You ship an agentic workflow (incident investigator, security reviewer, data pipeline copilot) and within a week you see:

  • CPU throttling on orchestrator pods (planning loops, JSON parsing, state machines)
  • Latency spikes that are not GPU-bound; they come from queue depth on tool workers
  • Runaway cost from agents that spawn subprocesses, browsers, or sandboxes per step
  • Noisy neighbors when retrieval, embedding, and tool execution share the same nodes as your control plane

The observability dashboard still shouts “GPU at 40%” while customers complain the agent “feels stuck.”


WARN  agent-orchestrator-7f9c  CPU throttled > 85% for 12m
INFO  gpu-inference-0a2b       GPU util 38%  tokens/s stable
ERROR tool-worker-4d1e         OOMKilled (sandbox + browser subprocess)

👉 The problem is not “we need more GPUs.” It is that we planned for a microwave and built a full-service kitchen.


🔍 Root Cause

Agentic AI is not one model call. It is a control loop:

  1. Observe context (logs, tickets, code, metrics)
  2. Plan the next step
  3. Call tools (CLI, APIs, databases, browsers, MCP servers)
  4. Validate output
  5. Repeat until done or budget exhausted

Each lap adds CPU-heavy work that classic “chat completion” diagrams leave out:

  • Orchestration runtimes: state graphs, workflow engines, retries, idempotency keys
  • Tool execution: shell sandboxes, interpreters, container sidecars, serverless functions
  • Retrieval & data plane: chunking, reranking, vector DB queries, metadata filters (often CPU-first)
  • Integration glue: webhooks, ETL transforms, serialization, policy checks, audit logging
  • Observability: trace export, prompt/response redaction, eval harnesses, cost attribution per step

GPUs still matter for large-model inference and some embedding accelerators, but the marginal dollar and the marginal core-hour increasingly land on CPUs as agents become the default in DevOps and security operations. Industry signals (hyperscaler CPU roadmap emphasis, agent frameworks shipping “worker” tiers, and enterprise rollouts of operations agents) all point in the same direction: AI infrastructure is becoming a two-pool problem, with GPU for thinking bursts and CPU for doing and coordinating.

On Kubernetes, that shows up as mis-sized requests/limits, wrong node selectors, and autoscalers watching the wrong metric.


📊 How Agentic Workloads Split Across the Cluster

flowchart LR
  subgraph ingress["Ingress / API"]
    GW[API Gateway]
  end
  subgraph cpu_plane["CPU plane (orchestration + tools)"]
    ORCH[Agent orchestrator]
    TOOL[Tool workers]
    RAG[Retrieval / rerank]
    OBS[Tracing + policy]
  end
  subgraph gpu_plane["GPU plane (inference)"]
    LLM[Model serving]
    EMB[Embeddings optional]
  end
  GW --> ORCH
  ORCH --> LLM
  ORCH --> TOOL
  ORCH --> RAG
  TOOL --> OBS
  ORCH --> OBS
  RAG --> LLM

Think of it like a restaurant: the GPU is the specialized station for the one dish that needs a blowtorch. The CPU fleet is everyone else: prep, plating, expediting, dishwashing, and the manager walking the floor.


🍳 CloudChef Recipe: Right-Size Agent Infrastructure on Kubernetes

Below is a practical sequence for platform teams. Names are illustrative—swap in your agent runtime (custom controller, LangGraph service, managed agent product, etc.).

Step 1 — Separate node pools (GPU vs agent workers)

Do not schedule browser sandboxes next to kube-system. Taint GPU nodes for inference; use a dedicated CPU pool for agents with a clear label.

apiVersion: v1
kind: Namespace
metadata:
  name: agent-platform
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-orchestrator
  namespace: agent-platform
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-orchestrator
  template:
    metadata:
      labels:
        app: agent-orchestrator
        tier: cpu-agent
    spec:
      nodeSelector:
        workload.cloudchef.io/tier: agent-cpu
      containers:
        - name: orchestrator
          image: registry.example/cloudchef/agent-orchestrator:1.4.0
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 8Gi
          env:
            - name: MAX_AGENT_STEPS
              value: "25"
            - name: TOOL_TIMEOUT_SECONDS
              value: "120"
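
Step 1 also calls for tainting the GPU nodes, which the orchestrator manifest does not show. Below is a minimal sketch of the inference side. The label and taint key/value (`workload.cloudchef.io/tier=gpu-inference`), the Deployment name, and the image are assumptions made for illustration, and the `nvidia.com/gpu` resource is only schedulable if the NVIDIA device plugin is running on those nodes.

```yaml
# Assumed one-time node setup (label and taint values are hypothetical):
#   kubectl label node <gpu-node> workload.cloudchef.io/tier=gpu-inference
#   kubectl taint node <gpu-node> workload.cloudchef.io/tier=gpu-inference:NoSchedule
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving              # illustrative name
  namespace: agent-platform
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
        tier: gpu-inference
    spec:
      # Pin inference pods to the GPU pool.
      nodeSelector:
        workload.cloudchef.io/tier: gpu-inference
      # Only pods carrying this toleration may land on the tainted GPU nodes,
      # which keeps orchestrators and tool workers off expensive hardware.
      tolerations:
        - key: workload.cloudchef.io/tier
          operator: Equal
          value: gpu-inference
          effect: NoSchedule
      containers:
        - name: model-server
          image: registry.example/cloudchef/model-server:1.0.0   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1    # extended resource; request defaults to the limit
```

The taint keeps CPU-only pods off the GPU nodes, while the nodeSelector keeps inference pods from drifting onto the agent pool. Either one alone covers only half of the separation.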

Step 2: Size tool workers for burst CPU, not “leftover” cores

Tool pods are where agents fork processes, run CLIs, or launch headless browsers. Give them explicit limits so one runaway investigation cannot starve the pool, and a PodDisruptionBudget so node drains and upgrades do not evict too many workers at once.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-tool-worker
  namespace: agent-platform
spec:
  replicas: 5
  selector:
    matchLabels:
      app: agent-tool-worker
  template:
    metadata:
      labels:
        app: agent-tool-worker
        tier: cpu-agent
    spec:
      nodeSelector:
        workload.cloudchef.io/tier: agent-cpu
      containers:
        - name: tool-runtime
          image: registry.example/cloudchef/agent-tools:1.4.0
          securityContext:
            runAsNonRoot: true
            allowPrivilegeEscalation: false
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
            limits:
              cpu: "8"
              memory: 16Gi
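
The PodDisruptionBudget mentioned in this step could look roughly like the sketch below. The name and the `maxUnavailable` value are assumptions; tune them to how many workers you can afford to lose during a node drain.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: agent-tool-worker-pdb      # illustrative name
  namespace: agent-platform
spec:
  # Allow at most one tool worker to be evicted at a time during
  # voluntary disruptions such as node drains and cluster upgrades.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: agent-tool-worker
```

`maxUnavailable` is used here instead of `minAvailable` so the budget never blocks a drain entirely when an autoscaler has scaled the Deployment down to its minimum. A PDB only covers voluntary evictions; it does not stop OOM kills or node failures.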

Step 3: Autoscale on queue depth and step latency (not GPU %)

For event-driven agents, wire KEDA (or equivalent) to the queue that feeds tool workers: SQS, Kafka, RabbitMQ, or a Redis stream.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-tool-worker-scaler
  namespace: agent-platform
spec:
  scaleTargetRef:
    name: agent-tool-worker
  minReplicaCount: 2
  maxReplicaCount: 40
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/cloudchef-agent-tool-jobs
        queueLength: "30"
        awsRegion: us-east-1
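
The manifest above covers queue depth. For the step-latency half of this step, one option is a KEDA Prometheus trigger on the orchestrator, sketched below. This assumes a Prometheus server reachable at the address shown and an orchestrator that exports a histogram named `agent_step_duration_seconds`; both are hypothetical, as are the replica bounds and the 5-second target.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-orchestrator-scaler  # illustrative name
  namespace: agent-platform
spec:
  scaleTargetRef:
    name: agent-orchestrator
  minReplicaCount: 3
  maxReplicaCount: 12
  triggers:
    - type: prometheus
      # "Value" compares the raw query result to the threshold instead of
      # dividing it by the replica count, which suits a latency signal.
      metricType: Value
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # assumed address
        # p95 step latency in seconds over the last 2 minutes
        # (metric name is hypothetical).
        query: |
          histogram_quantile(0.95, sum(rate(agent_step_duration_seconds_bucket[2m])) by (le))
        threshold: "5"
```

Scaling on latency only helps if the orchestrator is the bottleneck. If steps are slow because tool workers are saturated, the queue-depth trigger is the signal that matters.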

Step 4: Cap cost and blast radius per run

Agents without budgets are all-you-can-eat buffets. Enforce step limits, token ceilings, and concurrency per tenant at the orchestrator.

# Illustrative guardrails (apply in your agent gateway or admission policy)
export MAX_AGENT_STEPS=25
export MAX_TOOL_CONCURRENCY=3
export MAX_WALL_CLOCK_SECONDS=900
export DENY_TOOLS="raw_exec,host_mount,cluster_admin"
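
Application-level budgets can be backed by a cluster-level ceiling. The sketch below is a complementary guardrail, not something the guardrails above already imply: a ResourceQuota on the namespace. The numbers are assumptions derived from the manifests in this post (3 orchestrators plus up to 40 tool workers) with a little headroom.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: agent-platform-quota       # illustrative name
  namespace: agent-platform
spec:
  hard:
    # Requests: 3 x 2 CPU (orchestrators) + 40 x 4 CPU (tool workers) = 166
    requests.cpu: "170"
    # Limits: 3 x 4 CPU + 40 x 8 CPU = 332
    limits.cpu: "340"
    # Requests: 3 x 4Gi + 40 x 8Gi = 332Gi
    requests.memory: 340Gi
    # Limits: 3 x 8Gi + 40 x 16Gi = 664Gi
    limits.memory: 680Gi
    # Hard stop on pod count, whatever the autoscaler asks for.
    pods: "50"
```

Once a quota constrains CPU or memory, every pod in the namespace must declare the matching requests and limits or it is rejected at admission. That is a useful forcing function for agent workloads.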

Step 5: Observe the CPU path like you observe the GPU path

Export traces that include per-step spans: plan, retrieve, tool, validate. Tag spans with tenant, agent_id, and tool_name so FinOps can answer “which workflow burned 400 CPU-minutes?”

kubectl top pods -n agent-platform --containers
kubectl describe node -l workload.cloudchef.io/tier=agent-cpu | grep -A5 "Allocated resources"
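
The commands above give a point-in-time view. To catch throttling continuously, one possible approach is an alert on the cAdvisor CFS throttling counters, sketched below. It assumes the Prometheus Operator is installed (for the `PrometheusRule` resource) and that cAdvisor metrics are scraped. The rule name, the 25% threshold, and the 10-minute window are illustrative choices.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: agent-cpu-throttling       # illustrative name
  namespace: agent-platform
spec:
  groups:
    - name: agent-platform.cpu
      rules:
        - alert: AgentPodCPUThrottled
          # Fraction of CFS scheduling periods in which the pod was throttled.
          expr: |
            sum by (pod) (rate(container_cpu_cfs_throttled_periods_total{namespace="agent-platform"}[5m]))
              /
            sum by (pod) (rate(container_cpu_cfs_periods_total{namespace="agent-platform"}[5m]))
              > 0.25
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is CPU throttled in more than 25% of CFS periods"
```

Depending on how your Prometheus instance selects rules, the resource may also need a matching label (for example a release label) before it is picked up.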

✅ Best Practices

  • Two-pool architecture — GPU for model serving; CPU for orchestration, tools, retrieval, and policy
  • Right-size requests — measure p95 step latency under load; avoid 100m CPU requests on orchestrators
  • Queue-backed tool workers — absorb spikes without pinning GPUs idle while CPUs queue
  • Sandbox by default — non-root, no hostPath, network policies between tool workers and sensitive namespaces
  • Budget every loop — max steps, timeouts, and per-tenant concurrency
  • Cache retrieval — identical RAG queries in a loop are duplicate prep work
  • Prefer smaller models for routing — use a cheap CPU-friendly classifier to decide when to escalate to the big GPU model
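
For the “sandbox by default” practice, a NetworkPolicy along these lines is one way to fence off tool workers. Treat it as a sketch under assumptions: the cluster's CNI must enforce NetworkPolicy, DNS is assumed to run in `kube-system`, and the private CIDR ranges stand in for wherever your sensitive services actually live.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-tool-worker-isolation   # illustrative name
  namespace: agent-platform
spec:
  podSelector:
    matchLabels:
      app: agent-tool-worker
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Only the orchestrator may open connections to tool workers.
    - from:
        - podSelector:
            matchLabels:
              app: agent-orchestrator
  egress:
    # Allow DNS lookups against the cluster DNS in kube-system.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow public endpoints, but block private ranges where
    # in-cluster and internal services usually sit.
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
              - 172.16.0.0/12
              - 192.168.0.0/16
```

If tool workers must reach a specific internal service, such as a queue behind a private endpoint, add a narrow egress rule for it instead of loosening the `except` list.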

⚠️ Common Mistakes

  • Treating agent pods like lightweight “API gateways” with 200m CPU requests
  • Autoscaling only on GPU utilization while CPU queues grow
  • Running tools in the same namespace as production data stores without policy
  • Ignoring embedding and rerank services when sizing CPU (they are easy to forget)
  • No per-tenant cost attribution—agents make cloud bills opaque fast
  • Assuming managed agents remove the need for your own worker pool (they often shift, not eliminate, CPU load)
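
One way to guard against the first mistake is a LimitRange that rejects undersized containers at admission. The sketch below is only that; the 500m floor and the defaults are assumed values, not measured ones.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: agent-platform-cpu-floor   # illustrative name
  namespace: agent-platform
spec:
  limits:
    - type: Container
      # Reject any container that requests less than 500m CPU.
      min:
        cpu: 500m
      # Applied when a container omits its requests.
      defaultRequest:
        cpu: "1"
        memory: 2Gi
      # Applied when a container omits its limits.
      default:
        cpu: "2"
        memory: 4Gi
```

The floor applies to every container in the namespace, including small sidecars, so check what else runs in `agent-platform` before enforcing it.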


Content disclaimer

Unless stated otherwise, examples use fictional or illustrative data (cluster names, ARNs, image tags, queue URLs, and account IDs). Always read, adapt, and test commands, manifests, and agent policies in a non-production environment first. In production, use change control, backups, and peer review so you do not cause data loss, secret exposure, misconfiguration, or outages. Agent frameworks and cloud pricing change frequently, so validate against your vendor docs before you scale.


🔥 CloudChef Pro Tip

When leadership asks for “more GPUs,” show them two metrics side by side: GPU utilization on inference and CPU queue time on agent steps. If queue time climbs while GPUs look comfortable, you do not have a model problem; you have an expediter problem. Fix the CPU pool, the tool sandbox limits, and the autoscaler triggers first; only then chase bigger cards.


🚀 Final Thoughts

Agentic AI is reshaping AI infrastructure because the unit of work changed, from tokens to tasks. Tasks need coordination, tools, retrieval, and governance, and that stack runs hot on CPUs even when the headline model still lives on a GPU.

Platform engineers who split pools, scale on the right signals, and budget every loop will ship agents that feel fast and survive finance review. Everyone else will keep buying GPUs for a kitchen that needed more line cooks.

👉 Plan for both stations. Your cluster will thank you at 2 a.m.

