Everyone talks about GPUs when AI infrastructure comes up—and that made sense when the job was mostly one prompt in, one answer out.
Agentic AI changed the kitchen. Instead of a single chef plating one dish, you now have a head chef (planner), line cooks (tools), runners (API calls), and quality checkers (validators) working in loops until the ticket is done. That coordination burns CPU—lots of it—even when the “creative” step still runs on a GPU.
If you run Kubernetes, platform engineering, or FinOps for AI workloads, the capacity plan you wrote for “inference only” is already out of date.
The Problem
Picture a production cluster tuned for LLM serving:
- GPU node pool sized for model weights and batch inference
- CPU requests kept tiny on the “API wrapper” Deployment
- Autoscaling tied to GPU utilization or token throughput
You ship an agentic workflow—incident investigator, security reviewer, data pipeline copilot—and within a week:
- CPU throttling on orchestrator pods (planning loops, JSON parsing, state machines)
- Latency spikes that are not GPU-bound—they are queue depth on tool workers
- Runaway cost from agents that spawn subprocesses, browsers, or sandboxes per step
- Noisy neighbors when retrieval, embedding, and tool execution share the same nodes as your control plane
The observability dashboard still shouts “GPU at 40%” while customers complain the agent “feels stuck.”
WARN agent-orchestrator-7f9c CPU throttled > 85% for 12m
INFO gpu-inference-0a2b GPU util 38% tokens/s stable
ERROR tool-worker-4d1e OOMKilled (sandbox + browser subprocess)
The problem is not “we need more GPUs.” It is that we planned for a microwave and built a full-service kitchen.
Root Cause
Agentic AI is not one model call. It is a control loop:
- Observe context (logs, tickets, code, metrics)
- Plan the next step
- Call tools (CLI, APIs, databases, browsers, MCP servers)
- Validate output
- Repeat until done or budget exhausted
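As a rough sketch of that loop (not any particular framework's API), the Python below wires the stages together with a step budget. The names `AgentRun`, `run_agent`, and the four callables are hypothetical stand-ins; in a real system `plan` would wrap the model call and `call_tool` would dispatch to a sandboxed worker.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    goal: str
    max_steps: int = 25  # illustrative default, mirrors MAX_AGENT_STEPS
    history: list = field(default_factory=list)


def run_agent(run, observe, plan, call_tool, validate):
    """Loop observe -> plan -> tool -> validate until done or out of budget."""
    for step in range(1, run.max_steps + 1):
        context = observe(run)        # CPU: parse logs, tickets, metrics
        action = plan(run, context)   # GPU: the model call would sit here
        result = call_tool(action)    # CPU: CLI, API, database, browser
        run.history.append((step, action, result))
        if validate(run, result):     # CPU: schema and policy checks
            return "done"
    return "budget_exhausted"


if __name__ == "__main__":
    # Stub stages so the loop can run without a model or a cluster.
    run = AgentRun(goal="find failing pod", max_steps=5)
    status = run_agent(
        run,
        observe=lambda r: {"steps_so_far": len(r.history)},
        plan=lambda r, ctx: f"check-{ctx['steps_so_far']}",
        call_tool=lambda action: action.upper(),
        validate=lambda r, result: result == "CHECK-2",
    )
    print(status, len(run.history))  # prints: done 3
```

Only one of the four stages per lap touches the model; the other three are ordinary CPU work, which is the point of the list above.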
Each lap adds CPU-heavy work that classic “chat completion” diagrams leave out:
- Orchestration runtimes — state graphs, workflow engines, retries, idempotency keys
- Tool execution — shell sandboxes, interpreters, container sidecars, serverless functions
- Retrieval & data plane — chunking, reranking, vector DB queries, metadata filters (often CPU-first)
- Integration glue — webhooks, ETL transforms, serialization, policy checks, audit logging
- Observability — trace export, prompt/response redaction, eval harnesses, cost attribution per step
GPUs still matter for large-model inference and for some embedding workloads, but the marginal dollar and the marginal core-hour increasingly land on CPUs as agents become the default in DevOps and security operations. Industry signals all point the same direction: hyperscalers are emphasizing CPUs in their roadmaps, agent frameworks are shipping “worker” tiers, and enterprises are rolling out operations agents. AI infrastructure is becoming a two-pool problem: GPU for thinking bursts, CPU for doing and coordinating.
On Kubernetes, that shows up as mis-sized requests/limits, wrong node selectors, and autoscalers watching the wrong metric.
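One way to check whether a CPU limit is the bottleneck is to look at the container's cgroup counters. The sketch below assumes cgroup v2, where `cpu.stat` reports `nr_periods` and `nr_throttled`; the function name `throttle_ratio` and the sample text are illustrative, and the file location inside your containers (often `/sys/fs/cgroup/cpu.stat`) depends on the runtime.

```python
def throttle_ratio(cpu_stat_text):
    """Return the fraction of CFS periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no CPU limit set, or no periods recorded yet
    return stats.get("nr_throttled", 0) / periods


if __name__ == "__main__":
    # Made-up sample in the cpu.stat "key value" layout.
    sample = """usage_usec 98000000
nr_periods 1000
nr_throttled 850
throttled_usec 41000000"""
    print(f"{throttle_ratio(sample):.0%}")  # prints: 85%
```

A ratio that stays high while GPU utilization looks comfortable is a hint that the limit, not the model, is setting the pace.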
How Agentic Workloads Split Across the Cluster
Think of it like a restaurant: the GPU is the specialized station for the one dish that needs a blowtorch. The CPU fleet is everyone else—prep, plating, expediting, dishwashing, and the manager walking the floor.
🍳 CloudChef Recipe: Right-Size Agent Infrastructure on Kubernetes
Below is a practical sequence for platform teams. Names are illustrative—swap in your agent runtime (custom controller, LangGraph service, managed agent product, etc.).
Step 1 — Separate node pools (GPU vs agent workers)
Do not schedule browser sandboxes next to kube-system. Taint GPU nodes for inference; use a dedicated CPU pool for agents with a clear label.
apiVersion: v1
kind: Namespace
metadata:
  name: agent-platform
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-orchestrator
  namespace: agent-platform
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-orchestrator
  template:
    metadata:
      labels:
        app: agent-orchestrator
        tier: cpu-agent
    spec:
      nodeSelector:
        workload.cloudchef.io/tier: agent-cpu
      containers:
        - name: orchestrator
          image: registry.example/cloudchef/agent-orchestrator:1.4.0
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 8Gi
          env:
            - name: MAX_AGENT_STEPS
              value: "25"
            - name: TOOL_TIMEOUT_SECONDS
              value: "120"
Step 2 — Size tool workers for burst CPU, not “leftover” cores
Tool pods are where agents fork processes, run CLIs, or launch headless browsers. Give them explicit limits so one runaway investigation cannot starve its neighbors, and add a PodDisruptionBudget so node drains and upgrades do not take out the whole pool at once.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-tool-worker
  namespace: agent-platform
spec:
  replicas: 5
  selector:
    matchLabels:
      app: agent-tool-worker
  template:
    metadata:
      labels:
        app: agent-tool-worker
        tier: cpu-agent
    spec:
      nodeSelector:
        workload.cloudchef.io/tier: agent-cpu
      containers:
        - name: tool-runtime
          image: registry.example/cloudchef/agent-tools:1.4.0
          securityContext:
            runAsNonRoot: true
            allowPrivilegeEscalation: false
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
            limits:
              cpu: "8"
              memory: 16Gi
Step 3 — Autoscale on queue depth and step latency (not GPU %)
For event-driven agents, wire KEDA (or equivalent) to the queue that feeds tool workers—SQS, Kafka, RabbitMQ, or a Redis stream.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-tool-worker-scaler
  namespace: agent-platform
spec:
  scaleTargetRef:
    name: agent-tool-worker
  minReplicaCount: 2
  maxReplicaCount: 40
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/cloudchef-agent-tool-jobs
        queueLength: "30"
        awsRegion: us-east-1
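To get a feel for what those numbers mean, here is a back-of-the-envelope sketch of the arithmetic, assuming `queueLength` acts as a target number of messages per replica. The function `desired_replicas` is hypothetical and deliberately simplified: the real autoscaler also applies a tolerance band and stabilization windows, and which messages are counted depends on the scaler settings.

```python
import math


def desired_replicas(queue_messages, target_per_replica=30,
                     min_replicas=2, max_replicas=40):
    """Approximate replica count for a queue-length target, clamped to bounds."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(queue_messages / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    for backlog in (0, 450, 5000):
        print(backlog, desired_replicas(backlog))
    # prints:
    # 0 2
    # 450 15
    # 5000 40
```

If a backlog of 5,000 jobs is plausible for your agents, the ceiling of 40 workers at 4 CPU each is the number to take to your node pool sizing, not the GPU count.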
Step 4 — Cap cost and blast radius per run
Agents without budgets are all-you-can-eat buffets. Enforce step limits, token ceilings, and concurrency per tenant at the orchestrator.
# Illustrative guardrails (apply in your agent gateway or admission policy)
export MAX_AGENT_STEPS=25
export MAX_TOOL_CONCURRENCY=3
export MAX_WALL_CLOCK_SECONDS=900
export DENY_TOOLS="raw_exec,host_mount,cluster_admin"
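A minimal sketch of how an orchestrator might enforce those variables before each tool step is shown below. `RunBudget` and `BudgetExceeded` are hypothetical names, not part of any agent framework, and `MAX_TOOL_CONCURRENCY` is left out because it needs a shared limiter across runs rather than a per-run counter.

```python
import os
import time


class BudgetExceeded(Exception):
    """Raised when a run hits a step, wall-clock, or tool-policy limit."""


class RunBudget:
    def __init__(self, env=None, clock=time.monotonic):
        env = os.environ if env is None else env
        self.max_steps = int(env.get("MAX_AGENT_STEPS", "25"))
        self.max_seconds = float(env.get("MAX_WALL_CLOCK_SECONDS", "900"))
        raw_denied = env.get("DENY_TOOLS", "")
        self.denied = {t.strip() for t in raw_denied.split(",") if t.strip()}
        self._clock = clock
        self._started = clock()
        self.steps = 0

    def check(self, tool_name):
        """Call before every tool step; counts the step if it is allowed."""
        if tool_name in self.denied:
            raise BudgetExceeded(f"tool {tool_name!r} is denied")
        if self.steps >= self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} reached")
        if self._clock() - self._started > self.max_seconds:
            raise BudgetExceeded(f"wall clock limit {self.max_seconds}s reached")
        self.steps += 1


if __name__ == "__main__":
    budget = RunBudget(env={"MAX_AGENT_STEPS": "2", "DENY_TOOLS": "raw_exec"})
    budget.check("kubectl_get")
    budget.check("kubectl_get")
    try:
        budget.check("kubectl_get")
    except BudgetExceeded as exc:
        print(exc)  # prints: step limit 2 reached
```

Passing the clock and environment in as arguments keeps the guardrail testable without waiting fifteen minutes for a real timeout.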
Step 5 — Observe the CPU path like you observe the GPU path
Export traces that include per-step spans: plan, retrieve, tool, validate. Tag spans with tenant, agent_id, and tool_name so FinOps can answer “which workflow burned 400 CPU-minutes?”
kubectl top pods -n agent-platform --containers
kubectl describe node -l workload.cloudchef.io/tier=agent-cpu | grep -A5 "Allocated resources"
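Once spans carry those tags, the FinOps question becomes a simple roll-up. The sketch below assumes each exported span has been flattened to a dict with `tenant`, `agent_id`, `tool_name`, and a `cpu_seconds` value; that last field is an assumption, since tracing libraries do not record CPU time by default and you would have to attach it yourself.

```python
from collections import defaultdict


def cpu_minutes_by_workflow(spans):
    """Sum CPU-minutes per (tenant, agent_id, tool_name), largest first."""
    totals = defaultdict(float)
    for span in spans:
        key = (span["tenant"], span["agent_id"], span["tool_name"])
        totals[key] += span["cpu_seconds"] / 60.0
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    # Fictional spans for illustration only.
    spans = [
        {"tenant": "acme", "agent_id": "incident", "tool_name": "kubectl", "cpu_seconds": 120},
        {"tenant": "acme", "agent_id": "incident", "tool_name": "kubectl", "cpu_seconds": 60},
        {"tenant": "beta", "agent_id": "sec-review", "tool_name": "browser", "cpu_seconds": 600},
    ]
    for key, minutes in cpu_minutes_by_workflow(spans).items():
        print(key, minutes)
    # prints:
    # ('beta', 'sec-review', 'browser') 10.0
    # ('acme', 'incident', 'kubectl') 3.0
```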
✅ Best Practices
- Two-pool architecture — GPU for model serving; CPU for orchestration, tools, retrieval, and policy
- Right-size requests — measure p95 step latency under load; avoid 100m CPU requests on orchestrators
- Queue-backed tool workers — absorb spikes so GPUs do not sit idle while work queues on CPUs
- Sandbox by default — non-root, no hostPath, network policies between tool workers and sensitive namespaces
- Budget every loop — max steps, timeouts, and per-tenant concurrency
- Cache retrieval — identical RAG queries in a loop are duplicate prep work
- Prefer smaller models for routing — use a cheap CPU-friendly classifier to decide when to escalate to the big GPU model
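The retrieval caching item in the list above can be sketched in a few lines. This is one possible approach under simple assumptions: `make_cached_retriever` and the `search` callable are hypothetical, queries are normalized only by case and whitespace, and there is no expiry, so stale results are possible if the index changes mid-run.

```python
from functools import lru_cache


def make_cached_retriever(search, maxsize=256):
    """Wrap a search(query, top_k) callable with an in-memory LRU cache."""

    @lru_cache(maxsize=maxsize)
    def cached(query, top_k):
        # Tuples are immutable, so cached results cannot be edited by callers.
        return tuple(search(query, top_k))

    def retrieve(query, top_k=5):
        normalized = " ".join(query.lower().split())
        return cached(normalized, top_k)

    retrieve.cache_info = cached.cache_info
    return retrieve


if __name__ == "__main__":
    calls = []

    def fake_search(query, top_k):
        calls.append(query)  # stands in for a vector DB round trip
        return [f"{query}-{i}" for i in range(top_k)]

    retrieve = make_cached_retriever(fake_search)
    retrieve("Pod  CrashLoop", top_k=2)
    retrieve("pod crashloop", top_k=2)
    print(len(calls), retrieve.cache_info().hits)  # prints: 1 1
```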
⚠️ Common Mistakes
- Treating agent pods like lightweight “API gateways” with 200m CPU requests
- Autoscaling only on GPU utilization while CPU queues grow
- Running tools in the same namespace as production data stores without policy
- Ignoring embedding and rerank services when sizing CPU (they are easy to forget)
- No per-tenant cost attribution—agents make cloud bills opaque fast
- Assuming managed agents remove the need for your own worker pool (they often shift, not eliminate, CPU load)
Continue Your CloudChef Journey
If you are building the agent layer on Kubernetes, these recipes pair well with this topic:
- AWS DevOps Agent and EKS-minded automation — operations agents in production
- AWS Security Agent review — agentic security workflows and guardrails
- Karpenter + KEDA patterns for burst CPU workers (see CloudChef drafts on autoscaling queues)
References
- Kubernetes — Manage CPU and memory resources
- KEDA — Kubernetes Event-driven Autoscaling
- AWS DevOps Agent (example of enterprise agentic operations)
- OpenTelemetry — tracing multi-step agent workflows
Content disclaimer
Unless stated otherwise, examples use fictional or illustrative data (cluster names, ARNs, image tags, queue URLs, and account IDs). Always read, adapt, and test commands, manifests, and agent policies in a non-production environment first. In production, use change control, backups, and peer review so you do not cause data loss, secret exposure, misconfiguration, or outages. Agent frameworks and cloud pricing change frequently—validate against your vendor docs before you scale.
🔥 CloudChef Pro Tip
When leadership asks for “more GPUs,” show them two metrics side by side: GPU utilization on inference and CPU queue time on agent steps. If queue time climbs while GPUs look comfortable, you do not have a model problem—you have an expediter problem. Fix the CPU pool, the tool sandbox limits, and the autoscaler triggers first; only then chase bigger cards.
Final Thoughts
Agentic AI is reshaping AI infrastructure because the unit of work changed—from tokens to tasks. Tasks need coordination, tools, retrieval, and governance, and that stack runs hot on CPUs even when the headline model still lives on a GPU.
Platform engineers who split pools, scale on the right signals, and budget every loop will ship agents that feel fast and survive finance review. Everyone else will keep buying GPUs for a kitchen that needed more line cooks.
Plan for both stations. Your cluster will thank you at 2 a.m.