Platform teams running LLM workloads or AI agents on Kubernetes are increasingly facing networking challenges that traditional infrastructure was never designed to handle.
In practice, some teams have seen AI chatbots consume thousands of dollars in API credits over a single weekend, not due to a breach but because of a misconfigured retry loop that a traditional API gateway interpreted as normal traffic. The gateway continued returning HTTP 200 responses, with no visibility into the underlying cost implications.
This is the kind of problem that needs to be addressed at the network level.
According to the CNCF Annual Survey released in January 2026, 82% of container users run Kubernetes in production, and 66% of organizations hosting generative AI models use Kubernetes for some or all inference workloads. Companies like Google, Red Hat, and IBM are among those running production AI infrastructure on Kubernetes today. The platform is clearly the standard. The networking layer, however, has not kept up.
The tools built for routing web app traffic do not work well for AI workloads. A traditional API gateway reads HTTP headers, checks login tokens (JWTs), and blocks bad IP addresses. But it cannot read what is inside the request body. In AI systems, the actual instruction from the user is written in plain text inside the JSON body. The gateway just passes it through; it has no way to tell whether the user is asking for a weather update or trying to access a database.
The billing model is equally misaligned: traditional gateways have no visibility into what each request actually costs, which makes financial controls nearly impossible.
To run AI properly in production, smarter logic needs to move into the network layer itself. That is what the AI Gateway does. In 2026, this is fast becoming the standard approach for running AI on Kubernetes.
The Ingress-NGINX Migration Cliff
In March 2026, the Kubernetes community officially retired the Ingress-NGINX controller. As confirmed by Kubernetes SIG Network, there are now no more releases, no bug fixes, and no security patches for Ingress-NGINX. Teams still running it in production are carrying security and compliance risk that grows every month. According to Kubernetes data, about 50% of cloud-native environments were using Ingress-NGINX at the time of retirement.
Teams still using NGINX annotations to manage routing rules (CORS, URL rewrites, rate limits) will find that moving to the Kubernetes Gateway API opens up a much better foundation for AI workloads.
The old Ingress system had a design problem. It mixed infrastructure setup and app routing rules in one file. Different vendors added their own custom annotations to work around the limits. This made routing rules dependent on one specific controller. Switching meant rewriting everything.
The Kubernetes Gateway API fixes this by splitting the work into three clear roles:
- Infrastructure teams manage the GatewayClass that tells Kubernetes which controller to use.
- Platform teams manage the Gateway itself, which sets up ports, TLS certificates, and how traffic enters the cluster.
- App and ML teams manage their own HTTPRoute files that define where traffic should go.
No more mixing everything in one file. No more controller lock-in. The community's Ingress2Gateway tool converts old NGINX annotations automatically. Both systems can run at the same time while testing. Once the new setup works, DNS gets switched and the old system is removed.
The examples below are Kubernetes manifest files: YAML configuration files that are applied to a cluster with the kubectl apply -f command. They are not terminal commands.
Legacy Ingress manifest:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: my-service
            port:
              number: 80
```
Gateway API HTTPRoute (the new way):
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-service
  namespace: default
spec:
  parentRefs:
  - name: main-gateway
  hostnames:
  - "api.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    - name: my-service
      port: 80
```
The app team owns the HTTPRoute. The platform team owns the Gateway. No annotations. No mixing.
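For context, here is a minimal sketch of the resources the other two roles own. The GatewayClass controller name below assumes Envoy Gateway; the Gateway matches the main-gateway that the HTTPRoute above attaches to, and the TLS Secret name is a placeholder.

```yaml
# Owned by the infrastructure team: which controller implements Gateways.
# The controllerName assumes Envoy Gateway; swap it for your controller.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
# Owned by the platform team: ports, TLS, and which namespaces may attach routes.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: main-gateway
  namespace: default
spec:
  gatewayClassName: envoy
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    hostname: "api.example.com"
    tls:
      mode: Terminate
      certificateRefs:
      - name: api-example-com-tls   # placeholder Secret holding the certificate
    allowedRoutes:
      namespaces:
        from: All
```

With this split, the platform team can rotate certificates or add listeners here without touching any team's HTTPRoute.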
What is a Kubernetes AI Gateway
An AI Gateway on Kubernetes is not a new product bought from a vendor. It is a network gateway that follows the Kubernetes Gateway API standard but with extra features built in to handle AI traffic.
The Envoy AI Gateway is a strong example. It is an open source project under the CNCF Envoy ecosystem. Instead of requiring every developer to add retry logic, failover code, and token counting into their Python application, the Envoy AI Gateway handles all of that centrally. It can also change the request before it reaches the AI model, adding a system prompt, fixing the output format, or switching to a different model, without touching any app code.
Here is a quick comparison showing how a traditional gateway differs from an AI Gateway:
| Feature | Traditional API Gateway | Kubernetes AI Gateway |
|---|---|---|
| Routing model | Reads headers and paths only | Understands the content of the request |
| Best use case | Normal web apps and REST APIs | AI models, agents, and RAG pipelines |
| Load balancing | Round-robin, least connections | Based on GPU memory and cache status |
| Cost control | Counts requests per second | Counts input and output tokens |
| Security | Blocks bad IPs, checks login tokens | Blocks prompt injection, removes personal data |
| Where rules live | Inside each app | In the network layer, one central place |
Smart Routing with the Gateway API Inference Extension
Old load balancing, sending requests to servers in round-robin order, does not work well for AI models. You might send a big summarization job to a GPU server that is already full, while another identical server is sitting empty. The system has no way to know.
The Kubernetes community built the Gateway API Inference Extension to fix this. It turns a normal gateway into a smart AI gateway that understands GPU capacity.
It adds a new resource called InferencePool. This groups together all the servers running the same AI model. Here is a simple example using the stable v1 API:
```yaml
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: vllm-qwen3-32b
  namespace: default
spec:
  selector:
    app: vllm-qwen3-32b
  targetPortNumber: 8000
  extensionRef:
    name: vllm-qwen3-32b-epp
    port: 9002
```
This uses the stable v1 version. If you are on an older release, update the apiVersion field.
The extensionRef points to the Endpoint Picker (EPP). The EPP is a filter that sits inside the request path and checks GPU servers in real time before deciding where to send each request. It looks at two things:
- Warmed Prefix Caches: AI models save work they have already done in a memory cache called a KV (Key-Value) cache. If a user sent a large document earlier, that document is already saved in the memory of the server that processed it. The EPP sends the next question about that same document to that same server. The model does not have to read the document again. This cuts the Time to First Token (TTFT), the wait time before the model starts writing a reply.
- Low-Rank Adaptation (LoRA) Adapter Awareness: Many teams run hundreds of fine-tuned AI models using a method called LoRA. These are small add-ons that sit on top of one big base model. The EPP checks which add-ons are already loaded on which servers. It only sends a request to a server where the right add-on is already in memory so the server does not have to load it fresh, which takes extra time.
Setting up the EPP takes real work. You need to connect your gateway, your metrics system, and your model server together. It is not a quick setup. But once it is running at scale, the improvement in GPU usage makes it one of the most valuable infrastructure changes your team can make.
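Once the pool and EPP exist, traffic reaches them through a normal HTTPRoute that references the InferencePool as a backend instead of a plain Service. The sketch below reuses the main-gateway from the earlier examples; the route name and path are illustrative, and the exact backend group and kind should be checked against your gateway's Inference Extension documentation.

```yaml
# Routes chat completion traffic to the InferencePool defined above,
# letting the Endpoint Picker choose the specific model server replica.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen3-32b-route          # illustrative name
  namespace: default
spec:
  parentRefs:
  - name: main-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1/chat/completions
    backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-qwen3-32b
```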
Semantic Caching and Routing
Once hardware-aware routing is handling which server gets a request, the next step is making the gateway smarter about the content of the request itself.
Semantic Caching
Semantic caching is one of the best ways to reduce AI costs right now. It works in production today.
Normal caching only works if the request is exactly the same word for word. If one user asks "How do I reset my password?" and another asks "What is the password reset process?" a normal cache misses both times and sends both to the AI model. Both questions mean the same thing.
Semantic caching works differently. It converts the question into a vector, a list of numbers that captures the meaning of the question. Then it checks a vector database (Milvus and ChromaDB are common choices) to see if a similar question was already answered. If the match is close enough, the saved answer is sent back without calling the AI model at all.
For teams that handle many repeated questions (support bots, company knowledge bases, help documentation), this cuts response time from a few seconds to under 100 milliseconds. The cost for those repeated questions drops to almost zero. Running a vector database costs far less than running an AI model for every question.
Semantic Routing
Semantic routing picks the right model based on how hard the question is, instead of skipping the model entirely.
A simple question like "What is the capital of France?" goes to a small, fast, cheap model. A hard question that needs multi-step thinking goes to a bigger, more powerful model. The vLLM Semantic Router adds this feature to Envoy-based setups.
Semantic routing is still new in early 2026. Always set up a backup route and watch the results closely. If the router sends a hard question to a small model by mistake, your users will notice.
FinOps: Controlling AI Costs with Token-Aware Limits
Kubernetes makes scaling easy, which also makes it risky when AI workloads go wrong. A retry loop with a bug can spend thousands of dollars before anyone notices. A standard rate limit that counts requests does not help; it does not know how expensive each request is. Token-aware rate limiting does.
Different gateways use different names for this. Envoy AI Gateway and Kuadrant use TokenRateLimitPolicy. Other gateways may use BackendTrafficPolicy with token rules. The idea is the same:
- Input Limiting: Controls how many requests can reach the AI model at one time. This stops the queue from getting too full.
- Output Deduction: As the AI model sends back its answer, the gateway counts how many tokens were used and takes them out of that team's running budget in real time. This is where the actual cost control happens.
Setting Budgets by Namespace and Model
Budgets work best in two layers. First, a total token budget is set for each Kubernetes Namespace. A Namespace usually maps to one team or one product. This creates a hard limit that cannot be exceeded.
Second, limits are set by model. For example, a team might get 500,000 tokens per minute (TPM) for a small local model like Llama-3-8B, but only 10,000 TPM for an expensive cloud model like GPT-4o or Claude Opus. This structure naturally leads developers to use cheap models for simple tasks and save expensive models for when they are actually needed.
What Happens When a Team Runs Out
When a team uses up its token budget, the gateway stops the next request and returns an HTTP 429 error. Most gateways return a plain 429 with no extra information, which can break application error handling.
A properly configured AI Gateway returns a JSON message with type: rate_limit_error and a retry-after header telling the app when to try again. But this needs to be set up manually in the gateway config. It does not happen by default.
On the app side, developers need retry code that handles 429 errors: wait, try again, and wait a little longer each time. This is called exponential backoff. A better setup is when the gateway automatically switches to a cheaper backup model when the main model's budget runs out, so the user never notices.
Security: Protecting AI at the Network Level
AI workloads carry a different kind of security risk compared to normal web apps. When a user sends a message to an AI, they are giving it instructions. If the AI has access to tools, like running code or reading databases, a harmful instruction can cause real damage.
Stopping Prompt Injection
The main attack type is called prompt injection (OWASP LLM01). This is when someone hides a harmful instruction inside a normal-looking message to trick the AI into doing something it should not. Against an AI that can run code or call APIs, this is as dangerous as someone getting direct access to a production system.
The AI Gateway stops this by checking the content of each message before it reaches the model. Using a TrafficPolicy, rules can be set that block certain patterns.
The example below is specific to kGateway, an open source CNCF project built on Envoy that implements the Kubernetes Gateway API. It is not part of the core Gateway API standard. Envoy AI Gateway and other providers have similar features with slightly different configurations.
Note: The manifest below is a Kubernetes YAML configuration file. It is applied with kubectl apply -f, not run as a terminal command.
TrafficPolicy manifest (kGateway):
```yaml
apiVersion: gateway.kgateway.dev/v1alpha1
kind: TrafficPolicy
metadata:
  name: openai-prompt-guard
  namespace: kgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai-route
  ai:
    promptGuard:
      request:
        customResponse:
          message: "Rejected: Security policy violation."
        regex:
          action: REJECT
          matches:
          - pattern: "ignore previous instructions"
            name: "PromptInjection"
```
For more complex checks, like checking whether a message is toxic or off-topic, the gateway can send the message to a separate safety tool such as NVIDIA NeMo Guardrails before passing it to the model.
Keeping API Keys Safe and Hiding Personal Data
As your AI agents start calling outside services like OpenAI or Anthropic, you need to manage API keys carefully. Putting API keys directly into your app Pods creates a security risk. Every new deployment is one more place where a key can leak.
The better approach is to use the AI Gateway as the only place that holds and uses API keys. All outbound AI requests go through the gateway. The gateway adds the key for you. App code never sees a raw API key.
For industries with strict data rules, the gateway also removes personal data from requests before they leave your network. Things like patient names, ID numbers, and financial records get masked before the message reaches a cloud AI model. The answer comes back with the masked data restored so the right people inside your company still see the full context.
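As a sketch of the masking half, the same kGateway promptGuard block used in the prompt injection example can be pointed at outbound requests with a MASK action instead of REJECT. The pattern below is a simplified US Social Security number match for illustration; the exact action names and any built-in PII detectors should be checked against your gateway's documentation.

```yaml
# Masks SSN-shaped strings in requests before they leave the cluster.
# Assumes the same kGateway TrafficPolicy/promptGuard schema as the
# prompt injection example in the previous section.
apiVersion: gateway.kgateway.dev/v1alpha1
kind: TrafficPolicy
metadata:
  name: pii-masking
  namespace: kgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai-route
  ai:
    promptGuard:
      request:
        regex:
          action: MASK
          matches:
          - pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b"   # illustrative SSN pattern
            name: "SSN"
```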
In healthcare, tools like Microsoft's Nuance DAX Copilot use this to clean patient data from doctor-patient recordings before sending them to the AI, keeping everything inside HIPAA rules. In finance, HSBC's Dynamic Risk Assessment system built with Google Cloud analyzes transactions for suspicious patterns using strict data controls. Organizations running similar cloud AI models use AI Gateways to follow the audit and data boundary rules required by HIPAA, GDPR, and newer AI-specific regulations.
Platform Engineering: Making It Simple for Your Team
You cannot ask your data scientists and ML engineers to also write Kubernetes routing files, set up security filters, and manage rate limit configs. That is too much. When that work falls on the wrong people, things get set up inconsistently and security gaps appear.
The answer is Platform Engineering. Your platform team builds Kubernetes like a product. ML engineers get a simple self-service portal (tools like Backstage or Port are popular choices). When an ML engineer clicks 'Deploy New Model,' the portal runs a pipeline automatically, handled by ArgoCD or Flux. Here is what happens behind the scenes:
- The model gets deployed and Gateway API routing rules are created
- Crossplane sets up extra cloud resources like a Milvus vector database for semantic caching
- Default security rules are applied to block unknown outbound traffic
The ML engineer just gets a URL and a token budget. They do not touch any of the infrastructure.
Kyverno, a policy checker that runs inside Kubernetes, performs these checks automatically before any new AI deployment goes live: Does this deployment have a token rate limit? Is this AI model approved for the data sensitivity level of this team? Is the PII masking pipeline configured (the same inline data scrubbing covered in the Security section above)? If any check fails, the deployment is blocked with a clear message. The problem gets fixed before it ever reaches production.
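As a sketch of what one of these guardrails can look like, the policy below assumes a team convention where every AI Deployment carries a hypothetical ai.example.com/token-budget annotation that the platform pipeline reads to generate the matching rate limit policy; the namespace label used for matching is also an assumption.

```yaml
# Blocks AI deployments that do not declare a token budget.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-token-budget
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-token-budget-annotation
    match:
      any:
      - resources:
          kinds:
          - Deployment
          namespaceSelector:
            matchLabels:
              workload-type: ai-inference   # hypothetical label on AI namespaces
    validate:
      message: "AI deployments must declare a token budget via the ai.example.com/token-budget annotation."
      pattern:
        metadata:
          annotations:
            ai.example.com/token-budget: "?*"   # any non-empty value
```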
From Simple Chatbots to AI Agents: MCP and A2A
The first version of enterprise AI was simple from a network point of view. A user sends a message. The model replies. The connection closes. Clean and predictable.
Things are now much more complex. AI agents in 2026 keep memory between sessions, write and run code, read from company databases, call external tools, and work together with other AI agents on long tasks. A single user request might trigger a chain of agents (one to fetch data, one to analyze it, one to write a report), all working at the same time without a human managing each step.
This changes what the network needs to handle. Simple request-response was not built for this. Two open standards now handle it in Kubernetes.
The Model Context Protocol (MCP)
The Model Context Protocol (MCP) is a standard that controls how an AI agent safely connects to outside tools and databases. Before MCP, every connection was built in a custom way. If the agent was attacked through prompt injection, it could take actions using the same permissions as the server it was running on.
The AI Gateway acts as an MCP proxy. It handles login using OAuth 2.0. It sends requests to the right tool server. It uses authorization rules written in Common Expression Language (CEL) to make sure each agent can only use the tools it is allowed to use. Each tool runs in its own container. The platform team can update one tool without touching the agent itself. Popular AI frameworks like LangGraph and CrewAI already support MCP connections that work with this setup.
The Agent-to-Agent (A2A) Protocol
MCP handles the connection between one agent and its tools. The Agent-to-Agent (A2A) protocol, supported by the Linux Foundation, handles the connection between multiple agents talking to each other.
A2A uses standard web technologies, JSON-RPC 2.0 and Server-Sent Events (SSE), to let agents share what they can do, hand off tasks to each other, and send back updates to the agent that gave them the task. Each agent keeps its own internal memory and tools private.
A simple way to think about it: MCP works like Layer 2 networking; it gives an agent direct, close access to specific local resources. A2A works like Layer 3 networking; it routes tasks between agents across different parts of the system. This is just a comparison to help you picture it, not a technical protocol mapping.
| Topic | Model Context Protocol (MCP) | Agent-to-Agent (A2A) Protocol |
|---|---|---|
| What it does | Connects one agent to its tools | Connects agents to other agents |
| How it works | Agent talks to tool servers | Agents share tasks and updates |
| Network comparison | Like Layer 2: direct local access | Like Layer 3: routes across domains |
| What it uses | Context sharing, tool exposure | JSON-RPC 2.0, SSE, Agent Cards |
| In Kubernetes | Agent reads from a company database via AI Gateway | One agent hands a task off to a specialist agent |
Test It Locally Before Going to Production
The good news is that most of these Gateway API features can be tested on your own computer before you push anything to a cloud environment. You can use Minikube or kind on a Linux or Mac machine.
Here is a simple way to get started:
- Install Minikube and enable the ingress addon: minikube addons enable ingress (the Gateway API resources themselves are typically installed along with the gateway in the next step)
- Install the Envoy AI Gateway using its Helm chart from the official docs
- Set up a small InferencePool that points to a local Ollama model or a test server (see the sketch below)
- Add a rate limit policy and test what happens when you hit the limit using a curl loop
This gives you a fast way to check your config files, test routing rules, and make sure your 429 error messages are set up correctly before you go anywhere near a production environment.
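For the InferencePool step, a minimal local pool might look like the sketch below. It assumes an Ollama Deployment labeled app: ollama serving on Ollama's default port 11434 and an Endpoint Picker Service named ollama-epp; these names are placeholders for whatever your local setup uses.

```yaml
# A small local InferencePool for experimentation, mirroring the
# production example earlier but pointed at a local Ollama instance.
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: ollama-local
  namespace: default
spec:
  selector:
    app: ollama            # label on the local Ollama Pods (assumption)
  targetPortNumber: 11434  # Ollama's default API port
  extensionRef:
    name: ollama-epp       # placeholder Endpoint Picker Service
    port: 9002
```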
One more thing: your regular app servers (Go, Node.js, Python) keep running through normal HTTPRoute rules alongside AI workloads. The InferencePool only handles AI model traffic. Both can run on the same gateway at the same time without getting in each other's way.
Conclusion
The tools for building AI networking on Kubernetes are ready and working in production at real companies today. Teams that set up these foundations now will have a much easier time as AI workloads grow.
Here are three things you can start with:
- Move to the Gateway API. Use the Ingress2Gateway tool to convert your old NGINX rules. Run both systems at the same time while you test. Switch your DNS once everything checks out. After that, every feature in this article is available to you.
- Set up an InferencePool and try the EPP. Connect an Envoy AI Gateway to a local vLLM or Ollama instance. Check your Time to First Token (TTFT) with and without cache-aware routing. The numbers you get make it easy to explain the value to your team.
- Add token-based rate limits now. Stop limiting by request count. Switch to token-based limits using TokenRateLimitPolicy or BackendTrafficPolicy token rules. Set up your 429 error messages to return proper JSON with a retry-after header. Make sure your app teams have retry logic in place. This stops the kind of weekend billing problem described at the start of this article.
You do not need to build all of this at once. Start with the Gateway API move and add things step by step. The important thing is to get AI cost controls and security rules out of your app code and into the network layer where your platform team can manage everything in one place.

