Architecting AI-Native Kubernetes Clusters with AI Gateways

Sahil Deshmukh

Platform teams running LLM workloads or AI agents on Kubernetes are increasingly facing networking challenges that traditional infrastructure was never designed to handle.

In practice, some teams have seen AI chatbots consume thousands of dollars in API credits over a single weekend, not because of a breach, but because of a misconfigured retry loop that a traditional API gateway interpreted as normal traffic. The gateway kept returning HTTP 200 responses, with no visibility into the underlying cost.

This is the kind of problem that needs to be addressed at the network level.

According to the CNCF Annual Survey released in January 2026, 82% of container users run Kubernetes in production, and 66% of organizations hosting generative AI models use Kubernetes for some or all inference workloads. Companies like Google, Red Hat, and IBM are among those running production AI infrastructure on Kubernetes today. The platform is clearly the standard. The networking layer, however, has not kept up.

The tools built for routing web app traffic do not work well for AI workloads. A traditional API gateway reads HTTP headers, checks login tokens (JWTs), and blocks bad IP addresses. But it cannot read what is inside the request body. In AI systems, the actual instruction from the user is written in plain text inside the JSON body. The gateway just passes it through; it has no way to tell whether the user is asking for a weather update or trying to access a database.

The billing model is equally misaligned: traditional gateways have no visibility into what each request actually costs, which makes financial controls nearly impossible.

To run AI properly in production, smarter logic needs to move into the network layer itself. That is what the AI Gateway does. In 2026, this is fast becoming the standard approach for running AI on Kubernetes.

The Ingress-NGINX Migration Cliff

In March 2026, the Kubernetes community officially retired the Ingress-NGINX controller. As confirmed by Kubernetes SIG Network, there are now no more releases, no bug fixes, and no security patches for Ingress-NGINX. Teams still running it in production are carrying security and compliance risk that grows every month. According to Kubernetes data, about 50% of cloud-native environments were using Ingress-NGINX at the time of retirement.

Teams still using NGINX annotations to manage routing rules (CORS, URL rewrites, rate limits) will find that moving to the Kubernetes Gateway API opens up a much better foundation for AI workloads.

The old Ingress system had a design problem. It mixed infrastructure setup and app routing rules in one file. Different vendors added their own custom annotations to work around the limits. This made routing rules dependent on one specific controller. Switching meant rewriting everything.

The Kubernetes Gateway API fixes this by splitting the work into three clear roles:

  • Infrastructure teams manage the GatewayClass, which tells Kubernetes which controller to use.
  • Platform teams manage the Gateway itself, which sets up ports, TLS certificates, and how traffic enters the cluster.
  • App and ML teams manage their own HTTPRoute files, which define where traffic should go.

No more mixing everything in one file. No more controller lock-in. The community's Ingress2Gateway tool converts old NGINX annotations automatically. Both systems can run at the same time while testing. Once the new setup works, DNS gets switched and the old system is removed.
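
As a concrete anchor for the role split above, a GatewayClass is a small manifest that binds the cluster to one controller. The sketch below assumes Envoy Gateway; substitute whatever controllerName your vendor documents:

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy-gateway
spec:
  # Assumed value for Envoy Gateway; each controller registers its own name
  controllerName: gateway.envoyproxy.io/gatewayclass-controller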

The examples below are Kubernetes manifest files: YAML configuration files that are applied to a cluster with the kubectl apply -f command. They are not terminal commands.

Legacy Ingress manifest:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: my-service
            port:
              number: 80

Gateway API HTTPRoute (the new way):

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-service
  namespace: default
spec:
  parentRefs:
  - name: main-gateway
  hostnames:
  - "api.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    - name: my-service
      port: 80

The app team owns the HTTPRoute. The platform team owns the Gateway. No annotations. No mixing.
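
For reference, the main-gateway that the HTTPRoute's parentRefs points at is a separate, platform-owned resource. A minimal sketch (the listener layout and certificate name are illustrative):

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: main-gateway
  namespace: default
spec:
  gatewayClassName: envoy-gateway   # must match a GatewayClass in the cluster
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    hostname: "api.example.com"
    tls:
      mode: Terminate
      certificateRefs:
      - name: api-example-com-tls   # a TLS Secret managed by the platform team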

What is a Kubernetes AI Gateway

An AI Gateway on Kubernetes is not a new product bought from a vendor. It is a network gateway that follows the Kubernetes Gateway API standard, but with extra features built in to handle AI traffic.

The Envoy AI Gateway is a strong example. It is an open source project under the CNCF Envoy ecosystem. Instead of requiring every developer to add retry logic, failover code, and token counting into their Python application, the Envoy AI Gateway handles all of that centrally. It can also change the request before it reaches the AI model (adding a system prompt, fixing the output format, or switching to a different model) without touching any app code.

Here is a quick comparison showing how a traditional gateway differs from an AI Gateway:

| Feature | Traditional API Gateway | Kubernetes AI Gateway |
|---|---|---|
| Routing model | Reads headers and paths only | Understands the content of the request |
| Best use case | Normal web apps and REST APIs | AI models, agents, and RAG pipelines |
| Load balancing | Round-robin, least connections | Based on GPU memory and cache status |
| Cost control | Counts requests per second | Counts input and output tokens |
| Security | Blocks bad IPs, checks login tokens | Blocks prompt injection, removes personal data |
| Where rules live | Inside each app | In the network layer, one central place |

Smart Routing with the Gateway API Inference Extension

Old load balancing (sending requests to servers in round-robin order) does not work well for AI models. You might send a big summarization job to a GPU server that is already full, while another identical server is sitting empty. The system has no way to know.

The Kubernetes community built the Gateway API Inference Extension to fix this. It turns a normal gateway into a smart AI gateway that understands GPU capacity.

It adds a new resource called InferencePool. This groups together all the servers running the same AI model. Here is a simple example using the stable v1 API:

apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: vllm-qwen3-32b
  namespace: default
spec:
  selector:
    app: vllm-qwen3-32b
  targetPortNumber: 8000
  extensionRef:
    name: vllm-qwen3-32b-epp
    port: 9002

This uses the stable v1 version. If you are on an older release, update the apiVersion field.

The extensionRef points to the Endpoint Picker (EPP). The EPP is a filter that sits inside the request path and checks GPU servers in real time before deciding where to send each request. It looks at two things:

  • Warmed Prefix Caches: AI models save work they have already done in a memory cache called a KV (Key-Value) cache. If a user sent a large document earlier, that document is already saved in the memory of the server that processed it. The EPP sends the next question about that same document to that same server. The model does not have to read the document again. This cuts the Time to First Token (TTFT), the wait time before the model starts writing a reply.
  • Low-Rank Adaptation (LoRA) Adapter Awareness: Many teams run hundreds of fine-tuned AI models using a method called LoRA. These are small add-ons that sit on top of one big base model. The EPP checks which add-ons are already loaded on which servers. It only sends a request to a server where the right add-on is already in memory, so the server does not have to load it fresh, which takes extra time.

Setting up the EPP takes real work. You need to connect your gateway, your metrics system, and your model server together. It is not a quick setup. But once it is running at scale, the improvement in GPU usage makes it one of the most valuable infrastructure changes your team can make.
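
To actually send traffic through the pool, an HTTPRoute references the InferencePool as its backend instead of a plain Service. A sketch, assuming the route and gateway names used earlier in this article; the group field must match the InferencePool's apiVersion in your cluster:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
  namespace: default
spec:
  parentRefs:
  - name: main-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1/chat/completions
    backendRefs:
    # InferencePool instead of a Service; the EPP picks the actual endpoint
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-qwen3-32b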

Semantic Caching and Routing

Once hardware-aware routing is handling which server gets a request, the next step is making the gateway smarter about the content of the request itself.

Semantic Caching

Semantic caching is one of the best ways to reduce AI costs right now. It works in production today.

Normal caching only works if the request is exactly the same word for word. If one user asks "How do I reset my password?" and another asks "What is the password reset process?", a normal cache misses both times and sends both to the AI model, even though both questions mean the same thing.

Semantic caching works differently. It converts the question into a vector, a list of numbers that captures the meaning of the question. Then it checks a vector database (Milvus and ChromaDB are common choices) to see if a similar question was already answered. If the match is close enough, the saved answer is sent back without calling the AI model at all.

In teams that handle many repeated questions (support bots, company knowledge bases, help documentation), this cuts response time from a few seconds to under 100 milliseconds. The cost for those repeated questions drops to almost zero. Running a vector database costs far less than running an AI model for every question.

Semantic Routing

Semantic routing picks the right model based on how hard the question is, instead of skipping the model entirely.

A simple question like "What is the capital of France?" goes to a small, fast, cheap model. A hard question that needs multi-step thinking goes to a bigger, more powerful model. The vLLM Semantic Router adds this feature to Envoy-based setups.

Semantic routing is still new in early 2026. Always set up a backup route and watch the results closely. If the router sends a hard question to a small model by mistake, your users will notice.

FinOps: Controlling AI Costs with Token-Aware Limits

Kubernetes makes scaling easy, which also makes it risky when AI workloads go wrong. A retry loop with a bug can spend thousands of dollars before anyone notices. A standard rate limit that counts requests does not help, because it does not know how expensive each request is. Token-aware rate limiting does.

Different gateways use different names for this. Envoy AI Gateway and Kuadrant use TokenRateLimitPolicy. Other gateways may use BackendTrafficPolicy with token rules. The idea is the same:

  • Input Limiting: Controls how many requests can reach the AI model at one time. This stops the queue from getting too full.
  • Output Deduction: As the AI model sends back its answer, the gateway counts how many tokens were used and takes them out of that team's running budget in real time. This is where the actual cost control happens.

Setting Budgets by Namespace and Model

Budgets work best in two layers. First, a total token budget is set for each Kubernetes Namespace. A Namespace usually maps to one team or one product. This creates a hard limit that cannot be exceeded.

Second, limits are set by model. For example, a team might get 500,000 tokens per minute (TPM) for a small local model like Llama-3-8B, but only 10,000 TPM for an expensive cloud model like GPT-4o or Claude Opus. This structure naturally leads developers to use cheap models for simple tasks and save expensive models for when they are actually needed.
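
As a sketch of what this two-layer setup can look like, here is an illustrative token limit in the Kuadrant TokenRateLimitPolicy style. The namespace, route name, limit names, and exact field layout are assumptions; check your gateway's documentation for the schema your version expects:

apiVersion: kuadrant.io/v1alpha1
kind: TokenRateLimitPolicy
metadata:
  name: team-a-token-budget
  namespace: ml-team-a              # hypothetical team namespace
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: llm-route                 # hypothetical route name
  limits:
    small-local-model:
      rates:
      - limit: 500000               # 500k tokens per minute for a small local model
        window: 1m
    expensive-cloud-model:
      rates:
      - limit: 10000                # 10k TPM for an expensive cloud model
        window: 1m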

What Happens When a Team Runs Out

When a team uses up their token budget, the gateway stops the next request and returns an HTTP 429 error. Most gateways return a plain 429 with no extra information, and that can break application error handling.

A properly configured AI Gateway returns a JSON message with type: rate_limit_error and a retry-after header telling the app when to try again. But this needs to be set up manually in the gateway config. It does not happen by default.
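
What a well-configured response might look like (the exact body shape is an assumption, modeled on common provider error formats):

HTTP/1.1 429 Too Many Requests
retry-after: 30
content-type: application/json

{
  "error": {
    "type": "rate_limit_error",
    "message": "Token budget exhausted for this namespace. Retry after 30 seconds."
  }
}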

On the app side, developers need retry code that handles 429 errors: wait, try again, and wait longer each time. This is called exponential backoff. A better setup is a gateway that automatically switches to a cheaper backup model when the main model's budget runs out, so the user never notices.

Security: Protecting AI at the Network Level

AI workloads carry a different kind of security risk compared to normal web apps. When a user sends a message to an AI, they are giving it instructions. If the AI has access to tools  like running code or reading databases  a harmful instruction can cause real damage.

Stopping Prompt Injection

The main attack type is called prompt injection (OWASP LLM01). This is when someone hides a harmful instruction inside a normal-looking message to trick the AI into doing something it should not. Against an AI that can run code or call APIs, this is as dangerous as someone getting direct access to a production system.

The AI Gateway stops this by checking the content of each message before it reaches the model. Using a TrafficPolicy, rules can be set that block certain patterns.

The example below is specific to kGateway, an open source CNCF project built on Envoy that implements the Kubernetes Gateway API. It is not part of the core Gateway API standard. Envoy AI Gateway and other providers have similar features with slightly different configurations.

Note: The manifest below is a Kubernetes YAML configuration file. It is applied with kubectl apply -f, not run as a terminal command.

TrafficPolicy Manifest  kGateway / Envoy AI Gateway:

apiVersion: gateway.kgateway.dev/v1alpha1
kind: TrafficPolicy
metadata:
  name: openai-prompt-guard
  namespace: kgateway-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: openai-route
  ai:
    promptGuard:
      request:
        customResponse:
          message: "Rejected: Security policy violation."
        regex:
          action: REJECT
          matches:
            - pattern: "ignore previous instructions"
              name: "PromptInjection"

For more complex checks, like checking whether a message is toxic or off-topic, the gateway can send the message to a separate safety tool such as NVIDIA NeMo Guardrails before passing it to the model.

Keeping API Keys Safe and Hiding Personal Data

As your AI agents start calling outside services like OpenAI or Anthropic, you need to manage API keys carefully. Putting API keys directly into your app Pods creates a security risk. Every new deployment is one more place where a key can leak.

The better approach is to use the AI Gateway as the only place that holds and uses API keys. All outbound AI requests go through the gateway. The gateway adds the key for you. App code never sees a raw API key.

For industries with strict data rules, the gateway also removes personal data from requests before they leave your network. Things like patient names, ID numbers, and financial records get masked before the message reaches a cloud AI model. The answer comes back with the masked data restored  so the right people inside your company still see the full context.

In healthcare, tools like Microsoft's Nuance DAX Copilot use this to clean patient data from doctor-patient recordings before sending them to the AI, keeping everything inside HIPAA rules. In finance, HSBC's Dynamic Risk Assessment system, built with Google Cloud, analyzes transactions for suspicious patterns using strict data controls. Organizations running similar cloud AI models use AI Gateways to follow the audit and data boundary rules required by HIPAA, GDPR, and newer AI-specific regulations.

Platform Engineering: Making It Simple for Your Team

You cannot ask your data scientists and ML engineers to also write Kubernetes routing files, set up security filters, and manage rate limit configs. That is too much. When that work falls on the wrong people, things get set up inconsistently and security gaps appear.

The answer is Platform Engineering. Your platform team builds Kubernetes like a product. ML engineers get a simple self-service portal (tools like Backstage or Port are popular choices). When an ML engineer clicks 'Deploy New Model,' the portal runs a pipeline automatically, with ArgoCD or Flux handling the rollout. Here is what happens behind the scenes:

  • The model gets deployed and Gateway API routing rules are created
  • Crossplane sets up extra cloud resources, like a Milvus vector database for semantic caching
  • Default security rules are applied to block unknown outbound traffic

The ML engineer just gets a URL and a token budget. They do not touch any of the infrastructure.

Kyverno, a policy checker inside Kubernetes, runs these checks automatically before any new AI deployment goes live: Does this deployment have a token rate limit? Is this AI model approved for the data sensitivity level of this team? Is the PII masking pipeline configured (the same inline data scrubbing covered in the Security section above)? If any check fails, the deployment is blocked with a clear message, and the problem gets fixed before it ever reaches production.
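
A minimal sketch of one such check, assuming teams declare their token budget through a hypothetical ai.example.com/token-budget annotation (the annotation name and the choice to validate HTTPRoute objects are illustrative):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-token-budget
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-token-budget-annotation
    match:
      any:
      - resources:
          kinds:
          - HTTPRoute
    validate:
      message: "AI routes must declare a token budget via the ai.example.com/token-budget annotation."
      pattern:
        metadata:
          annotations:
            # "?*" means the annotation must exist and be non-empty
            ai.example.com/token-budget: "?*"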

From Simple Chatbots to AI Agents: MCP and A2A

The first version of enterprise AI was simple from a network point of view. A user sends a message. The model replies. The connection closes. Clean and predictable.

Things are now much more complex. AI agents in 2026 keep memory between sessions, write and run code, read from company databases, call external tools, and work together with other AI agents on long tasks. A single user request might trigger a chain of agents: one to fetch data, one to analyze it, one to write a report, all working at the same time without a human managing each step.

This changes what the network needs to handle. Simple request-response was not built for this. Two open standards now handle it in Kubernetes.

The Model Context Protocol (MCP)

The Model Context Protocol (MCP) is a standard that controls how an AI agent safely connects to outside tools and databases. Before MCP, every connection was built in a custom way. If the agent was attacked through prompt injection, it could take actions using the same permissions as the server it was running on.

The AI Gateway acts as an MCP proxy. It handles login using OAuth 2.0. It sends requests to the right tool server. It uses authorization rules written in Common Expression Language (CEL) to make sure each agent can only use the tools it is allowed to use. Each tool runs in its own container. The platform team can update one tool without touching the agent itself. Popular AI frameworks like LangGraph and CrewAI already support MCP connections that work with this setup.

The Agent-to-Agent (A2A) Protocol

MCP handles the connection between one agent and its tools. The Agent-to-Agent (A2A) protocol, supported by the Linux Foundation, handles the connection between multiple agents talking to each other.

A2A uses standard web technologies, JSON-RPC 2.0 and Server-Sent Events (SSE), to let agents share what they can do, hand off tasks to each other, and send back updates to the agent that gave them the task. Each agent keeps its own internal memory and tools private.

A simple way to think about it: MCP works like Layer 2 networking in that it gives an agent direct, close access to specific local resources, while A2A works like Layer 3 networking in that it routes tasks between agents across different parts of the system. This is just a comparison to help you picture it, not a technical protocol mapping.

| Topic | Model Context Protocol (MCP) | Agent-to-Agent (A2A) Protocol |
|---|---|---|
| What it does | Connects one agent to its tools | Connects agents to other agents |
| How it works | Agent talks to tool servers | Agents share tasks and updates |
| Network comparison | Like Layer 2 (direct local access) | Like Layer 3 (routes across domains) |
| What it uses | Context sharing, tool exposure | JSON-RPC 2.0, SSE, Agent Cards |
| In Kubernetes | Agent reads from a company database via AI Gateway | One agent hands a task off to a specialist agent |

Test It Locally Before Going to Production

The good news is that most of these Gateway API features can be tested on your own computer before you push anything to a cloud environment. You can use Minikube or kind on a Linux or Mac machine.

Here is a simple way to get started:

  • Install Minikube, then install the Gateway API CRDs by applying the standard CRD bundle from the upstream gateway-api releases (note that minikube addons enable ingress installs the retired Ingress-NGINX controller, not a Gateway API implementation)
  • Install the Envoy AI Gateway using its Helm chart from the official docs
  • Set up a small InferencePool that points to a local Ollama model or a test server
  • Add a rate limit policy and test what happens when you hit the limit using a curl loop

This gives you a fast way to check your config files, test routing rules, and make sure your 429 error messages are set up correctly, all before you go anywhere near a production environment.

One more thing: your regular app servers (Go, Node.js, and Python) keep running through normal HTTPRoute rules alongside AI workloads. The InferencePool only handles AI model traffic. Both can run on the same gateway at the same time without getting in each other's way.

Conclusion

The tools for building AI networking on Kubernetes are ready and working in production at real companies today. Teams that set up these foundations now will have a much easier time as AI workloads grow.

Here are three things you can start with:

  1. Move to the Gateway API. Use the Ingress2Gateway tool to convert your old NGINX rules. Run both systems at the same time while you test. Switch your DNS once everything checks out. After that, every feature in this article is available to you.
  2. Set up an InferencePool and try the EPP. Connect an Envoy AI Gateway to a local vLLM or Ollama instance. Check your Time to First Token (TTFT) with and without cache-aware routing. The numbers you get make it easy to explain the value to your team.
  3. Add token-based rate limits now. Stop limiting by request count. Switch to token-based limits using TokenRateLimitPolicy or BackendTrafficPolicy token rules. Set up your 429 error messages to return proper JSON with a retry-after header. Make sure your app teams have retry logic in place. This stops the kind of weekend billing problem described at the start of this article.

You do not need to build all of this at once. Start with the Gateway API move and add things step by step. The important thing is to get AI cost controls and security rules out of your app code and into the network layer, where your platform team can manage everything in one place.

Tags
Kubernetes Gateway API, AI Gateway, token-aware rate limiting, MCP, A2A