The Inference Scheduler: Architecting High-Throughput LLM Serving with Continuous Batching and vLLM on Kubernetes

Unlock high-throughput LLM serving. Learn how to architect an inference scheduler using vLLM, continuous batching, and Kubernetes to maximize GPU ROI.

In the lifecycle of Generative AI adoption, "Day 1" is the triumph of training or fine-tuning a model that actually works. "Day 2" is the sobering realization of what it costs to run it. For CTOs and AI architects, the transition from a single data scientist's notebook to a production-grade inference service is where the real engineering challenges lie.

The economics of Large Language Models (LLMs) are dictated by GPU utilization. If your expensive H100s are sitting idle waiting for memory transfers or blocked by inefficient scheduling, your ROI plummets. The traditional HTTP request/response model doesn't translate well to the autoregressive nature of LLMs, where token generation is iterative and variable in length.

This post explores the architecture of a modern Inference Scheduler. We will dismantle the limitations of static batching and demonstrate how to build a high-throughput serving layer using vLLM, Continuous Batching, and Kubernetes. Whether you are building an internal enterprise assistant or a customer-facing RAG application, this architecture is key to scaling without breaking the bank.

The Bottleneck: Why Traditional Batching Fails LLMs

To understand the solution, we must first diagnose the problem with standard inference serving. In traditional deep learning (like ResNet for image classification), input sizes are fixed. You can stack 32 images into a batch, push them through the GPU, and get 32 results simultaneously.

LLMs are different. They are dynamic. One user might ask a simple question requiring a 50-token response, while another asks for a summary of a legal document requiring 2,000 tokens. If you use Static Batching (waiting for a bundle of requests to process together), you encounter two massive inefficiencies:

  • The Straggler Problem: The batch is only as fast as its longest sequence. If one request takes 5 seconds and the rest take 0.5 seconds, the GPU resources for the short requests are held hostage until the long one finishes.
  • Memory Fragmentation: LLMs rely on a Key-Value (KV) cache to avoid recomputing attention over earlier tokens. Standard implementations pre-allocate that cache for the maximum possible sequence length, which causes severe internal fragmentation: in practice, 60% to 80% of the memory reserved for the KV cache can go unused.
"In the world of LLM serving, memory bandwidth is usually the bottleneck, not compute. Wasted memory equals wasted throughput."

This inefficiency forces organizations to provision more GPUs than necessary, driving up cloud costs linearly with traffic. To solve this, we need a scheduler that understands the iteration-level mechanics of the transformer architecture.

The Solution: vLLM and Continuous Batching

Enter vLLM, an open-source library developed at UC Berkeley that has become the gold standard for high-performance inference. It introduces two critical concepts that redefine how we architect the inference scheduler: PagedAttention and Continuous Batching.

PagedAttention is to LLMs what Virtual Memory is to Operating Systems. Instead of allocating contiguous blocks of VRAM for the KV cache, vLLM breaks the cache into non-contiguous blocks. This allows the system to fill the GPU memory completely, eliminating fragmentation. Because memory is managed efficiently, we can fit significantly more concurrent requests (batch size) onto a single GPU.
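To build intuition for the paging analogy, the sketch below shows how KV-cache blocks can be handed out from a shared free pool instead of reserving one contiguous, max-length region per request. This is a simplified illustration only; the block size, class, and method names are invented and do not reflect vLLM's actual implementation.

# Simplified sketch of paged KV-cache allocation (illustrative only, not vLLM's code).
BLOCK_TOKENS = 16  # tokens stored per physical cache block (hypothetical value)

class PagedKVAllocator:
    def __init__(self, total_blocks: int):
        self.free_blocks = list(range(total_blocks))   # shared pool of physical block IDs
        self.block_tables: dict[str, list[int]] = {}   # request -> blocks, not necessarily contiguous
        self.token_counts: dict[str, int] = {}

    def append_token(self, request_id: str) -> None:
        """Reserve cache space for one more token, one small block at a time."""
        count = self.token_counts.get(request_id, 0)
        if count % BLOCK_TOKENS == 0:                  # current block is full (or this is the first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; request must wait or be preempted")
            block = self.free_blocks.pop()
            self.block_tables.setdefault(request_id, []).append(block)
        self.token_counts[request_id] = count + 1

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)

Because memory is claimed block by block and returned the moment a sequence finishes, the only waste left is the partially filled last block of each request.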

Continuous Batching (also known as iteration-level scheduling) creates a dynamic flow of requests. Unlike static batching, the scheduler doesn't wait for the entire batch to finish:

  1. The scheduler processes a set of sequences for one iteration (generating one token for each).
  2. If a sequence finishes (generates an EOS token), it is immediately evicted.
  3. A new request from the queue is inserted into that slot immediately for the next iteration.

This results in a GPU that is constantly crunching numbers, maximizing utilization and throughput. In benchmarks, vLLM often delivers 10x to 24x higher throughput compared to HuggingFace Transformers.
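The scheduling loop itself is conceptually small. The sketch below is a greatly simplified, hypothetical version of the idea (the engine object and its step method are assumptions, not vLLM's API): decode one step for everything in flight, evict finished sequences, and immediately backfill from the waiting queue.

# Conceptual sketch of iteration-level (continuous) batching; engine API and names are hypothetical.
from collections import deque

def serve_loop(engine, waiting: deque, max_running: int) -> None:
    running: list = []
    while running or waiting:
        # 1. Backfill: pull new requests into any free slots before the next iteration.
        while waiting and len(running) < max_running:
            running.append(waiting.popleft())

        # 2. One decode iteration: generate exactly one token for every running sequence.
        finished = engine.step(running)   # assumed to return sequences that emitted EOS or hit their limit

        # 3. Evict finished sequences immediately so their slots (and KV blocks) free up.
        running = [seq for seq in running if seq not in finished]

Combined with the paged allocator, this loop is what keeps the GPU saturated no matter how ragged the request lengths are.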

Architecting on Kubernetes: The Deployment Pattern

While vLLM handles the GPU mechanics, Kubernetes orchestrates the scale. Deploying this in a production environment requires a specific architecture to handle the asynchronous nature of the workload. Here is the recommended blueprint for a Nohatek-standard deployment:

1. The Pod Specification

Your deployment should run the vLLM OpenAI-compatible server. This allows you to swap out backend models without changing your frontend client code. Ensure you set resource limits correctly to grant exclusive GPU access.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-inference-vllm
  template:
    metadata:
      labels:
        app: llm-inference-vllm
    spec:
      containers:
      - name: vllm-server
        image: vllm/vllm-openai:latest
        ports:
        - containerPort: 8000  # default port of the OpenAI-compatible server
        resources:
          limits:
            nvidia.com/gpu: "1"  # exclusive GPU access via the NVIDIA device plugin
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: token
        args:
        - "--model"
        - "meta-llama/Meta-Llama-3-70B-Instruct"  # a 70B model usually needs multiple GPUs (--tensor-parallel-size)
        - "--max-num-batched-tokens"
        - "8192" # tuning parameter for throughput

2. Horizontal Pod Autoscaling (HPA) with Custom Metrics

Standard CPU/memory scaling doesn't work well for LLMs because the GPU appears "busy" even when the request queue is empty. Instead, use KEDA (Kubernetes Event-driven Autoscaling) or Prometheus custom metrics to scale on the signals below (a sample ScaledObject follows the list):

  • Pending Request Queue Depth: If requests are queuing up at the load balancer, add pods.
  • KV Cache Usage: vLLM exposes Prometheus metrics for GPU KV cache utilization. If cache usage consistently stays above 95%, the replica is saturated and you need more capacity.
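A minimal ScaledObject sketch, assuming KEDA's Prometheus scaler and that Prometheus scrapes vLLM's /metrics endpoint (which exposes gauges such as vllm:num_requests_waiting), might look like this; names, namespaces, and thresholds are placeholders to adapt to your environment:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-vllm-scaler
spec:
  scaleTargetRef:
    name: llm-inference-vllm        # the Deployment defined earlier
  minReplicaCount: 1
  maxReplicaCount: 8                # cap spend; adjust to your GPU quota
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090   # assumed Prometheus location
      query: sum(vllm:num_requests_waiting)                   # requests queued inside vLLM
      threshold: "10"               # add a replica when ~10+ requests are waiting per pod

Scaling on queue depth (rather than GPU utilization) adds replicas only when work is genuinely waiting, which matters given how long multi-gigabyte model weights take to load.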

3. The Gateway Layer

Place an ingress controller or API Gateway (like NGINX or Traefik) in front of your service. This layer handles authentication, rate limiting, and, crucially, timeout management. LLM requests can take many seconds to complete; configure your gateway timeouts around both the Time To First Token (TTFT) and the total generation time your SLA allows, especially when streaming responses.
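As a sketch, here is what those timeout settings look like with the widely used ingress-nginx controller; the hostname, service name, and timeout values are placeholders to size against your SLA:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-gateway
  annotations:
    # Give long generations room to finish; the ingress-nginx default of 60s is often too short.
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
  - host: llm.example.com              # placeholder hostname
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: llm-inference-vllm   # Service exposing the vLLM pods (assumed)
            port:
              number: 8000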

Building an inference scheduler is no longer about just wrapping a Python script in a Docker container. It requires a deep understanding of hardware utilization and distributed systems. By leveraging vLLM's PagedAttention for memory efficiency and Kubernetes for elastic scaling, organizations can turn a cost-prohibitive AI experiment into a sustainable, high-throughput production service.

The difference between a standard deployment and an optimized inference scheduler is often a 50-70% reduction in inference costs. As models grow larger, this efficiency isn't just a nice-to-have; it's a survival requirement.

Ready to optimize your AI infrastructure? At Nohatek, we specialize in architecting high-performance cloud environments for Generative AI. Whether you need to fine-tune Llama 3 or deploy a scalable RAG pipeline, our team is ready to help you build the future.