Self-Hosting at Scale: High-Throughput LLM Inference with vLLM, Ray Serve, and Kubernetes

Master the art of self-hosting LLMs. Learn how to architect high-throughput inference pipelines using vLLM, Ray Serve, and Kubernetes to reduce costs and latency.

In the initial phase of the Generative AI boom, the path of least resistance was obvious: hit an API. Providers like OpenAI and Anthropic offered the fastest route to an MVP. However, as enterprises move from proof-of-concept to production at scale, the calculus changes. The combination of unpredictable latency, skyrocketing token costs, and data sovereignty concerns has driven many CTOs and lead architects toward a new objective: self-hosting.

But self-hosting Large Language Models (LLMs) is not merely about dockerizing a model and throwing it onto a GPU. The challenge lies in inference efficiency. How do you maximize GPU utilization? How do you handle concurrent requests without queuing delays? How do you scale horizontally when traffic spikes?

This guide explores the current gold-standard architecture for high-throughput LLM inference: the triad of vLLM for memory-efficient serving, Ray Serve for distributed orchestration, and Kubernetes for infrastructure management. At Nohatek, we have helped numerous partners transition to this stack, unlocking performance gains of up to 24x compared to naive HuggingFace pipelines.

Video: "What is vLLM? Efficient AI Inference for Large Language Models" (IBM Technology)

The Business Case: Why Leave the API Garden?

Before diving into the code, it is vital to understand the architectural drivers. Why undertake the engineering overhead of managing your own AI infrastructure? The decision usually boils down to three factors:

  • Cost at Scale: While APIs are cheap for prototyping, the unit economics break down at high volume. Running a fine-tuned Llama 3 or Mixtral model on reserved GPU instances often yields a significantly lower cost-per-token than commercial APIs once you cross a certain utilization threshold.
  • Latency Control: Public APIs suffer from "noisy neighbor" problems. By controlling the stack, you can optimize for Time to First Token (TTFT) or total throughput depending on your specific application needs (e.g., chatbots vs. batch processing).
  • Data Privacy: For industries like healthcare, finance, and defense, sending PII or proprietary codebases to an external endpoint is a non-starter.
"The goal is not just to own the model, but to own the performance characteristics of that model."

To achieve this ownership, we need a stack that solves the primary bottleneck of LLM inference: KV Cache memory management.

The Engine: vLLM and PagedAttention

Naive inference implementations often suffer from memory fragmentation. LLMs generate tokens sequentially, and the Key-Value (KV) cache grows dynamically with each new token. Traditional serving frameworks reserve contiguous memory blocks sized for the maximum possible sequence length, leading to massive waste: in practice, 60-80% of the memory set aside for the KV cache often sits reserved but unused.
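
To make that waste concrete, here is a rough back-of-the-envelope calculation for a Llama-3-8B-class model (32 layers, 8 grouped-query KV heads, head dimension 128, FP16 values); the figures are illustrative, not benchmarks:

# Rough KV-cache sizing for a Llama-3-8B-style model (illustrative figures only).
layers, kv_heads, head_dim = 32, 8, 128   # Llama 3 8B architecture (grouped-query attention)
bytes_per_value = 2                        # FP16
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V tensors
print(f"{kv_per_token / 1024:.0f} KiB per token")                  # ~128 KiB

max_seq_len = 8192
print(f"{kv_per_token * max_seq_len / 1024**3:.1f} GiB per request")  # ~1.0 GiB
# Reserving that full gibibyte contiguously for every request, whether or not
# the sequence ever reaches 8K tokens, is where the fragmentation and waste come from.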

Enter vLLM.

vLLM revolutionized inference with an algorithm called PagedAttention. Borrowing from operating system concepts (virtual memory and paging), it stores the KV cache in small, fixed-size blocks that need not be contiguous in GPU memory. Because the system no longer reserves worst-case memory up front, it can batch significantly more requests together on the same hardware.
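
The mechanics are easiest to see with a toy model of the bookkeeping. The sketch below is a conceptual illustration only, not vLLM's actual implementation: each sequence owns a block table mapping logical token positions to whatever physical blocks happen to be free.

# Toy illustration of paged KV-cache bookkeeping (not vLLM's real code).
BLOCK_SIZE = 16  # tokens per block; 16 is vLLM's default block size

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # shared pool of physical blocks
        self.block_tables = {}                      # sequence id -> list of block ids

    def append_token(self, seq_id: str, position: int) -> int:
        """Return the physical block that will hold this token's K/V entries."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:              # current block is full, grab a new one
            table.append(self.free_blocks.pop())
        return table[position // BLOCK_SIZE]

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool for immediate reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=1024)
first_block = cache.append_token("request-1", position=0)

Because blocks are allocated on demand and handed back the moment a sequence finishes, memory is never held hostage by a worst-case length estimate.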

Key features of vLLM include:

  • Continuous Batching: Unlike static batching, which waits for every request in a batch to finish before admitting new work, continuous batching slots new requests into the running batch the moment earlier sequences complete.
  • Throughput: In benchmarks, vLLM often delivers 10-24x higher throughput than HuggingFace Transformers.
  • Simplicity: It provides an OpenAI-compatible API server out of the box.
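
Because that server speaks the OpenAI wire protocol, most existing client code can be repointed with nothing more than a base-URL change. Here is a minimal sketch using the official openai Python client; the endpoint URL and model name are placeholders for your own deployment:

# Querying a self-hosted vLLM OpenAI-compatible server with the standard openai client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # your vLLM server endpoint
    api_key="not-needed",                  # any string works unless the server enforces a key
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)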

The Orchestrator: Ray Serve on Kubernetes

While vLLM handles single-node performance, we need a layer that handles scaling across multiple nodes and GPUs. This is where Ray Serve shines. Ray provides a Python-native distributed computing framework that pairs perfectly with Kubernetes via the KubeRay operator.

Ray Serve allows us to wrap the vLLM engine in a "Deployment" that can autoscale based on queue depth or GPU utilization. It handles the complex routing and load balancing between the HTTP ingress and the model replicas.
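
As a sketch of what that looks like (assuming a recent Ray release; the autoscaling parameter names have shifted slightly between versions), a queue-depth-driven deployment can be declared like this:

from ray import serve

@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 8,
        # Scale out when in-flight requests per replica exceed this target;
        # older Ray releases call it target_num_ongoing_requests_per_replica.
        "target_ongoing_requests": 16,
    },
)
class AutoscalingLLM:
    async def __call__(self, request):
        ...  # forward to the vLLM engine, as in the implementation snippet below

Ray Serve then adds or removes replicas to keep per-replica load near the target, while KubeRay provisions or reclaims the underlying GPU pods.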

Architectural Pattern

A typical production setup involves:

  1. Head Node: Manages the cluster state and autoscaling logic.
  2. Worker Nodes: GPU-accelerated pods where vLLM runs.
  3. Ingress: Ray Serve exposes a single endpoint that routes traffic to available workers.

Implementation Snippet

Here is a simplified example of how you might wrap vLLM within Ray Serve:

from ray import serve
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams
import uuid

@serve.deployment(ray_actor_options={"num_gpus": 1})
class VLLMDeployment:
    def __init__(self, model: str):
        # AsyncEngineArgs mirrors EngineArgs but targets the asynchronous engine
        args = AsyncEngineArgs(model=model)
        self.engine = AsyncLLMEngine.from_engine_args(args)

    async def __call__(self, request):
        # Parse the JSON body, submit it to the engine, and await the final output
        body = await request.json()
        params = SamplingParams(max_tokens=256)
        request_id = str(uuid.uuid4())
        final_output = None
        async for output in self.engine.generate(body["text"], params, request_id):
            final_output = output
        return {"text": final_output.outputs[0].text}

# Deploy the application
app = VLLMDeployment.bind(model="meta-llama/Meta-Llama-3-8B-Instruct")
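
To smoke-test this locally before handing it to KubeRay, you can start it with Ray Serve's Python API (a minimal sketch; in production the RayService custom resource typically manages this lifecycle):

from ray import serve

# Start Serve on the local Ray cluster and expose the deployment over HTTP.
serve.run(app, name="llm-app", route_prefix="/generate")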

By deploying this on Kubernetes using the KubeRay operator, you gain the ability to define distinct node pools. For example, you can keep your Ray Head node on a cheap CPU instance while your Worker nodes scale up and down on Spot GPU instances (e.g., NVIDIA A100s or L4s) to optimize costs further.

Optimizing for Production: Metrics that Matter

Once deployed, the work shifts to tuning. When architecting for scale, you must distinguish between two competing metrics:

  • Time to First Token (TTFT): How long the user waits before seeing the first word. Crucial for real-time chatbots.
  • Time Per Output Token (TPOT): The average time to generate each subsequent token, i.e., how quickly the text streams once it has started.
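
A quick way to see both numbers for your own endpoint is to stream a completion and time the chunks. Here is a rough sketch against an OpenAI-compatible server (URL and model are placeholders, and chunk counts only approximate token counts):

# Measure TTFT and TPOT by streaming tokens from an OpenAI-compatible endpoint.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

start = time.perf_counter()
first_token_at = None
num_chunks = 0

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain continuous batching briefly."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        num_chunks += 1
end = time.perf_counter()

print(f"TTFT: {first_token_at - start:.3f}s")
print(f"TPOT: {(end - first_token_at) / max(num_chunks - 1, 1):.4f}s per token")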

Tuning Strategies:

If your goal is maximum throughput (batch processing), you should raise max_num_seqs in vLLM so that continuous batching can fill GPU memory. This might slightly increase TTFT but will process more tokens per second overall.

If your goal is latency (interactive chat), you might restrict the batch size or use Ray Serve to provision more replicas to keep queue depths near zero.
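
In vLLM these trade-offs map onto a handful of engine arguments. The values below are illustrative starting points, not tuned recommendations:

from vllm import AsyncEngineArgs

# Throughput-oriented: pack as many sequences as possible into GPU memory.
throughput_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_num_seqs=256,             # allow large continuous batches
    gpu_memory_utilization=0.95,  # leave little headroom, maximize KV-cache space
)

# Latency-oriented: keep per-replica batches small and scale out with more replicas.
latency_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_num_seqs=32,
    gpu_memory_utilization=0.90,
)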

Furthermore, integrating Prometheus with Ray allows you to visualize these metrics in Grafana. You should set up alerts for GPU Memory Utilization and Ray Actor Queue Depth. If the queue depth spikes, your KubeRay configuration should trigger the provisioning of a new GPU node automatically.

Self-hosting LLMs is no longer a dark art; it is a necessary evolution for enterprises serious about AI integration. By leveraging vLLM for memory efficiency, Ray Serve for distributed orchestration, and Kubernetes for resilient infrastructure, you can build an inference engine that rivals commercial APIs in performance while maintaining complete control over your data and costs.

However, architecting this stack requires deep expertise in both MLOps and cloud infrastructure. At Nohatek, we specialize in building high-performance, self-hosted AI solutions. Whether you need to fine-tune a model or deploy a global inference cluster, our team is ready to help you architect the future.