High-Performance AI on a Budget: Serving Quantized SLMs with ONNX Runtime on Kubernetes
Unlock high-performance AI inference without expensive GPUs. Learn to serve quantized Small Language Models (SLMs) using ONNX Runtime and Kubernetes on standard CPUs.
For the past few years, the narrative surrounding Artificial Intelligence has been dominated by a single piece of hardware: the GPU. From the scarcity of H100s to the skyrocketing costs of cloud instances, CTOs and developers often assume that deploying modern Large Language Models (LLMs) requires a massive investment in specialized silicon.
But the landscape is shifting. With the rise of Small Language Models (SLMs) such as Microsoft's Phi-3, Mistral 7B, and Llama 3 8B, combined with aggressive quantization techniques, the need for GPUs is diminishing for many inference tasks. By leveraging ONNX Runtime and the orchestration power of Kubernetes, enterprises can now achieve low-latency, high-throughput AI inference on standard CPU infrastructure.
In this guide, we will explore how to decouple your AI strategy from the GPU shortage, reduce your cloud spend significantly, and maintain high performance by serving quantized models on commodity hardware.
The Power of Less: SLMs and INT4 Quantization
The first step in moving away from GPUs is right-sizing the model. While GPT-4 class models are impressive, they are overkill for tasks like summarization, classification, or RAG (Retrieval-Augmented Generation) over corporate data. Small Language Models (SLMs), typically under 10 billion parameters, have achieved performance parity with older, much larger models.
However, running even an 8-billion-parameter model in standard 16-bit precision (FP16) requires roughly 16GB of memory for the weights alone, and streaming that much data on every token pushes the limits of commodity CPUs' memory bandwidth. This is where Quantization enters the picture.
Quantization involves mapping the model's weights from high-precision floating-point numbers (FP32 or FP16) to lower-precision integers (INT8 or INT4). This process offers two massive benefits:
- Reduced Memory Footprint: An INT4 quantized model is roughly 4x smaller than its FP16 counterpart. A 7B model can shrink to under 4GB of RAM.
- Faster Inference: CPUs handle low-precision integer math far more efficiently than floating-point math, especially when leveraging instruction sets built for it, such as AVX-512 VNNI or Intel AMX.
By converting a model to ONNX format with INT4 quantization, we shrink the volume of weight data that must stream through memory on every token. That eases the memory-bandwidth bottleneck that dominates LLM inference and brings it within reach of modern server-grade CPUs.
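To make this concrete, here is a minimal sketch using ONNX Runtime's quantization tooling. It shows the simpler dynamic INT8 path; INT4 weight-only quantization of transformer MatMuls is handled by newer ONNX Runtime tooling and by Microsoft's Olive, whose APIs differ. The file names below are placeholders.

from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization: weights are stored as INT8, activations are quantized on the fly
quantize_dynamic(
    model_input="phi-3-mini-4k-instruct-fp32.onnx",   # placeholder path to the exported FP32 model
    model_output="phi-3-mini-4k-instruct-int8.onnx",  # placeholder path for the quantized output
    weight_type=QuantType.QInt8,
)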
Optimizing Inference with ONNX Runtime
Once we have a quantized model, we need a runtime capable of executing it efficiently. ONNX Runtime (ORT) is a cross-platform inference engine focused on performance. Unlike standard PyTorch execution, ORT applies graph optimizations—such as node fusion and constant folding—before the model even runs.
For CPU inference, ONNX Runtime is particularly powerful because it can tap into hardware-specific acceleration libraries with little or no change to your application code. On Intel Xeon processors, for example, ORT can use the OpenVINO or oneDNN (formerly MKL-DNN) execution providers to exploit vector instructions such as AVX-512 and VNNI.
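As a quick illustration (a sketch, not required for the basic setup below), you can inspect which execution providers your ONNX Runtime build exposes and prefer a hardware-specific one when available. The OpenVINO and oneDNN providers ship in separate builds or packages, so a default pip install typically exposes only the CPU provider.

import onnxruntime as ort

available = ort.get_available_providers()
print(available)  # e.g. ['CPUExecutionProvider'] on the default CPU build

# Prefer OpenVINO or oneDNN (Dnnl) when the build includes them, else fall back to the default CPU provider
preferred = [p for p in ("OpenVINOExecutionProvider", "DnnlExecutionProvider") if p in available]
providers = preferred + ["CPUExecutionProvider"]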
Here is a simplified example of how you might load a quantized ONNX model in Python for inference:
import onnxruntime as ort
import numpy as np

# Configure session options for CPU performance
sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = 4  # Tune based on your CPU cores
sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL  # Full graph optimizations (node fusion, constant folding); this is the default level

# Load the quantized model
model_path = "./models/phi-3-mini-4k-instruct-int4.onnx"
session = ort.InferenceSession(model_path, sess_options, providers=['CPUExecutionProvider'])

# Prepare input: token IDs produced by your tokenizer (placeholder values shown here)
input_name = session.get_inputs()[0].name
input_tokens = np.array([[1, 887, 526, 263, 8444, 20255]], dtype=np.int64)
input_data = {input_name: input_tokens}

# Run inference
# Note: a full decoder-style LLM usually expects extra inputs (attention mask, KV cache);
# libraries such as onnxruntime-genai manage that generation loop for you.
result = session.run(None, input_data)

This setup allows you to run sophisticated AI models on standard cloud instances (like AWS c7i or Azure F-series) or even on-premise servers without a single GPU, often achieving sub-50ms latency for classification tasks.
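If you want to verify latency numbers on your own hardware, a simple sketch is to time repeated session.run calls using the session and input_data from the snippet above; the warm-up and run counts here are arbitrary.

import time

# Warm up so one-time allocations and lazy initialization don't skew the numbers
for _ in range(5):
    session.run(None, input_data)

# Average wall-clock latency over repeated runs
runs = 50
start = time.perf_counter()
for _ in range(runs):
    session.run(None, input_data)
elapsed_ms = (time.perf_counter() - start) / runs * 1000
print(f"Average latency: {elapsed_ms:.1f} ms")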
Scaling with Kubernetes: Architecture and Deployment
Running a model on a laptop is one thing; serving it to thousands of users is another. This is where Kubernetes (K8s) shines. Because CPU nodes are abundant and spin up faster than GPU instances, scaling becomes much more responsive.
When deploying CPU-based inference on K8s, resource management is critical. Unlike GPUs, where you usually dedicate one device per pod, CPUs allow for granular fractional usage. However, to avoid context switching overhead, it is often best to pin CPU cores to pods using the static CPU Manager policy in Kubernetes.
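As a reference point, CPU pinning is configured on the node rather than in the workload. Below is a minimal sketch of the relevant KubeletConfiguration fields; the reserved-CPU value is an assumption you would tune for your nodes. Keep in mind that only pods in the Guaranteed QoS class, with equal integer CPU requests and limits, receive exclusive cores under the static policy.

# Node-level kubelet configuration (sketch)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static        # default is "none"
reservedSystemCPUs: "0,1"       # assumption: keep two cores free for system daemons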
Below is an example of a Kubernetes Deployment manifest optimized for an ONNX-based inference service:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: onnx-slm-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: onnx-slm
  template:
    metadata:
      labels:
        app: onnx-slm
    spec:
      containers:
      - name: model-server
        image: nohatek/onnx-server:latest
        resources:
          limits:
            memory: "8Gi"
            cpu: "4"
          requests:
            memory: "4Gi"
            cpu: "2"
        env:
        - name: OMP_NUM_THREADS
          value: "4"
        - name: ORT_INTRA_OP_NUM_THREADS
          value: "4"

Key Architectural Considerations:
- Horizontal Pod Autoscaling (HPA): Configure HPA based on CPU utilization or custom metrics (like request queue depth); a sample manifest follows this list. Since CPU nodes provision quickly, your cluster can react to traffic spikes in seconds.
- Node Affinity: Use node affinity to schedule these pods on compute-optimized instances (e.g., Intel Xeon Scalable processors) to maximize the benefits of AVX-512 instructions.
- Cost Efficiency: By using Spot Instances for these stateless CPU workloads, you can reduce inference costs by up to 90% compared to on-demand GPU instances.
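As a sketch of the first point, here is what a CPU-utilization-based HPA for the Deployment above might look like; the replica bounds and the 70% target are assumptions to tune against your own latency targets.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: onnx-slm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: onnx-slm-service
  minReplicas: 3
  maxReplicas: 12          # assumption: upper bound for this example
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70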
The assumption that AI requires GPUs is a barrier to entry that many companies no longer need to face. By combining the efficiency of Small Language Models, the precision flexibility of quantization, and the raw speed of ONNX Runtime, IT leaders can deploy robust AI solutions on standard CPU infrastructure.
This approach not only democratizes access to AI but also aligns with sustainable IT practices by maximizing the utility of existing hardware. Whether you are building internal tools, customer-facing chatbots, or intelligent data processing pipelines, the CPU-based stack offers a compelling blend of performance, scalability, and cost-effectiveness.
Ready to optimize your AI infrastructure? At Nohatek, we specialize in high-performance cloud architecture and AI integration. Contact us today to learn how we can help you build scalable, cost-efficient AI solutions.