Scaling Beyond RAM: Architecting Low-Latency Disk-Based Vector Search for 100 Billion Embeddings

Learn how to overcome the RAM bottleneck in AI infrastructure. We explore architecting disk-based vector search for 100B+ embeddings using NVMe, DiskANN, and Kubernetes.

In the era of Retrieval-Augmented Generation (RAG) and massive Large Language Models (LLMs), data isn't just growing—it is exploding in dimensionality. For CTOs and systems architects, the challenge has shifted from simply storing data to retrieving it with millisecond latency to feed hungry AI models. The standard approach for vector search has long been HNSW (Hierarchical Navigable Small World) graphs entirely resident in RAM. This works beautifully for 10 million vectors. It works adequately for 100 million.

But what happens when you hit 100 billion embeddings?

At that scale, the "RAM-only" architecture hits a hard financial and physical wall. Storing 100 billion dense vectors in memory requires hundreds of terabytes of RAM, translating to an infrastructure bill that can bankrupt a project before it launches. The solution lies not in buying more RAM, but in smarter architecture: Disk-Based Vector Search orchestrated on Kubernetes. In this deep dive, we will explore how to leverage modern NVMe SSDs and algorithms like DiskANN to achieve RAM-like performance at a fraction of the cost, and how to manage this beast using Kubernetes.

The Math of the RAM Bottleneck

To understand why a shift to disk is inevitable for hyperscale AI, we have to look at the raw numbers. Let’s assume you are using OpenAI’s text-embedding-3-small (1536 dimensions) or an open-source equivalent. Storing a single uncompressed vector at 4-byte floating-point precision takes roughly 6KB (1536 dimensions × 4 bytes). That sounds negligible until you multiply it by 100 billion.

The Calculation: 100,000,000,000 vectors × 6KB ≈ 600 Terabytes of raw vector data.

Even with aggressive scalar quantization (reducing 4-byte floats to 1-byte integers), you are still looking at roughly 150 Terabytes of active memory. Attempting to keep this hot in RAM requires a fleet of high-memory instances (e.g., AWS r6i.metal or equivalent) that is operationally unwieldy and financially ruinous. Furthermore, for JVM-based in-memory databases, managing heaps of this magnitude leads to frequent Garbage Collection pauses and instability.
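
To make the arithmetic concrete, here is a quick back-of-the-envelope script. The 1536-dimension count matches the example above; the 64-byte Product Quantization code size is purely illustrative.

NUM_VECTORS = 100_000_000_000   # 100 billion embeddings
DIMENSIONS = 1536               # e.g., text-embedding-3-small

def footprint_tb(bytes_per_dimension: float) -> float:
    """Total footprint in terabytes for a given per-dimension encoding width."""
    return NUM_VECTORS * DIMENSIONS * bytes_per_dimension / 1e12

print(f"float32 (4 B/dim): {footprint_tb(4):,.0f} TB")               # ~614 TB of raw vectors
print(f"int8    (1 B/dim): {footprint_tb(1):,.0f} TB")               # ~154 TB after scalar quantization
print(f"PQ codes (64 B/vector): {NUM_VECTORS * 64 / 1e12:,.0f} TB")  # ~6 TB fits the in-RAM tier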

The bottleneck isn't just capacity; it's the cost-to-latency ratio. While RAM offers nanosecond access, modern NVMe SSDs offer microsecond access. The architectural challenge is bridging that gap so the end-user—or the LLM—doesn't notice the difference.

The Solution: DiskANN and NVMe Optimization

The industry is moving toward algorithms specifically designed to exploit the characteristics of modern Solid State Drives (SSDs), in particular the NVMe protocol's support for massively parallel I/O queues. The leading paradigm here is DiskANN (built on the Vamana graph), along with related approaches such as SPANN.

Unlike HNSW, which lives entirely in memory, disk-based algorithms use a hybrid approach:

  • Compressed Index in RAM: A highly compressed representation of the vectors (often Product Quantization codes) sits in memory and guides the graph traversal. This allows the search algorithm to quickly narrow down the neighborhood of the query vector.
  • Full Vectors on Disk: The full-precision vectors reside on the NVMe SSD. Once the candidate list is identified via the in-memory stage, the system fetches the actual data from the disk to perform the final re-ranking and distance calculation. A minimal sketch of this two-stage flow follows below.
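
As a toy illustration of that two-stage flow, the sketch below keeps full-precision vectors in a memory-mapped flat file (standing in for local NVMe) and uses simple int8 scalar quantization as a stand-in for the compressed PQ index that a real engine like DiskANN maintains; the file path and sizes are assumptions.

import numpy as np

DIM, N = 1536, 100_000                       # toy scale for illustration
rng = np.random.default_rng(0)

# "Disk" tier: full-precision vectors in a flat file (point this at local NVMe in production).
full = rng.standard_normal((N, DIM), dtype=np.float32)
full.tofile("/tmp/vectors.f32")              # hypothetical path
disk_vectors = np.memmap("/tmp/vectors.f32", dtype=np.float32, mode="r", shape=(N, DIM))

# "RAM" tier: a heavily compressed copy (int8 scalar quantization here; DiskANN keeps
# Product-Quantized codes plus a navigation graph instead).
scale = np.abs(full).max() / 127.0
compressed = np.round(full / scale).astype(np.int8)

def search(query: np.ndarray, top_k: int = 10, shortlist: int = 200) -> np.ndarray:
    # Stage 1: approximate scoring against the compressed in-RAM representation.
    approx_scores = compressed.astype(np.float32) @ query
    candidate_ids = np.argpartition(-approx_scores, shortlist)[:shortlist]
    # Stage 2: fetch only the shortlisted full-precision vectors from disk and re-rank exactly.
    exact_scores = disk_vectors[candidate_ids] @ query
    return candidate_ids[np.argsort(-exact_scores)[:top_k]]

print(search(rng.standard_normal(DIM, dtype=np.float32)))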

Because NVMe drives can handle hundreds of thousands of Input/Output Operations Per Second (IOPS), this "fetch" operation adds minimal latency. However, simply plugging in an SSD isn't enough. You must ensure your infrastructure utilizes asynchronous I/O (like Linux's io_uring) to prevent the CPU from waiting on disk reads. The goal is to saturate the bandwidth of the PCIe bus, turning your disk storage into "slow RAM" rather than "fast storage."
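
Python's standard library has no io_uring binding, so the sketch below approximates the idea with a pool of worker threads issuing positional pread calls to keep many requests in flight; a production engine would push these reads through io_uring (for example via liburing) or its own asynchronous I/O layer. The file path and vector width are assumptions.

import os
from concurrent.futures import ThreadPoolExecutor

VECTOR_BYTES = 1536 * 4          # one float32 vector (assumes 1536 dimensions)
QUEUE_DEPTH = 64                 # keep many reads in flight so the NVMe queues stay busy

def read_vector(fd: int, vector_id: int) -> bytes:
    # pread is thread-safe: each call carries its own offset, so no shared file position.
    return os.pread(fd, VECTOR_BYTES, vector_id * VECTOR_BYTES)

def fetch_candidates(path: str, ids: list[int]) -> list[bytes]:
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
            return list(pool.map(lambda i: read_vector(fd, i), ids))
    finally:
        os.close(fd)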

Orchestrating on Kubernetes: StatefulSets and Local PVs

Deploying this architecture on Kubernetes requires moving away from standard cloud-native storage patterns. Network-attached block storage (like EBS gp3 or Azure Disk) introduces network latency that kills vector search performance. For 100 billion vectors, you need Local NVMe SSDs physically attached to the compute nodes.

Here is the architectural blueprint for success:

  1. Use Local Persistent Volumes: Configure your Kubernetes cluster to use the Local Static Provisioner. This exposes the raw NVMe disks on the node directly to the pod, bypassing network storage overhead.
  2. Sharding with StatefulSets: You cannot store 100 billion vectors on one node; you must shard the index. Use Kubernetes StatefulSets to manage these shards. Each pod in the StatefulSet is responsible for a specific partition of the vector space (a query-routing sketch follows the manifest below).
  3. Node Affinity & Taints: Ensure your vector search pods land specifically on high-IOPS storage-optimized nodes (e.g., AWS i3en or i4i families) using nodeAffinity rules.

A minimal StatefulSet manifest tying these pieces together (the image name and mount path are placeholders):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vector-db-shard
spec:
  serviceName: vector-db               # headless Service giving each shard a stable network identity
  replicas: 8                          # one pod (ordinal) per index shard
  selector:
    matchLabels: { app: vector-db-shard }
  template:
    metadata:
      labels: { app: vector-db-shard }
    spec:
      containers:
      - name: vector-db
        image: your-vector-db:latest   # placeholder image for your search engine
        volumeMounts:
        - name: data
          mountPath: /var/lib/vectordb
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: local-storage  # StorageClass served by the Local Static Provisioner
      resources:
        requests:
          storage: 2Ti                 # each shard binds one local NVMe volume
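
Step 2 above also implies a thin routing layer in front of the shards: a query must be fanned out to every shard and the partial results merged. Here is a minimal scatter-gather sketch, assuming each shard pod exposes an HTTP /search endpoint behind the StatefulSet's headless Service; the URLs, payload shape, and response format are illustrative.

import heapq
from concurrent.futures import ThreadPoolExecutor

import requests  # assumes each shard exposes a simple HTTP search API

SHARDS = [f"http://vector-db-shard-{i}.vector-db:8080/search" for i in range(8)]

def search_all_shards(query: list[float], top_k: int = 10) -> list[dict]:
    """Fan the query out to every shard, then merge the partial top-k lists."""
    def query_shard(url: str) -> list[dict]:
        resp = requests.post(url, json={"vector": query, "k": top_k}, timeout=0.2)
        resp.raise_for_status()
        return resp.json()["hits"]        # each hit: {"id": ..., "score": ...}

    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(query_shard, SHARDS))

    # Global top-k across all shards, highest similarity score first.
    return heapq.nlargest(top_k, (hit for hits in partials for hit in hits),
                          key=lambda h: h["score"])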

The trade-off here is that local storage is ephemeral in the cloud sense—if a node dies, the data is lost. Therefore, your architecture must include a robust replication layer (application-level replication) or a rapid re-hydration mechanism from object storage (S3/GCS) to rebuild shards when a pod is rescheduled.
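
One way to script that re-hydration, assuming each shard's index files are mirrored to an S3 prefix keyed by the pod's ordinal; the bucket name, key layout, and local path are illustrative.

import os
import boto3

# Each StatefulSet pod sees its own name as the hostname (vector-db-shard-3 -> shard "3").
SHARD = os.environ.get("HOSTNAME", "vector-db-shard-0").rsplit("-", 1)[-1]
BUCKET = "my-vector-index-backups"   # illustrative bucket name
LOCAL_DIR = "/var/lib/vectordb"

def rehydrate_shard() -> None:
    """Pull this shard's index files from object storage onto the fresh local NVMe volume."""
    s3 = boto3.client("s3")
    prefix = f"shards/{SHARD}/"
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            dest = os.path.join(LOCAL_DIR, os.path.relpath(obj["Key"], prefix))
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            s3.download_file(BUCKET, obj["Key"], dest)

if __name__ == "__main__":
    rehydrate_shard()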

Tuning for Low Latency: The Last Mile

Once the infrastructure is in place, the difference between a 200ms query and a 20ms query lies in kernel and application tuning. When architecting for Nohatek clients, we focus on three specific areas of optimization:

  • Page Cache Management: While we want to bypass the OS cache for random reads to save RAM, using mmap intelligently can allow the OS to cache frequently accessed parts of the graph automatically. However, for strictly disk-based indices, Direct I/O (O_DIRECT) is often preferred to prevent cache thrashing (a minimal sketch follows this list).
  • Interrupt Balancing: On high-throughput nodes, the CPU can get overwhelmed handling hardware interrupts from the NVMe drive. Ensure irqbalance is configured correctly, or manually pin NVMe interrupts to specific CPU cores to avoid context switching overhead.
  • Queue Depth Saturation: NVMe drives perform best under load. Your vector database application should be configured to maintain a high queue depth. This means issuing parallel read requests rather than serial ones.
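
To make the O_DIRECT point concrete: Direct I/O demands block-aligned offsets, lengths, and buffers, which is why it is usually handled inside the database engine rather than in application code. A minimal Linux-only sketch, with a hypothetical file path:

import mmap
import os

BLOCK = 4096  # O_DIRECT requires offsets, lengths, and buffers aligned to the block size

def read_block_direct(path: str, block_index: int) -> bytes:
    """Read one block straight from the NVMe device, bypassing the page cache (Linux only)."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        buf = mmap.mmap(-1, BLOCK)            # anonymous mmap returns page-aligned memory
        os.preadv(fd, [buf], block_index * BLOCK)
        return bytes(buf)
    finally:
        os.close(fd)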

By treating the disk as a first-class citizen in the memory hierarchy, we can achieve 95% of the recall of a RAM-based system at roughly 10% of the infrastructure cost.

Scaling vector search to 100 billion embeddings forces a paradigm shift. It moves the conversation from "how much RAM do we have?" to "how efficiently can we drive I/O?" By combining the algorithmic efficiency of DiskANN with the orchestration power of Kubernetes and Local NVMe storage, enterprises can build AI infrastructure that is both performant and economically sustainable.

At Nohatek, we specialize in solving these exact high-scale architectural challenges. Whether you are building the next generation of RAG applications or migrating legacy search systems to the cloud, our team can help you navigate the complexities of Kubernetes and high-performance computing. Don't let infrastructure costs throttle your AI innovation.

Ready to architect for scale? Contact Nohatek today to discuss your vector search strategy.