The same storage model that makes a million tenants cheap can fall apart on a hundred-billion-vector index.
I. The Storage Innovation That Changed Everything
Search infrastructure used to bundle storage, coordination, and query serving into one machine model.
Blob storage broke that bundle.
Not in one shot. But over a few years, the assumptions changed enough that a large class of search systems could be built very differently.
Let's look at the three dates that matter here.
2006 — S3 was launched. Cheap, durable, infinitely scalable object storage. But it only supported eventual consistency: you write an object, read it back, and you might get stale data. That made it useless for anything that requires coordination. Every serious system treated S3 as a backup tier, while the real work happened on local disks managed by stateful clusters running consensus protocols.
December 2020 — S3 gets strong read-after-write consistency. Quietly, with no fanfare, AWS removes the main reason S3 could not sit close to the transactional edge of a search system. Now when you write, any subsequent read is guaranteed to see the latest version. The data engineering world notices this. The search world, not yet.
November 2024 — S3 gets conditional writes. If-Match and If-None-Match headers give you compare-and-swap on object storage. Now you can atomically update an S3 object only if nobody else changed it since you last read it. For single-writer namespaces, this one feature removed a surprising amount of coordination machinery.
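To make the compare-and-swap semantics concrete, here is a toy in-memory model of conditional writes. It sketches only the If-Match / If-None-Match behavior; the class and method names are hypothetical, and real code would go through an S3 client rather than a dict.

```python
# Toy in-memory model of S3-style conditional writes.
# Hypothetical names; illustrates the semantics, not a real client.
import hashlib

class PreconditionFailed(Exception):
    pass

class ToyObjectStore:
    def __init__(self):
        self._objects = {}  # key -> (etag, body)

    def put(self, key, body, if_match=None, if_none_match=None):
        current = self._objects.get(key)
        # If-None-Match="*": create only; fail if the object already exists.
        if if_none_match == "*" and current is not None:
            raise PreconditionFailed(key)
        # If-Match=<etag>: overwrite only if nobody changed it since we read it.
        if if_match is not None and (current is None or current[0] != if_match):
            raise PreconditionFailed(key)
        etag = hashlib.md5(body).hexdigest()
        self._objects[key] = (etag, body)
        return etag

store = ToyObjectStore()
etag = store.put("tenant-abc/manifest.json", b"v1", if_none_match="*")
etag = store.put("tenant-abc/manifest.json", b"v2", if_match=etag)  # CAS succeeds
try:
    store.put("tenant-abc/manifest.json", b"v3", if_match="stale-etag")
except PreconditionFailed:
    lost_race = True  # a competing writer sees this, re-reads, and retries
```

This single-object CAS is exactly enough to protect a per-namespace manifest with one writer, and exactly not enough for multi-key transactions, which is the fine print covered later.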
The consequence: a large class of search workloads no longer needed a fully stateful cluster that bundles storage, coordination, and query serving into one machine model. Blob stores handle durability. Strong reads handle freshness. Conditional writes handle a narrow but useful form of coordination. The application layer can become much more stateless than older search systems assumed.
Side note - while S3 only added conditional writes in 2024, Azure Blob Storage had them for years. I think it is the timing with GenAI (especially vector workloads), plus the distribution S3 has, that let folks iterate faster once S3 unlocked this.
What looks like a storage improvement is actually an architectural unlock, and it changed how search systems could be built.
It also split search into two problems that look like similar scale problems from the outside but behave very differently underneath. They are worth a deep dive.
II. The Zero-Cost Tenant
Here is the number that exposes the shift.
What does it cost to create one more search tenant?
In a traditional stateful search cluster, the answer is never zero. Let's click through why.
Option A: Index per tenant
Each index = primary shard + replica (minimum)
Each shard = Lucene index = file handles + memory-mapped segments
Cluster state grows per shard (held in RAM on every node)
At 10,000 tenants → 20,000 shards → cluster state bloats,
master node under pressure, recovery takes minutes
Marginal cost: ~$0.50-2.00/month/tenant in cluster overhead
Option B: Shared index with tenant_id filter
Cheaper, but noisy neighbors (one tenant's bulk ingest slows all),
can't delete a tenant without reindexing, merge pressure from
mixed-tenant segments
Marginal cost: lower, but operational complexity scales with tenants
Now consider what happens when S3 is the storage layer and compute is stateless:
Creating a tenant = writing a first WAL file to a new S3 prefix
s3://bucket/tenant-abc/wal/001.wal
No cluster state updated. No shard allocated. No memory consumed.
The namespace sits dormant until queried. Zero cost at rest.
Marginal cost: one S3 PUT ($0.000005) + storage ($0.023/GB/month)
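A quick back-of-envelope with the figures above. The per-PUT and per-GB prices are the ones quoted; the ~30MB average tenant and the $0.50/tenant overhead (low end of the stateful range) are illustrative assumptions.

```python
# Back-of-envelope tenant economics at a million tenants.
PUT_COST = 0.000005            # $ per S3 PUT (tenant creation)
STORAGE_PER_GB_MONTH = 0.023   # $ per GB-month, S3 standard

def blob_tenant_monthly_cost(tenant_gb):
    # A dormant tenant costs storage only; creation was a single PUT.
    return tenant_gb * STORAGE_PER_GB_MONTH

tenants = 1_000_000
avg_gb = 0.03                                            # assumed ~30MB/tenant
blob_bill = tenants * blob_tenant_monthly_cost(avg_gb)   # ~$690/month
stateful_bill = tenants * 0.50                           # low end of $0.50-2.00
```

Even at the low end of the stateful overhead range, the gap is roughly three orders of magnitude, which is what turns per-tenant isolation from a cost decision into a product decision.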
At a million tenants, the earlier architecture becomes either a maintenance crisis or an infrastructure tax. The blob-storage-native one is mostly a cheap storage bill.
This isn't just a cost difference. It changes the product shape. When tenant creation is cheap enough, you can give every workspace, repository, project, or user its own isolated search index. Features that were hard to justify on stateful clusters - per-user RAG, per-workspace semantic search, per-conversation retrieval - just became much easier to ship.
III. Two Problems, Not One
Here is the part that gets missed. The same blob-storage-native architecture that makes a million tenants cheap has the opposite requirements from making a hundred billion vectors searchable.
Let me build this up with some assumptions. Stay with me through the numbers.
Problem 1: A Million Tiny Tenants
Shape: 1,000,000 namespaces, each 1K-100K documents (~3MB-300MB)
Total: 3TB-300TB across all tenants
Active: Only 1-5% queried in any hour
Per-query: ~3MB-30MB (one tenant's data)
Cache: 50K active tenants × 30MB = 1.5TB → fits on NVMe
Hit rate: 95%+
Miss cost: fetch 30MB from S3 ≈ 50ms (acceptable for SaaS)
S3 is the source of truth and an active read path. Cold tenants cost only storage. Cache eviction is by namespace - tenant goes inactive, namespace gets evicted, nobody notices.
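The working-set arithmetic for this shape, using the numbers above (the 1ms NVMe-hit latency is an assumption; the rest are the figures listed):

```python
# Cache sizing and expected latency for the million-tenant shape.
active_tenants = 50_000
tenant_mb = 30
cache_tb = active_tenants * tenant_mb / 1_000_000    # 1.5 TB -> fits on NVMe

hit_rate = 0.95
hit_ms, miss_ms = 1, 50            # assumed NVMe hit vs ~30MB fetch from S3
expected_ms = hit_rate * hit_ms + (1 - hit_rate) * miss_ms   # ~3.5ms average
```

The average stays in single-digit milliseconds even with a 5% miss rate, which is why S3 can sit directly on this read path.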
Problem 2: A Hundred Billion Vectors in One Index
Shape: 1 namespace, 100B vectors × 1024 dims × 2 bytes = 200 TiB
Access: Uniform — any vector can be queried
Per-query: 500 clusters × 100 vectors × 2KB = 100MB
(spread across the ENTIRE dataset)
Cache: Must hold the ENTIRE quantized index on NVMe
200 TiB (or ~12 TiB quantized) → cluster of storage-dense machines
Hit rate: Must be ~100% — any miss hits S3
Miss cost: fetch 200MB partition from S3 ≈ 285ms per partition
× multiple partitions = seconds (catastrophic)
At this scale, blob storage is for durability only. It should stay off the query hot path. The working index must be pinned in NVMe or DRAM. Cache eviction becomes the enemy.
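The sizing math behind those numbers, in decimal TB (100B × 1024 dims × 2 bytes comes out to ~205 TB; the per-node NVMe capacity below is an illustrative assumption from the storage-dense range):

```python
# Index sizing for Problem 2. Per-node capacity is an assumption.
vectors = 100_000_000_000
dims, bytes_per_dim = 1024, 2                        # fp16 vectors

raw_tb = vectors * dims * bytes_per_dim / 1e12       # ~205 TB raw
quantized_tb = raw_tb / 16                           # 1 bit/dim -> ~12.8 TB
nvme_per_node_tb = 10
nodes_needed = -(-quantized_tb // nvme_per_node_tb)  # ceiling division
```

The quantized index fits on a handful of storage-dense machines; the raw index does not fit anywhere affordable. That asymmetry is the subject of the next section.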
The Architectural Mismatch
| Dimension | Million Tenants | Billion Vectors |
|---|---|---|
| S3 role | Active read path | Durability only |
| Cache strategy | LRU eviction by namespace | Pin everything, evict nothing |
| Cold tolerance | High (50-200ms fine) | Zero (must avoid S3 reads) |
| Distribution reason | Availability | Capacity |
| Compression priority | Low (data is small) | Critical (must fit in NVMe) |
| Cost driver | Number of namespaces | NVMe capacity |
One API surface. Two fundamentally different machines underneath. Any system trying to serve both with one architecture has to give something up somewhere, usually on multi-tenant cost, large-index performance, or operational simplicity.
IV. Why Vectors Broke the Old Rules
To understand why these two problems diverge so violently, you need to understand why vector search is a different beast from text search - and why the algorithms that won the RAM era are being challenged in the disk era.
Text Search Is Sparse; Vector Search Is Dense
A text query for "rust programming" reads two postings lists from an inverted index. Each is kilobytes to low megabytes. Even from S3, that's single-digit milliseconds. The per-query data appetite is small.
A vector query for [0.12, -0.34, 0.56, ...] must scan a fixed fraction of the entire index - determined by the nprobe parameter, typically 500 clusters of 100 vectors each. That's 100MB of data per tree level, regardless of the query. There is no "cheap vector query."
Per-query data appetite:
Text: ~100KB of postings lists
Vector: ~100MB of cluster data (per level)
Difference: 1,000x
This 1,000x gap is why text search tolerates blob storage latency gracefully and vector search does not. It is also why the architecture that works for per-workspace search in SaaS does not automatically work for large global vector retrieval.
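The appetite gap as arithmetic. The ~50KB per postings list is an assumed midpoint of the kilobytes-to-low-megabytes range; the vector figures are the nprobe math above.

```python
# Per-query bytes read: text vs vector.
text_bytes = 2 * 50 * 1024           # two postings lists, ~50KB each (assumed)
vector_bytes = 500 * 100 * 2048      # 500 clusters x 100 vectors x 2KB
ratio = vector_bytes / text_bytes    # ~1,000x
```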
HNSW vs IVF ≈ RAM vs Disk
HNSW (Hierarchical Navigable Small World) graphs are brilliant when the index fits in RAM. Each query traverses a graph, hopping from node to node, and each hop is a random memory access. On NVMe, that's 50-100μs per hop. On S3, it's 10-200ms. A single query doing 100-300 hops would take seconds from disk. HNSW's access pattern is adversarial to every level of the memory hierarchy below RAM.
IVF (Inverted File Index) and its descendants (SPFresh, IVF-PQ, DiskANN) work differently. They cluster vectors by proximity, then load contiguous partitions during search. Each partition is a sequential read - exactly what disks, SSDs, and object storage are optimized for.
HNSW: 100-300 random reads per query → great for RAM, terrible for disk
IVF: nprobe sequential reads per query → designed for the memory hierarchy
This is a real shift in what wins. HNSW was the right algorithm when indexes fit in RAM. For billion-scale indexes that live on NVMe or spill toward blob storage, IVF-family algorithms fit the memory hierarchy much better. Systems built around HNSW start running into a ceiling once indexes stop fitting comfortably in RAM.
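A crude latency model makes the asymmetry visible. The hop counts and per-tier latencies are the essay's ranges; the midpoints chosen here are assumptions.

```python
# HNSW: dependent random reads, so per-access latency dominates.
# IVF: independent sequential reads, so bandwidth dominates.
def hnsw_query_ms(hops, hop_latency_us):
    return hops * hop_latency_us / 1000

def ivf_query_ms(nprobe, partition_mb, bandwidth_gb_s):
    return nprobe * partition_mb / bandwidth_gb_s  # MB / (GB/s) = ms

hnsw_ram  = hnsw_query_ms(200, 0.1)       # ~100ns DRAM hops -> ~0.02ms
hnsw_nvme = hnsw_query_ms(200, 75)        # 50-100us NVMe hops -> ~15ms
hnsw_s3   = hnsw_query_ms(200, 100_000)   # ~100ms S3 hops -> ~20s
ivf_nvme  = ivf_query_ms(500, 0.2, 10)    # 100MB sequential at 10 GB/s -> ~10ms
```

The point is not the exact numbers but the shape: HNSW degrades with the latency of each tier, IVF with its bandwidth, and bandwidth degrades far more gracefully down the hierarchy.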
Side note - the IVF strategy has its own challenges. As the index grows, the centroids become stale and need to be updated. There are multiple techniques to help there, but it is non-trivial work.
Quantization Is Not Optional
At 100 billion vectors × 1024 dimensions × 2 bytes, the raw data is 200 TiB. No single machine, and no affordable cluster, can keep this in DRAM.
Binary quantization (techniques like RaBitQ) compresses each dimension from 16 bits to 1 bit - a 16x reduction. The quantized index drops from 200 TiB to ~12 TiB. Now it fits on a modest cluster of storage-dense machines (each with 10-40 TiB of NVMe).
The recall cost in production is small: binary quantization exploits concentration of measure in high-dimensional space to provide tight error bounds on distance estimates from quantized vectors. In practice, less than 1% of vectors need reranking with full-precision data to maintain high recall.
Quantization is what makes the memory hierarchy math work. Without it, billion-scale vector search requires either absurd amounts of RAM or accepts catastrophic latency from disk. With it, the compressed index lives in NVMe (or even DRAM for the upper tree levels), and only a tiny fraction of full-precision data is fetched from the next tier down.
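A minimal sketch of the two-stage idea: quantize to sign bits, rank everything by Hamming distance on the 1-bit codes, then rerank a small fraction with full-precision distances. This illustrates the general pattern, not RaBitQ specifically; all names are hypothetical.

```python
# Sign-bit binary quantization with full-precision reranking (toy sketch).
import math, random

def quantize(vec):
    """1 bit per dimension: the sign bit, packed into a Python int."""
    code = 0
    for x in vec:
        code = (code << 1) | (x >= 0.0)
    return code

def hamming(a, b):
    return bin(a ^ b).count("1")  # cheap distance proxy on quantized codes

def search(query, vectors, codes, rerank_frac=0.01):
    q = quantize(query)
    # Stage 1: rank all candidates by Hamming distance on 1-bit codes.
    order = sorted(range(len(vectors)), key=lambda i: hamming(q, codes[i]))
    # Stage 2: rerank only the top fraction with full-precision distances.
    k = max(1, int(len(vectors) * rerank_frac))
    return min(order[:k], key=lambda i: math.dist(query, vectors[i]))

random.seed(0)
dims = 64
vectors = [[random.gauss(0, 1) for _ in range(dims)] for _ in range(1000)]
codes = [quantize(v) for v in vectors]
query = vectors[123]  # querying with a stored vector should return itself
best = search(query, vectors, codes, rerank_frac=0.05)
```

Stage 1 touches only 1/16th of the bytes, so it can live a tier higher in the hierarchy; stage 2 touches full-precision data for only a few percent of candidates, so it can afford to live a tier lower.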
V. How Large Vector Search Systems May Actually Work
This blob-storage-native approach is no longer theoretical. Several systems (LanceDB, Turbopuffer and others) have demonstrated parts of the pattern.
The details below draw on these general patterns. The key insight is that vector search is bandwidth-bound, not compute-bound. The kernel of a vector search is a dot product: each data element is used exactly once. The arithmetic intensity is low. Most time is spent fetching data, not computing on it. Once you accept this, the entire design follows from one principle: match data placement to the memory hierarchy.
The Hierarchy
Tier Size Bandwidth What lives here
──── ──── ───────── ───────────────
CPU L3 Cache ~500 MB ~600 GB/s Quantized upper centroids
DRAM ~hundreds GB ~300 GB/s Quantized leaf centroids
NVMe SSD ~tens TB ~10 GB/s Full-precision vectors
S3 Unlimited ~1-10 GB/s Durability (never query path)
A hierarchical IVF tree with a branching factor of ~100 - roughly matching the DRAM-to-SSD size ratio - naturally places data at the right tier:
- Upper centroid levels (small, accessed every query) → can stay in L3 cache
- Leaf centroids (medium, accessed every query) → push into DRAM
- Full-precision data (huge, accessed only for <1% reranking) → NVMe SSD
- S3 → cold storage and durability, never on the hot query path
The Compression Multiplier
Without quantization, scanning 500 clusters of full-precision vectors costs 100MB per tree level. At 10 GB/s NVMe bandwidth, that caps throughput at ~100 QPS.
With binary quantization (16x compression), scanning the same clusters costs 6MB per tree level. But now the quantized data can fit higher in the hierarchy: the upper levels fit in L3 cache (600 GB/s), leaf centroids in DRAM (300 GB/s), and only the <1% reranking fraction hits NVMe.
| Layer | Without Quantization | With Quantization |
|---|---|---|
| Upper centroids | DRAM, ~1,000 QPS | L3 cache, ~33,000 QPS |
| Leaf data | NVMe, ~100 QPS ← bottleneck | DRAM, ~50,000 QPS |
| Reranking (1%) | low/variable | NVMe, ~10,000 QPS |
Compression here doesn't just save space, it also shifts data up the hierarchy, and that is what makes the whole setup viable. The theoretical ceiling goes from 100 QPS to 10,000 QPS. In production, compute becomes the new bottleneck (since binary quantization's bit operations are 64x more arithmetic-intensive), and further performance comes from SIMD instructions like AVX-512 VPOPCNTDQ.
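The ceilings in the table are just tier bandwidth divided by bytes scanned per query:

```python
# QPS ceiling = tier bandwidth / bytes each query must scan at that tier.
def qps_ceiling(bandwidth_gb_s, per_query_mb):
    return bandwidth_gb_s * 1000 / per_query_mb

no_quant   = qps_ceiling(10, 100)   # full-precision leaf scan on NVMe -> 100
with_quant = qps_ceiling(300, 6)    # quantized leaf scan in DRAM -> 50,000
```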
Distribution: As Simple as Possible
200 TiB does not fit on one machine. The solution is deliberately simple: randomly shard vectors across nodes, broadcast queries to all shards, then merge top-k results. But this works only when single-machine efficiency is maximized first. Distribution scales cost linearly, and every inefficiency on one machine multiplies across the cluster. The design principle: exhaust single-node physics before distributing, then distribute with the simplest possible scheme.
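The scheme is simple enough to sketch end to end. Shard search is stubbed out with exact distances here; in a real system each shard would run the IVF scan described above. All names are illustrative.

```python
# Random sharding + broadcast + global top-k merge (toy sketch).
import heapq, math, random

def search_shard(query, shard, k):
    """Each node returns its local top-k as (distance, vector_id) pairs."""
    scored = [(math.dist(query, vec), vid) for vid, vec in shard]
    return heapq.nsmallest(k, scored)

def broadcast_search(query, shards, k):
    partials = [search_shard(query, shard, k) for shard in shards]  # fan-out
    return heapq.nsmallest(k, heapq.merge(*partials))               # merge

random.seed(1)
data = [(vid, [random.gauss(0, 1) for _ in range(8)]) for vid in range(300)]
random.shuffle(data)                      # random placement: no semantic shards
shards = [data[i::3] for i in range(3)]   # 3 "nodes"
query = [0.0] * 8
top5 = broadcast_search(query, shards, k=5)
```

Because each shard's local top-k is a superset filter for the global top-k, the merge is exact; the price is that every query touches every node, which is fine only because per-node work was already minimized.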
VI. Where This Breaks
This blob-storage-native pattern has real limits, and understanding those limits matters.
The Cache Cliff
For the million-tenant workload, cache misses are tolerable - a 50ms S3 fetch for a small namespace is fine for SaaS response times. For the large vector workload, the entire quantized index must be pinned on NVMe. The moment any meaningful fraction falls to S3, performance degrades not by 2x but by 50-200x:
Fully cached (NVMe): Quantized scan from DRAM at 300 GB/s → ~10ms
50% cached: Half the partitions fetched from S3 at ~700 MB/s
285ms per cold partition × multiple partitions
→ 500ms-2s per query
The gap between "everything cached" and "half cached" is catastrophic.
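A small model of the cliff, using the 285ms cold-partition figure above (the 5-partition fan-out and 10ms warm baseline are assumptions):

```python
# Expected query latency vs. fraction of partitions cached on NVMe.
def query_ms(cached_frac, partitions=5, warm_ms=10, cold_partition_ms=285):
    cold_partitions = partitions * (1 - cached_frac)
    return warm_ms + cold_partitions * cold_partition_ms

fully_cached = query_ms(1.0)   # 10ms
half_cached  = query_ms(0.5)   # 722.5ms: not 2x slower, ~70x slower
```

The degradation is linear in miss rate but the slope is brutal, which is why this workload pins rather than evicts.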
This means the billion-vector workload has a hard NVMe cost floor. At ~$100/TB/month for storage-dense instances, 12 TiB of quantized index costs ~$1,200/month just for SSD capacity - before compute, before network, before redundancy. Compare to S3 at $0.023/GB for the same data: ~$280/month. The 4x cost difference is the price of not tolerating cache misses.
Write Latency
Blob storage roundtrips cost 200-300ms. A WAL commit to blob storage is fundamentally slower than a WAL commit to local NVMe (5-20ms). Workloads requiring sub-second write visibility per namespace are poorly served. For many SaaS and RAG use cases, batching writes (1 commit/second/namespace, up to 10K+ vectors per batch) is acceptable. For real-time log ingestion, it is not. Solving for it requires non-trivial work.
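The batching idea is mechanical: accumulate writes per namespace and commit once per interval, so one blob-storage PUT amortizes many writes. A toy, simulated-clock sketch (all names hypothetical):

```python
# Amortize one blob-storage commit per interval over many writes (toy sketch).
class WalBatcher:
    def __init__(self, commit_interval_s=1.0):
        self.commit_interval_s = commit_interval_s
        self.pending = []            # writes not yet durable
        self.committed_batches = []  # each batch = one PUT in a real system
        self.last_commit = 0.0

    def write(self, doc, now):
        self.pending.append(doc)
        # One PUT per interval instead of one PUT per write.
        if now - self.last_commit >= self.commit_interval_s:
            self.committed_batches.append(list(self.pending))
            self.pending.clear()
            self.last_commit = now

w = WalBatcher()
for i in range(10):
    w.write({"id": i}, now=i * 0.25)   # 4 writes/second arriving
# 10 writes collapse into 2 commits, with 1 write still buffered.
```

The trade is explicit: writes become durable and visible at commit granularity, not write granularity, which is exactly the sub-second visibility these workloads give up.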
The Unsolved Boundary
The system that transparently handles both modes - millions of small tenants cache-evicted by LRU, and a few massive tenants with pinned NVMe indexes - does not quite exist yet. Today, the boundary is largely manual: you split large indexes into fixed-size shards and explicitly pin them to machines. The transition from "cache and evict" to "pin and shard" still tends to require human judgment and management layers.
S3 as Consensus: The Fine Print
Conditional writes on blob storage can replace a surprising amount of coordination for single-writer-per-namespace workloads. But they are weaker than a full consensus protocol like Raft:
- No multi-key transactions. You cannot atomically update two S3 objects. Each namespace's WAL and manifest are independently consistent, but cross-namespace atomicity doesn't exist.
- No sub-millisecond coordination. S3 roundtrips are hundreds of milliseconds. Fine for batch-oriented write paths; inadequate for high-frequency coordination. The design goal is to minimize round trips on the write path.
- Partial failure ambiguity. If an S3 PUT times out, you don't know if it succeeded. Retries with conditional headers are safe (idempotent), but the failure mode is different from Raft's clear commit/abort semantics.
- Single-writer assumption. The model works because each namespace has one writer. Multiple concurrent writers to the same namespace would require external coordination at which point you will have reinvented consensus.
These limitations are acceptable for search workloads where writes are batched and namespaces are independent. They would be fatal for a general-purpose OLTP database.
VII. The Open Questions
The shift is already here. Blob storage is no longer just a backup tier for search. In the right workload it can sit directly under the system. Tenant creation can become cheap. Very large vector search can work with the right compression, caching and some smart management layers.
But there are gaps no one has closed yet.
The spectrum problem. No system today transparently handles the full range - from a million idle tenants to a few tenants with large-vector indexes - in a single architecture with automatic resource management. The pieces are proven: S3 as source of truth, stateless compute, namespace isolation, IVF with quantization, tiered caching. But the orchestration layer that automatically transitions a namespace from "small, cache-evicted, S3-served" to "large, NVMe-pinned, sharded across machines" based on its size and access pattern that is still the next major infrastructure challenge.
The algorithm transition. HNSW-based systems dominate the current market. The shift to IVF-family algorithms for disk-based indexes is happening, but slowly. And as I mentioned earlier, IVF brings its own maintenance challenges.
The cloud provider question. The details are not identical across cloud providers, and the industry story has been told much more loudly through S3 than through equivalent blob storage systems. Whether this settles into a mostly portable blob-storage architecture, or ends up shaped by provider-specific primitives and ecosystems, is still unresolved.
The hybrid workload. Most real applications are not pure text or pure vector. They need filtered vector search, hybrid text+vector retrieval, faceted aggregation on top of ANN. The blob-storage-native pattern is proven for the core search kernel. Whether it extends cleanly to the full feature surface of a production search system, or whether some features still need stateful clusters, is where the most interesting engineering is happening right now.
Same storage layer. Different physics.
That is the part worth paying attention to. It is an exciting time to be in search.
March 2026. This essay discusses architectural patterns across the search infrastructure landscape. Technical details reference work from TurboPuffer (ANN v3, January 2026), Quickwit, LanceDB (Lance format, VLDB 2025 / arXiv:2504.15247), and the broader Tantivy/Lucene ecosystem. The S3 consistency timeline references AWS announcements from December 2020 and November 2024. I work on search infrastructure at Microsoft; views are my own.