IBM demonstrates extreme scale for content-aware storage with a 100-billion-vector database
The storage-centric vector database fits on a single server to help scale retrieval-augmented generation, unlocking new value from enterprise data.
Content-aware storage (CAS) represents a new value-add paradigm for traditional storage systems. CAS aligns storage solutions with the needs of new AI workloads by pushing data-processing functions down into the storage layer. Specifically, CAS handles document vectorization using LLM-based embedding models, a process normally performed outside the storage system, to support the retrieval-augmented generation (RAG) pipeline.
With its CAS offering, IBM is making it faster, easier, and more secure to perform RAG under the same roof as the rest of your data. This new paradigm is a key element of IBM’s vision to integrate AI capabilities directly into enterprise storage systems, enabling businesses to extract untapped value from their proprietary assets without costly infrastructure expansion. “Enterprises can derive unprecedented insights from all of their documents in storage systems,” said Sam Werner, GM of IBM Storage. “It really opens the door to the next chapter in leveraging AI technology to drive business outcomes.”
At the core of the CAS solution is the vector database. Vector databases are designed to accelerate semantic searches of data, finding related documents to leverage in AI applications. In collaboration with Samsung and NVIDIA, IBM Research has successfully scaled its prototype platform to serve 100 billion vectors on a single server while maintaining recall precision of over 90% at a query latency of less than 700 milliseconds.
Meeting the needs of RAG
RAG is quickly becoming the de facto technique for enterprises using AI to extract value from proprietary documents. The idea is simple: LLMs augment prompts (context) with user data or domain-specific information to provide tailored answers.
RAG’s primary benefit is low-cost accuracy. It can generate more precise answers without the need for expensive, time-consuming fine-tuning. RAG comprises four key elements: a data ingestion pipeline, a vector database, a storage system, and an AI accelerator. The data ingestion pipeline transforms enterprise documents into semantic representations (vectors) using AI models and AI accelerators. In this process, text is extracted from documents (PDFs, PPTs, and so on) and broken into chunks. An embedding model then turns these chunks into vectors that are held in a vector database.
The vector database organizes the data so that an approximate nearest neighbor (ANN) search can be performed, making it possible to find semantically similar chunks during a RAG search. To retrieve relevant chunks, a user’s query is converted into a vector using the same embedding model that was used to vectorize the stored documents. The vector database is then used to identify neighboring vectors according to some vector distance metric (cosine similarity or L2 distance, for example). The text chunks corresponding to the most relevant vectors are then passed to the LLM as part of the prompt. This approach ensures that responses are grounded in enterprise-specific knowledge, which reduces hallucinations and improves trust in AI outputs.
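This retrieval loop can be sketched in a few lines. The `embed` function below is a toy hashed bag-of-words stand-in for a real embedding model, and the chunk texts, dimension, and function names are illustrative assumptions, not part of IBM's pipeline:

```python
import numpy as np

def embed(text: str, dim: int = 32) -> np.ndarray:
    """Toy stand-in for an LLM embedding model: hashed bag-of-words,
    normalized to unit length so dot product equals cosine similarity."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

# 1. Ingest: chunk documents and embed each chunk into the "vector database".
chunks = [
    "IBM Storage Scale is a high-performance file system.",
    "RAG grounds LLM answers in enterprise documents.",
    "Vector databases accelerate approximate nearest neighbor search.",
]
index = np.stack([embed(c) for c in chunks])

# 2. Query: embed with the SAME model, then rank by cosine similarity.
def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    scores = index @ q                 # cosine similarity (unit vectors)
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]

# 3. The retrieved chunks are prepended to the LLM prompt as context.
context = retrieve("approximate nearest neighbor vector search", k=1)
```

A production system would replace the brute-force dot product with an ANN index, but the contract is the same: one embedding model for both ingestion and queries, and a distance metric to rank candidates.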
Challenges of scaling
Today's enterprise storage supports petabytes of capacity, storing billions of files. In the context of CAS, each file is represented by potentially hundreds of vectors that, in aggregate, can quickly reach hundreds of billions. Ultimately, these vectors need to be stored and managed by the CAS vector database.
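A back-of-envelope estimate shows how quickly this adds up. The file and per-file vector counts below are illustrative round numbers; the 384-dimension full-precision vectors match the experiment described later in this post:

```python
# Back-of-envelope scale estimate (illustrative counts, not measured data).
files = 1_000_000_000          # a billion files in a petabyte-class system
vectors_per_file = 100         # order of hundreds of chunk vectors per file
dim, bytes_per_float = 384, 4  # full-precision 384-dimension embeddings

total_vectors = files * vectors_per_file            # 100 billion vectors
raw_bytes = total_vectors * dim * bytes_per_float
tib = raw_bytes / 2**40
print(f"{total_vectors:.0e} vectors ≈ {tib:.0f} TiB before indexing")
```

Raw vectors alone approach 140 TiB at this scale, before any index structures are added on top.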
With exponential growth in AI deployment, databases of this scale are needed to help organize proprietary data for AI consumption, according to Vincent Hsu, CTO and Fellow, IBM Storage. Current vector database solutions on the market can only support billions of vectors by scaling out across tens to hundreds of servers. At this scale, unique challenges arise: for example, the lengthy time to index (or reindex) the vectors to speed up search, and the rising infrastructure cost of hosting and serving them.
Rethinking the vector database for CAS
IBM's CAS is available for both on-premises and cloud deployments. To reduce deployment cost and management complexity, IBM Research focused specifically on improving vector density and reindexing time, reducing the number of servers that need to be deployed to support a given number of documents and vectors.
The first part of this approach decouples vector and index storage from the compute performing queries. This allows flexibility in provisioning different ratios of servers for query and storage systems — something only made possible by the IBM Storage Scale high-performance ESS file system.
The IBM Storage Scale System 6000 (ESS 6000) is a high-performance, all-flash storage appliance designed for AI, high-performance computing (HPC), and massive data workloads. ESS supports 4U rack-mount enclosures with up to 48 NVMe FlashCore Modules (FCM) or standard NVMe QLC/TLC drives, each with 7 to 60 TB of capacity. It supports 400 Gb InfiniBand or 200 Gb Ethernet links and utilizes PCIe Gen 5 for faster internal communication. A single ESS canister can deliver throughput of up to 340 GB/s read and 175 GB/s write per node, with up to 7M IOPS. It also supports NVIDIA GPUDirect® Storage (GDS) for high-speed, direct data delivery to GPUs, as well as NVIDIA BlueField-3 DPUs for network offloading.
The second part focuses on leveraging enterprise solid state drives to help achieve higher system-level storage performance. For this effort, IBM Research collaborated with Samsung, a global provider of advanced memory and storage technologies for AI and datacenter infrastructure. To support the ESS system’s high-performance storage requirements, Samsung provided 48 PM9D3a PCIe® Gen5 NVMe™ server SSDs, enabling a balanced architecture designed to sustain demanding data throughput and parallel processing workloads. Built on eighth-generation TLC V-NAND technology, each drive offers up to 30.72TB capacity with sequential read speeds of up to 12,000 MB/s and sequential write speeds of up to 6,800 MB/s. These commercially available, mass-produced enterprise SSDs enable practical deployment in real-world ESS environments while supporting scalable system design.
To support extreme scaling, the IBM Research team built a solution that uses a dynamic hierarchical composition of multiple indexes that can be independently optimized and re-optimized as data is added or removed from the system. This approach also improves fault tolerance and makes incremental updates and index building easier to manage, while still maintaining access to the data. “The question of scale is not just about adding more vectors and making those vectors accessible. It's also about maintaining performance and availability of service as the data grows,” said Daniel Waddington, principal research staff member for storage systems at IBM Research.
The hierarchical index design also lends itself to piecemeal housekeeping. Within the hierarchy, sub-indexes can be rebuilt independently as needed, all without disturbing the overall database. To facilitate this on-the-fly maintenance, the research team uses NVIDIA GPUs to accelerate the rebuilding of individual indexes: index building that takes hours on a CPU can be reduced to minutes on GPUs. The team paid close attention to maximizing individual GPU utilization and to scaling out across multiple GPU adapters.
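One way to picture this hierarchical composition is a top-level router over independently rebuildable sub-indexes. The sketch below is a greatly simplified illustration, not IBM's implementation: the class names and centroid-based routing are assumptions, and each shard uses brute-force search where a real system would use an ANN structure (optionally rebuilt on a GPU):

```python
import numpy as np

class SubIndex:
    """One independently rebuildable shard (brute force stands in for a
    real ANN structure that could be rebuilt on a GPU in minutes)."""
    def __init__(self, vectors):
        self.rebuild(vectors)

    def rebuild(self, vectors):
        # Rebuilding touches only this shard; other shards stay online.
        self.vectors = np.asarray(vectors)
        self.centroid = self.vectors.mean(axis=0)

    def search(self, q, k):
        d = np.linalg.norm(self.vectors - q, axis=1)   # L2 distance
        order = np.argsort(d)[:k]
        return d[order], self.vectors[order]

class HierarchicalIndex:
    """Top level routes each query to the most promising sub-indexes."""
    def __init__(self, shards):
        self.shards = shards

    def search(self, q, k=5, probes=2):
        # Route via centroid distance, then search only the probed shards.
        cents = np.array([s.centroid for s in self.shards])
        nearest = np.argsort(np.linalg.norm(cents - q, axis=1))[:probes]
        dists, vecs = zip(*(self.shards[i].search(q, k) for i in nearest))
        dists, vecs = np.concatenate(dists), np.concatenate(vecs)
        order = np.argsort(dists)[:k]
        return vecs[order]

# Two shards built from distinct clusters; either can be rebuilt alone.
rng = np.random.default_rng(0)
shard_a = SubIndex(rng.normal(0.0, 0.1, (100, 8)))
shard_b = SubIndex(rng.normal(5.0, 0.1, (100, 8)))
index = HierarchicalIndex([shard_a, shard_b])
hits = index.search(np.full(8, 5.0), k=3, probes=1)
```

Because each shard owns its own structure, `rebuild` on one shard never blocks queries that route to the others, which is the availability property the hierarchy is designed to preserve.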
By using synthetic data that was carefully generated to "look and feel" like real data (by extracting models of clustering properties from real data), the research team has been able to demonstrate loading, indexing, and querying of 100 billion vectors (384 dimensions, full-precision floating point). Initial loading and top-level partitioning took nine days, followed by index building using six NVIDIA H200 GPUs over an additional four days. For comparison, the indexing would have taken around 120 days on a 2-socket Intel CPU. The total data footprint on storage (vectors and index) was 153 TiB. The team performed experiments to measure query latency and recall precision, the latter by using brute-force search to extract ground truth from the massive dataset, a process that itself took many days to perform. The result: a mean query latency of 694 milliseconds with a recall precision of 90%.
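The recall metric itself is simple to state: the fraction of the true k nearest neighbors that the approximate search actually returns. The sketch below shows the computation with illustrative dataset sizes and a simulated ANN result that misses one true neighbor per query; the real evaluation ran brute-force ground truth over the full 100-billion-vector set:

```python
import numpy as np

rng = np.random.default_rng(42)
base = rng.standard_normal((1000, 16)).astype(np.float32)   # stored vectors
queries = rng.standard_normal((10, 16)).astype(np.float32)
k = 10

def brute_force_topk(q, k):
    # Exact ground truth: exhaustive L2 search over every stored vector.
    return np.argsort(np.linalg.norm(base - q, axis=1))[:k]

def recall_at_k(approx_ids, exact_ids):
    # Fraction of the true k nearest neighbors the ANN search recovered.
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Hypothetical ANN result: keep 9 true neighbors, swap in 1 wrong id.
scores = []
for q in queries:
    exact = brute_force_topk(q, k)
    wrong = next(i for i in range(len(base)) if i not in exact)
    approx = np.concatenate([exact[:-1], [wrong]])
    scores.append(recall_at_k(approx, exact))
mean_recall = float(np.mean(scores))    # 9 of 10 neighbors found -> 0.90
```

At 100 billion vectors the brute-force pass is itself a massive computation, which is why extracting ground truth took the team many days.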
What’s next?
Part of IBM’s strategy for AI is to remove artificial software barriers that prevent enterprises from exposing their data and applications to AI. With CAS, we are taking a crucial part of the RAG pipeline and giving that responsibility to the storage system. And the new indexing capabilities are all integrated into familiar file systems that make the entire system easy to deploy.
IBM and NVIDIA are working closely to further reduce indexing time via GPU acceleration of vector indexing with NVIDIA cuVS. Specific goals include indexing 100B+ vectors within a day; exploring GPU acceleration of data loading and partitioning to reduce ingestion time from nine days to one; and exploring strategies to reduce search latency to the 50-100 ms range at 90% recall for RAG workflows.
“We already have security built into the vector database,” said Hsu. “Now we are scaling up without a huge infrastructure footprint.”