Part 2: Indexing the Vector Space

Rajesh K · Published in GoPenAI · Jan 4, 2024


This part explores the nature, inner workings, and applications of vector indexes.

What is the difference between an index and a vector index?

Index:

  • Structure: A data structure that organizes data to speed up search and retrieval operations. It’s like a book’s index, pointing to the location of specific information within a database.
  • Purpose: Optimizes queries for exact matches and range-based searches on scalar (single-value) data.
  • Data types: Works with numbers, text, dates, etc.
  • Usage Examples:
  1. Finding products with a price between $20 and $35
  2. Retrieving customer records by email address
  3. Ordering search results by relevance score
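As a concrete illustration of a traditional index, here is a minimal sketch using Python's built-in sqlite3 module (the table and data are hypothetical). SQLite's CREATE INDEX builds a B-tree index on the scalar price column, which lets the range query below avoid a full table scan:

```python
import sqlite3

# In-memory database with a hypothetical products table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products (name, price) VALUES (?, ?)",
    [("mouse", 25.0), ("keyboard", 45.0), ("cable", 8.0), ("headset", 30.0)],
)

# A traditional (B-tree) index on the scalar price column.
conn.execute("CREATE INDEX idx_products_price ON products(price)")

# Range query that the index can serve without scanning every row.
rows = conn.execute(
    "SELECT name, price FROM products WHERE price BETWEEN 20 AND 35"
).fetchall()
print(rows)  # [('mouse', 25.0), ('headset', 30.0)]
```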

Vector Index:

  • Structure: Designed for multidimensional data called vectors, often representing semantic information (e.g., image features, text embeddings).
  • Purpose: Optimizes queries for similarity searches, finding vectors that are semantically close to a query vector.
  • Data types: Works with vectors (arrays of numbers representing features or embeddings).
  • Usage Examples:
  1. Image search: Finding similar images based on visual features
  2. Text search: Retrieving semantically related documents
  3. Recommendation systems: Suggesting similar items based on user preferences

Further understanding of vector indexing

The efficacy of generative AI models hinges upon their ability to access and integrate relevant information from vast data repositories. This necessitates a robust mechanism for efficiently searching and retrieving data points within these repositories. This is where vector indexing plays a crucial role.

Vector indexing offers a structured approach for organizing and accessing high-dimensional vector representations of data. These vectors, essentially numerical fingerprints of data objects, capture the intrinsic meaning and relationships between them. By leveraging the power of vector indexing, generative AI models can pinpoint specific data points within large datasets with remarkable speed and accuracy.

The foundation of this efficiency lies in the embedding process. Vector Embeddings translate data objects, such as text or images, into numerical vectors, thereby allowing for quantitative comparisons and proximity estimations within a multidimensional space. This spatial representation enables the vector index to identify data points that share similar features, effectively clustering them together in vector space.
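As a minimal sketch of this embedding step, the snippet below uses the sentence-transformers library with one commonly used model (the library and model name are an assumption, not the only choice; any model that maps objects to fixed-length numerical vectors plays the same role):

```python
from sentence_transformers import SentenceTransformer

# One commonly used embedding model (assumed here for illustration);
# it maps each piece of text to a 384-dimensional vector.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["Apples are a sweet fruit.", "The newspaper arrives every morning."]
embeddings = model.encode(sentences)

print(embeddings.shape)  # (2, 384): one numerical vector per sentence
```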

To make this concept easier to grasp, think of vectors as coordinates in an n-dimensional space. A practical illustration is plotting words in a 2-dimensional space. An algorithm that has learned word relations or co-occurrence statistics from a corpus, such as GloVe, can assign each word coordinates (a vector) based on its similarity to other words. These algorithms draw on principles from machine learning and natural language processing.

The accompanying diagram shows a simplified representation of this concept. Notice the proximity of the words “Apple” and “Banana,” which reflects their similarity: the distance between their vectors is small. Conversely, both fruits sit much farther from the words “Newspaper” and “Magazine.”

Source: weaviate.io
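To make the diagram concrete, the snippet below assigns hypothetical 2-dimensional coordinates to the four words (the numbers are purely illustrative, not real GloVe vectors) and measures the separation between them:

```python
import numpy as np

# Hypothetical 2-D coordinates, chosen only to mirror the diagram;
# real GloVe vectors are learned from corpus statistics and have far more dimensions.
words = {
    "Apple":     np.array([1.0, 1.2]),
    "Banana":    np.array([1.1, 1.0]),
    "Newspaper": np.array([5.0, 4.8]),
    "Magazine":  np.array([5.2, 5.0]),
}

def dist(a, b):
    # Euclidean distance between two word vectors.
    return np.linalg.norm(words[a] - words[b])

print(dist("Apple", "Banana"))     # small: the fruits sit close together
print(dist("Apple", "Newspaper"))  # large: unrelated words sit far apart
```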

Similarity measures

To search and retrieve information from a collection of embeddings, we need a way to compare two vectors. This comparison is commonly referred to as a similarity measure or similarity metric. Such metrics quantify how alike two vectors are by evaluating the distance or angle between them. Typical similarity metrics employed in vector search are:

  1. Cosine Similarity: This metric gauges the angle formed between the two vectors. The resulting values span from -1 to 1, where a value of 1 indicates that both vectors align in the same direction, -1 signifies opposite directions, and 0 implies orthogonality (perpendicularity).
  2. Dot Product / Inner Product: This metric assesses how well two vectors align with one another. The values extend from -∞ to ∞. Positive values signify that the vectors point in the same direction, negative values indicate opposite directions, and a value of 0 denotes orthogonality.
  3. Euclidean Distance: This metric measures the distance separating two vectors. The values range from 0 to ∞, with zero denoting identical vectors and larger numerical values signifying greater spatial separation between them.
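All three metrics take only a few lines of NumPy; here is a minimal sketch:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Angle-based: 1 = same direction, 0 = orthogonal, -1 = opposite.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    # Alignment including magnitude; unbounded in either direction.
    return float(np.dot(a, b))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Straight-line distance; 0 means the vectors are identical.
    return float(np.linalg.norm(a - b))

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))   # 1.0 (b is a scaled copy of a: same direction)
print(dot_product(a, b))         # 28.0
print(euclidean_distance(a, b))  # ~3.74
```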

Vector Indexes vs. Vector Databases

Vector indexes and vector databases are both tools for managing and searching vector data. However, they have different capabilities and are suited for different use cases.

Vector indexes are data structures that enable fast and accurate search and retrieval of vector embeddings from a large dataset of objects. They are typically used in applications where it is important to find the most similar items to a given query vector, such as recommender systems and natural language processing.

Vector databases are full-fledged databases designed to store, manage, and search vector data. A vector database may incorporate vector indexes as part of its functionality to enhance the speed and efficiency of similarity searches.

Types of vector indexes

There are several types of vector indexes, each with its own strengths and weaknesses. Some of the most common types include:

  1. Hash-based indexing: These indexes utilize hash functions to map vector embeddings to buckets. Locality-sensitive hashing (LSH) is a popular hash-based indexing technique that groups similar embeddings into the same bucket, enabling efficient approximate nearest neighbor (ANN) search.
  2. Tree-based indexing: These indexes organize vector embeddings into a hierarchical structure, allowing for efficient search by recursively traversing the tree. Approximate nearest neighbor search algorithms, such as Annoy (Approximate Nearest Neighbors Oh Yeah) and the vantage-point tree (VP-tree), are commonly used with tree-based indexes.
  3. Product quantization (PQ) / cluster-based indexing: This technique compresses vector embeddings by splitting them into sub-vectors and quantizing each against a small codebook of centroids, reducing storage requirements and improving search performance. PQ indexes are particularly well-suited for large-scale datasets with high-dimensional embeddings.
  4. Graph-based indexing: These indexes represent vector embeddings as nodes in a graph, connecting similar embeddings with edges. Hierarchical navigable small world (HNSW) is a popular graph-based indexing technique that efficiently navigates the graph to find approximate nearest neighbors.

Different algorithms suit different categories of vector data. All of them accelerate the vector search process, albeit at some cost in accuracy and recall.
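As one concrete way to experiment with these trade-offs, the FAISS library implements several of the index families above. The sketch below builds a graph-based HNSW index over random vectors (the dimensionality and graph parameter are illustrative settings, not recommendations):

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                                # embedding dimensionality
xb = np.random.random((10_000, d)).astype("float32")   # database vectors
xq = np.random.random((5, d)).astype("float32")        # query vectors

# Graph-based index: HNSW with 32 neighbors per node (illustrative setting).
index = faiss.IndexHNSWFlat(d, 32)
index.add(xb)  # build the navigable small-world graph

distances, ids = index.search(xq, 4)  # 4 approximate nearest neighbors per query
print(ids)        # (5, 4) array of database row ids
print(distances)  # corresponding squared L2 distances
```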

Picking the right index

The choice of vector index depends on several factors, including the size of the dataset, the dimensionality of the embeddings, the desired accuracy of the search results, and the computational resources available.

Here’s a general guideline for selecting the appropriate vector index:

For small to medium-sized datasets with low-dimensional embeddings:

  • Hash-based indexing (e.g., LSH) offers a balance between speed and accuracy, making it a suitable choice.

For large datasets with high-dimensional embeddings:

  • Product quantization (PQ) can significantly reduce storage requirements and improve search performance.
  • Graph-based indexing, such as HNSW, can efficiently navigate high-dimensional spaces for ANN search.

When high precision is crucial:

  • Tree-based indexing algorithms like Annoy or VP-tree can provide more accurate nearest-neighbor search results.

For real-time applications:

  • Hash-based indexing and PQ indexes are generally faster and more suitable for real-time scenarios.

This is just a starting point, not a strict rule. To get the best performance, consider the specific requirements of your application, evaluate different index types against your own dataset and query patterns, and explore whether a composite index can give your application an extra boost, as in the sketch below.
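As an example of such a composite index, FAISS can combine an IVF coarse partitioner with product quantization; the parameters below are illustrative, not a recommendation:

```python
import numpy as np
import faiss

d = 128
xb = np.random.random((50_000, d)).astype("float32")

# Composite index: IVF partitions the space into 256 clusters, and PQ
# compresses each vector with 16 sub-quantizers of 8 bits each.
quantizer = faiss.IndexFlatL2(d)                      # coarse quantizer
index = faiss.IndexIVFPQ(quantizer, d, 256, 16, 8)    # nlist=256, m=16, nbits=8

index.train(xb)  # both IVF and PQ need a training pass over sample vectors
index.add(xb)

index.nprobe = 8  # visit 8 of the 256 clusters per query: speed/recall trade-off
distances, ids = index.search(xb[:3], 5)
print(ids)
```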

Wrapping up

Indexing is an indispensable tool for harnessing the power of vectors and unleashing the transformative potential of LLMs. By mastering these techniques, we can unlock new frontiers in language understanding, knowledge discovery, and intelligent applications that shape our world.

Stay tuned for future explorations as we delve deeper into the exciting intersections of vectors, indexing, databases, and large language models!
