Unlike **traditional databases**, which use a **row-and-column architecture**, **vector databases** represent their data as **vectors**. For reference, a vector is a **one-dimensional array** of numbers like [0.64, 0.32, 0.97]. This means there is no **fixed grid** the vectors have to fit into, while classical operations still remain possible. All these vectors are stored in an **embedding space**, which is like a room with **many dimensions**.

Each **value** in a vector **corresponds** to a **dimension** of the embedding space. This way, much like in classical databases, you can express **similarities** and **differences**: two vectors can have **similar values in one dimension** but **differ in another**. If you could "look" at an embedding space, you would see that **similar objects** group **together** as if they were pulled by a magnet.
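To make this concrete, here is a **minimal sketch** with made-up three-dimensional embeddings (both the words and the numbers are invented purely for illustration):

```
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for illustration.
# "cat" and "dog" get similar values in every dimension; "car" does not.
embeddings = {
    "cat": np.array([0.91, 0.85, 0.12]),
    "dog": np.array([0.88, 0.80, 0.15]),
    "car": np.array([0.10, 0.20, 0.95]),
}

# The distance between the vectors shows the "magnet" effect:
print(np.linalg.norm(embeddings["cat"] - embeddings["dog"]))  # ~0.07 (close together)
print(np.linalg.norm(embeddings["cat"] - embeddings["car"]))  # ~1.33 (far apart)
```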

**Why are vector databases superior?**

Thanks to the way every vector is stored, many different tasks become much more **efficient** than in other databases. Making a **query** is very easy, as **similar vectors** are already "next" to each other. Before a vector gets added, it is "indexed": it is mapped into a data structure that makes searching faster.

When a **query** is made, an algorithm like **cosine similarity** can be used to calculate the **nearest** vectors and output them. Speaking of queries, **filters** can also be added through **pre- or post-filtering**, which either filters the queried results or queries the pre-filtered data. Compared to other database structures, querying this way shows a **faster** runtime, which makes it useful when dealing with **unstructured data** like websites or posts on social media platforms such as Twitter, Facebook, or Instagram.
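Here is a **minimal sketch** of the difference between the two filtering styles, using scikit-learn's NearestNeighbors as the index (the "post"/"image" metadata labels are made up for this example):

```
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((100, 10))                    # 100 vectors with 10 dimensions
labels = rng.choice(["post", "image"], 100)  # made-up metadata for each vector
query = rng.random((1, 10))

# Pre-filtering: keep only the matching vectors, then search that subset.
mask = labels == "post"
pre_index = NearestNeighbors(n_neighbors=5).fit(X[mask])
pre_dist, pre_idx = pre_index.kneighbors(query)  # indices refer to X[mask]

# Post-filtering: search everything (over-fetching), then drop non-matches.
index = NearestNeighbors(n_neighbors=20).fit(X)
dist, idx = index.kneighbors(query)
post_idx = [i for i in idx[0] if labels[i] == "post"][:5]
```

Pre-filtering guarantees five matching results but needs a search over each filtered subset, while post-filtering reuses one index for all queries but may end up with fewer results than requested after dropping the non-matches.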

**Vector databases at the code level**

Let’s look at how we can get the **cosine similarity** of **two vectors** using Python. We’ll use NumPy to compute the **dot product** of the two vectors and the **norm** (length) of each one. You get the similarity by dividing the dot product by the product of the two norms. Here is the code to do this:

```
import numpy as np

def cosine_similarity(vector1, vector2):
    # Dot product of the two vectors
    dot_product = np.dot(vector1, vector2)
    # Length (Euclidean norm) of each vector
    norm_vector1 = np.linalg.norm(vector1)
    norm_vector2 = np.linalg.norm(vector2)
    # Cosine similarity: dot product divided by the product of the norms
    similarity = dot_product / (norm_vector1 * norm_vector2)
    return similarity

# Example
vector1 = [0.867, 0.446, 0.236]
vector2 = [0.164, 0.648, 0.948]
similarity = cosine_similarity(vector1, vector2)
print(similarity)
# Output: 0.5628393553089471
```

As you can see, we got an output of around **0.56**, meaning the two vectors **aren’t that similar** (1 = pointing in the same direction, a 100% match; 0 = orthogonal, a 0% match; for vectors with negative components the value can even drop to -1).
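As a quick sanity check of the function above: a vector compared with itself points in exactly the same direction and should score 1, while two orthogonal vectors should score 0.

```
# A vector compared with itself: same direction, similarity 1
print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # ~1.0 (up to float rounding)

# Orthogonal vectors: no shared direction, similarity 0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```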

**Search query**

Still, this pairwise approach isn’t efficient when you are dealing with **many vectors**. Let’s look at another **algorithm**.

```
# Using the KNN algorithm
from sklearn.neighbors import NearestNeighbors
import numpy as np

def find_k_nearest_neighbors(X, query_vector, k):
    # Create an instance of NearestNeighbors backed by a k-d tree index
    nbrs = NearestNeighbors(n_neighbors=k, algorithm='kd_tree').fit(X)
    # Find the k nearest neighbors of the query vector
    distances, indices = nbrs.kneighbors(query_vector)
    # Return the k nearest neighbors
    return distances, indices

# Example usage
# Generate a random dataset
X = np.random.rand(100, 10)
# Query vector
query_vector = np.random.rand(1, 10)
# Find the 5 nearest neighbors
k = 5
distances, indices = find_k_nearest_neighbors(X, query_vector, k)
```

In this code we generate **100 random vectors**, each having **10 dimensions**. Then we run a **search query** with a vector that also has 10 dimensions. To accomplish this, we use the **KNN (k-nearest neighbors) algorithm**. If you want to **learn more** about this algorithm, plenty of material is available, but be warned: it leads into a field of **advanced mathematics**. Here is how we print the output:

```
print("K-nearest neighbors:")
for i in range(k):
    print("Neighbor:", i + 1)
    print("Distance:", distances[0][i])
    print("Index:", indices[0][i])
    print("Vector:", X[indices[0][i]])
    print()
```

```
# Output (values will vary, since the dataset is random):
# query_vector = [[0.1113187 0.75351452 0.9374165 0.76445506 0.68385283 0.72236542
#   0.70277158 0.77592412 0.28326503 0.07586657]]
# k = 5
# K-nearest neighbors:
# Neighbor: 1
# Distance: 0.6807253770735626
# Index: 32
# Vector: [0.04212379 0.25497311 0.98374969 0.50389832 0.79652829 0.65142171
# 0.53587406 0.75926397 0.03879164 0.2613137]
# Neighbor: 2
# Distance: 0.8165762180854882
# Index: 8
# Vector: [0.1940449 0.97680781 0.6981634 0.26523264 0.73049437 0.64620033
# 0.52062487 0.44178118 0.6714971 0.06594126]
# Neighbor: 3
# Distance: 0.825618290080734
# Index: 52
# Vector: [0.43931076 0.35308087 0.7583844 0.99293407 0.23197408 0.78826507
# 0.57181333 0.55572981 0.07441403 0.1839082 ]
# Neighbor: 4
# Distance: 0.9181803452355117
# Index: 31
# Vector: [0.84097593 0.94434338 0.66560885 0.55546493 0.78160776 0.61385801
# 0.70407179 0.41658848 0.33249874 0.01417456]
# Neighbor: 5
# Distance: 0.9434755476918664
# Index: 97
# Vector: [0.15430003 0.00618319 0.49802488 0.79061649 0.41424987 0.73832103
# 0.55142175 0.64870304 0.22348331 0.21883166]
```

As we can see, the 5 nearest neighbors are listed from most similar (smallest distance) to least similar (largest distance).

### Conclusion

With **more data** being added to the **internet** than ever before, old and slow databases can’t keep up. That’s why **new** and **efficient** ways have to be found to deal with it all. **Vector databases** are a brilliant example of this. Even though they aren’t exactly new, they have recently come into the spotlight: several vector-database companies and projects, such as **Pinecone**, have been **funded** in the **past few months**. Also, **Microsoft** announced that it is developing **vector search capability in Azure Cognitive Search**, which is a big step towards a better way to store and search data.
