ChromaDB vs PGVector: The Epic Battle of Vector Databases

5 min read · Oct 8, 2024

Welcome, data warriors, to the showdown of the century — ChromaDB vs PGVector! It’s the battle between two giants of vector databases, fighting for supremacy in the world of AI-powered data processing. 🎮💥

Cue dramatic music

In this blog, we’ll cover:

What the heck ChromaDB and PGVector are
Their performance in the battlefield 🏋️‍♂️
Pros and cons (yes, they have flaws, nobody’s perfect)

So grab your popcorn 🍿, and let’s dive into this epic duel.

Round 1: The Introductions

ChromaDB

Also known as: “The New Kid on the Block” 🍼

ChromaDB is a high-performance, open-source vector database specifically designed for AI applications. If you’re dealing with embeddings from models like GPT, BERT, or any of their brainy cousins, ChromaDB makes it easy to store, index, and retrieve vector data. It’s optimized for lightning-fast searches, making it a go-to for AI/ML nerds working on real-time tasks like recommendation systems and search engines.

💡Fun Fact: ChromaDB is like that ultra-modern sports car — it’s sleek, fast, and has everything fine-tuned for vectors.

PGVector (PostgreSQL as Vector Store)

AKA: “The Veteran With a Makeover” 👴

PGVector is a PostgreSQL extension that adds vector search capabilities to good ol’ PostgreSQL. It’s like giving grandpa a new pair of sneakers and watching him run a marathon. 🏃‍♂️ PostgreSQL is already a beast when it comes to handling relational data, but PGVector extends it by letting you store, index, and search high-dimensional vector data, which is crucial in AI models.

💡Fun Fact: PGVector turns the world’s most boring database into a vector-searching ninja. 🥷

Round 2: Performance — Who’s Faster? 🏎️💨

Now comes the important question: Who’s faster? Well, it depends on your use case, but here’s the TL;DR version:

ChromaDB: Built from the ground up for vectors. Vector search is its whole job, so indexing and querying embeddings is tuned out of the box (it builds an HNSW index for approximate nearest-neighbor search). It can also run fully in memory, which gives it an edge when you need low-latency queries.
PGVector: It’s still PostgreSQL at heart. For simple use cases or projects where you’re already using Postgres, it’s a solid choice. Out of the box it does exact nearest-neighbor search, which is accurate but scans every row, so for large-scale vector data or real-time AI tasks you’ll want to add an HNSW or IVFFlat index, and even then you might find it slower than purpose-built competitors. PGVector is more of a Swiss Army knife 🛠️: it can do a lot, but it may not be as fast for pure vector stuff.
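Since "it depends on your use case" is doing a lot of work here, the honest move is to measure. Below is a minimal sketch of a latency harness in plain Python. The names `brute_force_search` and `measure_latency` are made up for this post, and the brute-force search is only a stand-in so the example runs without any database installed.

```python
import random
import statistics
import time


def brute_force_search(query, vectors, k=1):
    """Exact nearest-neighbor search by squared Euclidean distance (stand-in backend)."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: sum((q - v) ** 2 for q, v in zip(query, vectors[i])),
    )
    return ranked[:k]


def measure_latency(search_fn, queries, repeats=3):
    """Run every query `repeats` times and report latency in milliseconds."""
    timings = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            search_fn(query)
            timings.append((time.perf_counter() - start) * 1000)
    return {"median_ms": statistics.median(timings), "max_ms": max(timings)}


# Synthetic data: 2,000 random 16-dimensional vectors and 5 random queries.
random.seed(0)
dim = 16
vectors = [[random.random() for _ in range(dim)] for _ in range(2000)]
queries = [[random.random() for _ in range(dim)] for _ in range(5)]

stats = measure_latency(lambda q: brute_force_search(q, vectors, k=5), queries)
print(stats)
```

To compare the two contenders, you would swap the lambda for a call into each backend (for example a ChromaDB `collection.query(...)` or a PGVector similarity search) and feed both the same vectors and queries.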

Round 3: The Pros and Cons 🤔

ChromaDB Pros 🥇

Blazing Fast: ChromaDB is like Usain Bolt in the world of vector search — speedy!
AI-Friendly: Specifically designed for AI use cases like embeddings.
In-Memory: For super-fast querying (say goodbye to long wait times).
Scalable: Easily handles massive datasets.

ChromaDB Cons 🥉

Specialized Tool: It’s great for vectors, but if you need to manage relational data too, you’ll have to run another database alongside it.
Less Mature: Since it’s a newer tool, the community and ecosystem aren’t as developed as PostgreSQL.

PGVector Pros 🥇

Postgres Powers: You get all the awesome features of PostgreSQL, plus vectors! 🎉
SQL for Days: You can mix relational data with vectors using standard SQL queries.
Great for Light Use Cases: If you’re not dealing with a billion vectors, it does the job well.
Mature: Backed by PostgreSQL’s huge ecosystem and community.
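To make "SQL for Days" concrete, here is a rough sketch of what mixing a relational filter with a vector search can look like. The `movies` table and its `release_year` column are invented for illustration. The `vector` column type and the `<->` (L2 distance) operator come from the pgvector extension. The Python part only builds the statement and its parameters, so it runs without a database.

```python
# One-time setup: enable the extension and create a table with a 2-dimensional vector column.
SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS movies (
    id bigserial PRIMARY KEY,
    title text NOT NULL,
    release_year int,
    embedding vector(2)
);
"""

# Relational filter (WHERE) and vector ranking (ORDER BY ... <->) in a single query.
HYBRID_QUERY_SQL = """
SELECT id, title, release_year
FROM movies
WHERE release_year >= %(min_year)s
ORDER BY embedding <-> %(query_vec)s::vector
LIMIT %(top_k)s;
"""


def to_pgvector_literal(values):
    """Format a Python list in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"


params = {
    "min_year": 2000,
    "query_vec": to_pgvector_literal([0.1, 0.2]),
    "top_k": 1,
}
print(HYBRID_QUERY_SQL.strip())
print(params)
```

With a driver such as psycopg 3, you would run the setup once and then pass the query and `params` to `cursor.execute(HYBRID_QUERY_SQL, params)`.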

PGVector Cons 🥉

Slower for Vectors: Can lag behind ChromaDB on large, high-dimensional data, especially if you haven’t added a vector index.
Setup: Requires PostgreSQL expertise, which can be tricky for some newbies. 😵‍💫

Round 4: Code Showdown ⚔️

Here’s a quick look at how both of these giants would work in code. Let’s assume we’re storing and searching through vectors. Ready? Let’s do this! 🤓

ChromaDB Sample Code

import chromadb

# Create a client
client = chromadb.Client()

# Create a collection (kind of like a table, but for vectors)
collection = client.create_collection("movies")

# Insert some vectors
collection.add(
    ids=["1", "2", "3"],
    embeddings=[[0.1, 0.2], [0.2, 0.1], [0.9, 0.8]],
    metadatas=[{"title": "Inception"}, {"title": "Matrix"}, {"title": "Avatar"}],
)

# Query by vector
results = collection.query(
    query_embeddings=[[0.1, 0.2]],
    n_results=1
)
print(results)

🛠️ Setup: Pretty simple, right? Just focus on vectors, and ChromaDB handles the rest.
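Curious what that query is doing under the hood? Here is a small plain-Python sketch that ranks the same three movie vectors by hand. It assumes ChromaDB's default distance metric, which to my knowledge is squared L2 (Euclidean) distance. The helper `squared_l2` is my own name, not part of ChromaDB.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Same data as the ChromaDB sample above.
ids = ["1", "2", "3"]
titles = ["Inception", "Matrix", "Avatar"]
embeddings = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8]]
query = [0.1, 0.2]

# Rank every stored vector by its distance to the query, closest first.
ranked = sorted(
    zip(ids, titles, embeddings),
    key=lambda row: squared_l2(query, row[2]),
)
for doc_id, title, emb in ranked:
    print(doc_id, title, round(squared_l2(query, emb), 4))
```

"Inception" lands on top with a distance of 0 because its vector is identical to the query. A real vector database avoids scanning every vector like this by using an index, but the ranking idea is the same.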

PGVector Sample Code

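from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings  # assumption: any LangChain embedding model works here
from langchain_postgres import PGVector

# Embedding model used to turn documents and queries into vectors.
# OpenAI is only an example choice (needs the langchain-openai package and an OPENAI_API_KEY).
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")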

connection = "postgresql+psycopg://langchain:langchain@localhost:6024/langchain"  # Uses psycopg3!
collection_name = "my_docs"

vector_store = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)

docs = [
    Document(
        page_content="there are cats in the pond",
        metadata={"id": 1, "location": "pond", "topic": "animals"},
    ),
    Document(
        page_content="ducks are also found in the pond",
        metadata={"id": 2, "location": "pond", "topic": "animals"},
    ),
    Document(
        page_content="fresh apples are available at the market",
        metadata={"id": 3, "location": "market", "topic": "food"},
    ),
    Document(
        page_content="the market also sells fresh oranges",
        metadata={"id": 4, "location": "market", "topic": "food"},
    ),
    Document(
        page_content="the new art exhibit is fascinating",
        metadata={"id": 5, "location": "museum", "topic": "art"},
    ),
    Document(
        page_content="a sculpture exhibit is also at the museum",
        metadata={"id": 6, "location": "museum", "topic": "art"},
    ),
    Document(
        page_content="a new coffee shop opened on Main Street",
        metadata={"id": 7, "location": "Main Street", "topic": "food"},
    ),
    Document(
        page_content="the book club meets at the library",
        metadata={"id": 8, "location": "library", "topic": "reading"},
    ),
    Document(
        page_content="the library hosts a weekly story time for kids",
        metadata={"id": 9, "location": "library", "topic": "reading"},
    ),
    Document(
        page_content="a cooking class for beginners is offered at the community center",
        metadata={"id": 10, "location": "community center", "topic": "classes"},
    ),
]

vector_store.add_documents(docs, ids=[doc.metadata["id"] for doc in docs])

# Performing a simple similarity search can be done as follows:
results = vector_store.similarity_search(
    "kitty", k=10, filter={"id": {"$in": [1, 5, 2, 9]}}
)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")

# You can also transform the vector store into a retriever for easier usage in your chains.

retriever = vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 1})
retriever.invoke("kitty")

🛠️ Setup: A bit more work (because Postgres), but once it’s rolling, it’s smooth.
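The retriever above uses search_type="mmr", which is short for maximal marginal relevance: pick results that are relevant to the query but not near-copies of each other. Here is a minimal plain-Python sketch of the textbook MMR selection loop. It is not LangChain's actual implementation, and the function names and toy vectors are my own.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            # Reward similarity to the query, penalize similarity to earlier picks.
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [0.9, 0.1],    # most relevant
    [0.9, 0.12],   # near-duplicate of the first candidate
    [0.8, -0.6],   # less relevant, but points somewhere different
]
print(mmr(query, candidates, k=2))                   # balanced: skips the near-duplicate
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # pure relevance: keeps it
```

With lambda_mult=0.5 the second pick is the "different" vector rather than the near-duplicate. With lambda_mult=1.0 the diversity penalty disappears and MMR behaves like a plain similarity search.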

Round 5: The Verdict 🏆

Use ChromaDB if you’re working in AI/ML and need a super-fast, vector-first database that handles embeddings like a pro.
Use PGVector if you’re already using PostgreSQL and want to add vector search without setting up a separate database.

So, which one’s better? It really depends on your needs. If you’re all about speed and working with embeddings daily, ChromaDB is your guy. If you want a flexible solution that plays nice with relational data, PGVector is your go-to.

Moral of the story: Both are awesome. But remember, with great power comes great responsibility (or sometimes just slower queries). 😎

References:

ChromaDB: https://www.trychroma.com/

PGVector: https://python.langchain.com/docs/integrations/vectorstores/pgvector/

I hope you enjoyed it! If you want to geek out about AI, vector databases, or just exchange memes, feel free to connect with me on LinkedIn. Let’s nerd out together! 😄