Prompt Engineering book, Chapter 10: Vector Databases & RAG

May 13, 2026

Introduction: The “Long-Term Memory” of AI

In Part 1, we learned about Context Engineering, where we manually injected data into a prompt. But what if you have 10,000 documents? You can’t put them all into a single prompt. This is where Vector Databases and RAG (Retrieval-Augmented Generation) come in.

In 2026, a Vector DB is the “Long-Term Memory” of your AI system. It allows the model to “look up” information from a massive external knowledge base in milliseconds, ensuring that its answers are grounded in real-time, domain-specific facts rather than its own (sometimes outdated) training data.

Deep Technical Analysis: The Retrieval Engine

The shift from “Simple Vector Search” to “Production RAG” is built on three technical pillars:

1. Vector Embeddings (Semantic Space)

A vector is a list of numbers that represents the Meaning of a piece of text. In 2026, we use Multimodal Embeddings that can represent text, images, and audio in the same mathematical space. When a user asks a question, we “embed” it and find the documents whose vectors are “closest” (usually using Cosine Similarity).
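
To make “closest” concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors and the document names are made-up toy values, not output from a real embedding model, which would produce hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; ranges from -1.0 to 1.0."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" (illustrative values only).
query = [0.9, 0.1, 0.0]
documents = {
    "billing": [0.8, 0.2, 0.1],
    "office": [0.0, 0.1, 0.9],
}

scores = {name: cosine_similarity(query, vec) for name, vec in documents.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The query points in almost the same direction as the “billing” vector, so that document scores close to 1.0, while the “office” vector is nearly orthogonal and scores close to 0.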

2. Hybrid Search (BM25 + Dense Vectors)

Research has shown that “Semantic Search” (vectors) is great for intent but bad for exact keywords (like product IDs or rare names). Modern systems use Hybrid Search, which combines:

Dense Vectors: For “The spirit of the query.”
BM25 / Sparse Vectors: For “The letter of the query.”
Metadata Filtering: For “The context of the query” (Date, UserID, Permissions).

3. Reranking and “Long-Context” RAG

The “Vector DB” usually returns the top 10-20 results. However, research (Liu et al., 2024) shows that models use long contexts unevenly: accuracy tends to drop when the relevant passage sits in the middle of a long prompt. We use a Cross-Encoder Reranker to score those 20 results against the specific question and pass only the top 3-5 to the model, which keeps the prompt short and focused and can reduce hallucinations.

Why Vector DBs Solve Real-World Problems

In practice, Vector Databases and RAG solve several critical production issues:

The “Knowledge Cutoff”: Models like GPT-4 are frozen in time. A Vector DB can store a news article from 5 minutes ago, giving your AI “instant knowledge” of current events.
Private Data Security: You can store sensitive company data in a Vector DB and use Metadata Filters to ensure the AI only “sees” the data that the current user is allowed to access.
Hallucination Prevention: By providing the model with “Ground Truth” context, you shift the model’s task from “Inventing an answer” to “Summarizing a fact.” This is one of the most effective ways to build reliable AI.
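
The “Ground Truth” step can be sketched as a small prompt builder. This is one possible shape, not a prescribed template: the function name `build_grounded_prompt`, the instruction wording, and the `max_chars` budget are all assumptions for illustration.

```python
def build_grounded_prompt(question: str, chunks: list[str], max_chars: int = 2000) -> str:
    """Assemble a prompt that asks the model to answer only from the given chunks."""
    context_lines = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        entry = f"[{number}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break  # stop before exceeding the context budget
        context_lines.append(entry)
        used += len(entry)

    context = "\n".join(context_lines) if context_lines else "(no context retrieved)"
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    "When is the office closed?",
    ["The office is closed on public holidays.", "Invoices are due in 30 days."],
)
print(prompt)
```

Numbering the chunks lets the model cite which passage it used, and the explicit “say you do not know” instruction gives it a safe way out when retrieval comes back empty.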

Practical Implementation: 8 Python Examples

These examples demonstrate how to build and optimize RAG systems using modern Python patterns and databases.

Example 1: Basic Semantic Search logic

Problem: You need to find the most relevant document for a user question based on “Meaning” rather than “Keywords.” Solution: Use an embedding model to convert text to vectors and calculate the similarity.

from typing import List

from sentence_transformers import SentenceTransformer, util

# 1. Initialize a lightweight embedding model
# In 2026, MiniLM is used for fast local search, while text-embedding-3 is for cloud.
model = SentenceTransformer('all-MiniLM-L6-v2')

def find_semantically_closest(query: str, corpus: List[str], top_k: int = 1) -> List[str]:
    """
    Demonstrates vector similarity search.

    Approach:
    1. Embed query and corpus into vector space.
    2. Use Cosine Similarity to find proximity.
    """
    # Convert text to tensors
    query_emb = model.encode(query, convert_to_tensor=True)
    corpus_embs = model.encode(corpus, convert_to_tensor=True)

    # Calculate cosine scores (range -1.0 to 1.0; higher means more similar)
    scores = util.cos_sim(query_emb, corpus_embs)[0]

    # Get indices of top results, highest score first
    top_indices = scores.argsort(descending=True)[:top_k].tolist()

    return [corpus[i] for i in top_indices]

# Execution Example
if __name__ == "__main__":
    docs = ["How to pay your bill.", "Our office is in NYC.", "Resetting your password."]
    match = find_semantically_closest("settle my account", docs)
    print(f"Best Match: {match}")

Why this is preferred: It understands Synonyms. Even if the user doesn’t use the word “pay,” the semantic vector for “How do I settle my account?” will still match the billing document.

Example 2: Hybrid Search (Keywords + Vectors)

Problem: A user searches for a specific product ID like “SKU-9901”. Vector search might return “Running Shoes” instead of the exact SKU. Solution: Combine vector search with a traditional “Keyword” search (BM25) and merge the two rankings with the “Reciprocal Rank Fusion” (RRF) algorithm.

from __future__ import annotations

from dataclasses import dataclass
from typing import List, Dict, Any, Tuple
from collections import defaultdict

import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


# -----------------------------
# Data Model
# -----------------------------

@dataclass
class Document:
    id: str
    text: str
    metadata: Dict[str, Any]


# -----------------------------
# Hybrid Retriever
# -----------------------------

class HybridRetriever:
    """
    Production-style Hybrid Search:
    - Dense vector retrieval
    - BM25 sparse retrieval
    - Reciprocal Rank Fusion (RRF)
    """

    def __init__(
        self,
        documents: List[Document],
        embedding_model: str = "all-MiniLM-L6-v2",
    ):
        self.documents = documents

        # Dense embedding model
        self.model = SentenceTransformer(embedding_model)

        # Precompute document embeddings
        self.doc_texts = [doc.text for doc in documents]
        self.doc_embeddings = self.model.encode(
            self.doc_texts,
            convert_to_numpy=True,
            normalize_embeddings=True,
        )

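        # BM25 setup (lowercase and strip edge punctuation so "guide." matches "guide")
        tokenized = [
            [token.strip(".,!?;:") for token in text.lower().split()]
            for text in self.doc_texts
        ]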
        self.bm25 = BM25Okapi(tokenized)

    # -----------------------------
    # Dense Retrieval
    # -----------------------------

    def dense_search(
        self,
        query: str,
        top_k: int = 10,
    ) -> List[Tuple[Document, float]]:
        """
        Semantic similarity search using cosine similarity.
        """

        query_embedding = self.model.encode(
            [query],
            convert_to_numpy=True,
            normalize_embeddings=True,
        )

        similarities = cosine_similarity(
            query_embedding,
            self.doc_embeddings,
        )[0]

        top_indices = np.argsort(similarities)[::-1][:top_k]

        return [
            (self.documents[idx], float(similarities[idx]))
            for idx in top_indices
        ]

    # -----------------------------
    # Sparse Retrieval (BM25)
    # -----------------------------

    def sparse_search(
        self,
        query: str,
        top_k: int = 10,
    ) -> List[Tuple[Document, float]]:
        """
        BM25 keyword retrieval.
        """

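        # Lowercase and strip edge punctuation so "invoice?" matches "invoice"
        tokenized_query = [
            token.strip(".,!?;:") for token in query.lower().split()
        ]

        scores = self.bm25.get_scores(tokenized_query)

        top_indices = np.argsort(scores)[::-1][:top_k]

        # Skip documents that share no term with the query (score of 0),
        # so they do not earn rank credit during fusion.
        return [
            (self.documents[idx], float(scores[idx]))
            for idx in top_indices
            if scores[idx] != 0
        ]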

    # -----------------------------
    # Reciprocal Rank Fusion
    # -----------------------------

    @staticmethod
    def reciprocal_rank_fusion(
        rankings: List[List[Tuple[Document, float]]],
        k: int = 60,
    ) -> List[Tuple[Document, float]]:
        """
        Reciprocal Rank Fusion (RRF)

        RRF Score:
            score += 1 / (k + rank)
        """

        fused_scores = defaultdict(float)
        doc_lookup = {}

        for ranking in rankings:
            for rank, (doc, _) in enumerate(ranking, start=1):
                fused_scores[doc.id] += 1 / (k + rank)
                doc_lookup[doc.id] = doc

        reranked = sorted(
            fused_scores.items(),
            key=lambda x: x[1],
            reverse=True,
        )

        return [
            (doc_lookup[doc_id], score)
            for doc_id, score in reranked
        ]

    # -----------------------------
    # Hybrid Search
    # -----------------------------

    def hybrid_retrieval_logic(
        self,
        query: str,
        top_k: int = 5,
    ) -> List[Tuple[Document, float]]:
        """
        Combines:
        - Dense semantic search
        - BM25 sparse retrieval
        - RRF fusion
        """

        dense_results = self.dense_search(query, top_k=20)
        sparse_results = self.sparse_search(query, top_k=20)

        fused = self.reciprocal_rank_fusion(
            [dense_results, sparse_results]
        )

        return fused[:top_k]


# -----------------------------
# Example Usage
# -----------------------------

if __name__ == "__main__":

    docs = [
        Document(
            id="1",
            text="How to reset your password safely.",
            metadata={"category": "auth"},
        ),
        Document(
            id="2",
            text="Invoice payment and billing guide.",
            metadata={"category": "billing"},
        ),
        Document(
            id="3",
            text="SKU-9901 running shoes available now.",
            metadata={"category": "products"},
        ),
    ]

    retriever = HybridRetriever(docs)

    results = retriever.hybrid_retrieval_logic(
        "How do I settle my invoice?"
    )

    for doc, score in results:
        print(f"{score:.4f} | {doc.text}")

Why this is preferred: Hybrid search is a widely used pattern in production RAG. It provides the best of both worlds: it captures user intent while still finding specific, exact-match data such as product IDs.
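
To see why fusion rewards agreement between the two retrievers, it helps to run the RRF arithmetic by hand. The sketch below works on plain document IDs; the IDs and both rankings are invented for illustration.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion over ranked lists of document IDs."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings for the query "SKU-9901":
dense_ranking = ["shoes_guide", "sku_9901", "returns_policy"]  # semantic neighbours
sparse_ranking = ["sku_9901", "size_chart"]                    # exact keyword hits

fused = rrf([dense_ranking, sparse_ranking])
for doc_id, score in fused:
    print(f"{score:.4f} | {doc_id}")
```

The SKU document is only second in the dense ranking, but it appears in both lists, so it collects 1/62 + 1/61 and overtakes the document that is first in a single list (1/61). The constant k = 60 flattens the differences between neighbouring ranks, so appearing in several lists matters more than the exact position in any one of them.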

Example 3: Document Chunking with Overlap

Problem: A 50-page PDF is too big for a single vector. If you just cut it in half, you might split a sentence in the middle, losing the meaning. Solution: Split the text into chunks with an Overlap so that context is preserved at the boundaries of each chunk. Libraries usually do this with a “Recursive Character Splitter”; the code below uses a simpler fixed-size sliding window.

from typing import List

def chunk_with_overlap(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    """
    Splits text into overlapping segments to preserve semantic continuity.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")

    chunks = []
    # Simplified logic: jump by (size - overlap)
    step = size - overlap
    for i in range(0, len(text), step):
        chunks.append(text[i:i + size])
        if i + size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks

# Execution Example
if __name__ == "__main__":
    segments = chunk_with_overlap("Long document...", size=200, overlap=50)
    print(segments)

Why this is preferred: It ensures that every chunk has enough surrounding context to be meaningful on its own. Overlap is the “Glue” that prevents information from being “lost at the edge.”
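
For comparison, here is a rough sketch of the recursive idea: try the coarsest separator first (paragraphs), and only fall back to finer ones (lines, sentences, words) for pieces that are still too long. The function name, the separator list, and the defaults are assumptions; library implementations differ in details such as overlap handling, which this sketch leaves out.

```python
def recursive_split(
    text: str,
    max_len: int = 200,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split text into chunks of at most max_len characters, preferring coarse boundaries."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(text) <= max_len:
        return [text] if text.strip() else []

    for index, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks: list[str] = []
        current = ""
        for piece in text.split(sep):
            candidate = piece if not current else current + sep + piece
            if len(candidate) <= max_len:
                current = candidate  # keep merging small pieces
                continue
            if current:
                chunks.append(current)
                current = ""
            if len(piece) <= max_len:
                current = piece
            else:
                # Piece is still too long: retry with the finer separators only.
                chunks.extend(recursive_split(piece, max_len, separators[index + 1:]))
        if current:
            chunks.append(current)
        return [c for c in chunks if c.strip()]

    # No separator applies: fall back to a hard character cut.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


sample = (
    "First paragraph about billing. It has two sentences.\n\n"
    "Second paragraph about password resets."
)
print(recursive_split(sample, max_len=60))
```

With a 60-character limit the sample splits cleanly at the paragraph break, so no sentence is cut in half. Note that the separator itself is dropped wherever a chunk boundary falls.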

Example 4: Metadata Filtering for Permissions

Problem: You don’t want the AI to show “Manager Salaries” to a “Junior Employee.” Solution: Store permission metadata with each vector and use a Hard Filter during retrieval.

from __future__ import annotations

from dataclasses import dataclass
from typing import List, Dict, Any

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


@dataclass
class SecureDocument:
    id: str
    text: str
    allowed_groups: List[str]
    metadata: Dict[str, Any]


class SecureVectorSearch:
    """
    Retrieval-layer security filtering.
    """

    def __init__(
        self,
        documents: List[SecureDocument],
        embedding_model: str = "all-MiniLM-L6-v2",
    ):
        self.documents = documents

        self.model = SentenceTransformer(embedding_model)

        self.embeddings = self.model.encode(
            [doc.text for doc in documents],
            convert_to_numpy=True,
            normalize_embeddings=True,
        )

    def _filter_documents(
        self,
        user_role: str,
    ) -> List[int]:
        """
        Only return indices user can access.
        """

        allowed_indices = []

        for idx, doc in enumerate(self.documents):

            if (
                user_role in doc.allowed_groups
                or "public" in doc.allowed_groups
            ):
                allowed_indices.append(idx)

        return allowed_indices

    def search(
        self,
        query: str,
        user_role: str,
        top_k: int = 5,
    ) -> List[Dict[str, Any]]:
        """
        Secure retrieval with metadata filtering.
        """

        allowed_indices = self._filter_documents(user_role)

        if not allowed_indices:
            return []

        query_embedding = self.model.encode(
            [query],
            convert_to_numpy=True,
            normalize_embeddings=True,
        )

        allowed_embeddings = self.embeddings[allowed_indices]

        similarities = cosine_similarity(
            query_embedding,
            allowed_embeddings,
        )[0]

        ranked = np.argsort(similarities)[::-1][:top_k]

        results = []

        for rank_idx in ranked:

            original_idx = allowed_indices[rank_idx]

            doc = self.documents[original_idx]

            results.append(
                {
                    "id": doc.id,
                    "text": doc.text,
                    "score": float(similarities[rank_idx]),
                    "metadata": doc.metadata,
                }
            )

        return results


# -----------------------------
# Example Usage
# -----------------------------

if __name__ == "__main__":

    documents = [
        SecureDocument(
            id="1",
            text="Public holiday schedule.",
            allowed_groups=["public"],
            metadata={"type": "general"},
        ),
        SecureDocument(
            id="2",
            text="Executive salary policy.",
            allowed_groups=["hr_manager"],
            metadata={"type": "confidential"},
        ),
    ]

    search_engine = SecureVectorSearch(documents)

    results = search_engine.search(
        query="salary policy",
        user_role="employee",
    )

    print(results)

Why this is preferred: Enforcing access at the retrieval layer is the reliable way to build Secure AI. You should never rely on the prompt (“Only look at documents you have access to”); you must physically restrict the data before it ever reaches the model.

Example 5: “Small-to-Big” Retrieval (Parent Document)

Problem: A small 200-word chunk is great for “Finding” the answer, but the model might need the “Whole Chapter” to provide a good summary. Solution: Search for the small chunk, but return the Parent Document (the whole chapter) to the LLM.

from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, List, Any

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


@dataclass
class ChildChunk:
    id: str
    parent_id: str
    text: str


@dataclass
class ParentDocument:
    id: str
    full_text: str


class ParentDocumentRetriever:
    """
    Small-to-big retrieval architecture.
    """

    def __init__(
        self,
        child_chunks: List[ChildChunk],
        parent_docs: Dict[str, ParentDocument],
        embedding_model: str = "all-MiniLM-L6-v2",
    ):
        self.child_chunks = child_chunks
        self.parent_docs = parent_docs

        self.model = SentenceTransformer(embedding_model)

        self.chunk_embeddings = self.model.encode(
            [c.text for c in child_chunks],
            convert_to_numpy=True,
            normalize_embeddings=True,
        )

    def retrieve(
        self,
        query: str,
        top_k: int = 3,
    ) -> List[Dict[str, Any]]:
        """
        1. Search small chunks
        2. Return parent documents
        """

        query_embedding = self.model.encode(
            [query],
            convert_to_numpy=True,
            normalize_embeddings=True,
        )

        similarities = cosine_similarity(
            query_embedding,
            self.chunk_embeddings,
        )[0]

        top_indices = np.argsort(similarities)[::-1][:top_k]

        results = []

        seen_parents = set()

        for idx in top_indices:

            chunk = self.child_chunks[idx]

            if chunk.parent_id in seen_parents:
                continue

            parent_doc = self.parent_docs[chunk.parent_id]

            results.append(
                {
                    "chunk_match": chunk.text,
                    "parent_document": parent_doc.full_text,
                    "score": float(similarities[idx]),
                }
            )

            seen_parents.add(chunk.parent_id)

        return results


# -----------------------------
# Example Usage
# -----------------------------

if __name__ == "__main__":

    parent_docs = {
        "chapter_1": ParentDocument(
            id="chapter_1",
            full_text="""
            Full chapter about payment systems,
            invoices, refunds, and subscriptions.
            """,
        )
    }

    chunks = [
        ChildChunk(
            id="c1",
            parent_id="chapter_1",
            text="How to pay an invoice online.",
        ),
        ChildChunk(
            id="c2",
            parent_id="chapter_1",
            text="Refund processing instructions.",
        ),
    ]

    retriever = ParentDocumentRetriever(
        child_chunks=chunks,
        parent_docs=parent_docs,
    )

    results = retriever.retrieve("invoice payment")
    print(results)

Why this is preferred: It optimizes for both Search Precision (small chunks are better vectors) and Generation Quality (big context is better for reasoning).
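
The retriever above assumes the child chunks already exist. One way to produce them, sketched here with a hypothetical `build_children` helper and a word-based chunk size, is to split each parent and stamp every piece with its `parent_id`:

```python
from dataclasses import dataclass


@dataclass
class Child:
    id: str
    parent_id: str
    text: str


def build_children(parents: dict[str, str], size: int = 200) -> list[Child]:
    """Split each parent text into chunks of `size` words that remember their parent."""
    if size <= 0:
        raise ValueError("size must be positive")
    children: list[Child] = []
    for parent_id, full_text in parents.items():
        words = full_text.split()
        for number, start in enumerate(range(0, len(words), size)):
            children.append(
                Child(
                    id=f"{parent_id}-c{number}",
                    parent_id=parent_id,
                    text=" ".join(words[start:start + size]),
                )
            )
    return children


# Tiny demo with a 2-word chunk size so the split is easy to follow.
parents = {"chapter_1": "one two three four five", "chapter_2": "six seven"}
children = build_children(parents, size=2)
for child in children:
    print(child.id, "->", child.text)
```

Only the child texts are embedded and searched; the `parent_id` is the link that lets the retriever swap a matching chunk for its full chapter.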

Example 6: Reranking with a Cross-Encoder

Problem: The vector DB’s “Top 1” result isn’t always the best one. Solution: Retrieve the top 20 “Candidate” documents, then use a more powerful “Reranker” model to pick the top 5.

from typing import Any, Dict, List

from sentence_transformers import CrossEncoder

# 1. Initialize a specialized reranking model
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query: str, candidates: List[Dict[str, Any]], top_n: int = 5) -> List[Dict[str, Any]]:
    """
    Uses a Cross-Encoder to re-score retrieved documents and keep the top_n.
    """
    if not candidates:
        return []

    # 2. Prepare pairs for scoring
    pairs = [[query, c['text']] for c in candidates]

    # 3. Get high-precision relevance scores
    scores = reranker.predict(pairs)

    # 4. Attach scores to copies (the caller's dicts stay untouched) and sort
    scored = [
        {**c, 'rerank_score': float(score)}
        for c, score in zip(candidates, scores)
    ]

    return sorted(scored, key=lambda x: x['rerank_score'], reverse=True)[:top_n]

# Execution Example
if __name__ == "__main__":
    raw_docs = [{"text": "Apple pie..."}, {"text": "Apple M3 Chip..."}]
    sorted_docs = rerank_results("How to bake?", raw_docs)
    print(sorted_docs)

Why this is preferred: Reranking is often one of the largest accuracy gains available in a RAG pipeline. Cross-encoders are much slower than vector search because they score each query-document pair jointly, but they are significantly more accurate at finding the “Perfect Needle.”

Example 7: Self-Querying (Natural Language to Metadata)

Problem: A user asks “Show me the 2024 reports from the Marketing department.” A vector search will treat “2024” and “Marketing” as ordinary text rather than as hard constraints. Solution: Use an LLM to “Translate” that query into a structured metadata filter.

from pydantic import BaseModel

class StructuredFilter(BaseModel):
    query: str
    year: int
    dept: str

def generate_db_filter(user_input: str) -> StructuredFilter:
    """
    Uses an LLM to extract structured filters from natural language.
    """
    # Mock LLM call to extract filters
    # In production, use instructor or function calling
    return StructuredFilter(query="reports", year=2024, dept="marketing")

# Execution Example (`db` is a placeholder for your vector DB client):
# parsed = generate_db_filter("2024 Marketing reports")
# results = db.search(parsed.query, filter={"year": parsed.year, "dept": parsed.dept})

Why this is preferred: It enables Structured Search via natural language. It allows users to query your database with high precision without needing to learn SQL or complex UI filters.
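
Since the `db.search` call above is only a placeholder, here is a small in-memory sketch of what applying the extracted filter could look like. `Report`, `ParsedFilter`, and `apply_filter` are hypothetical names, and the records are invented. The filter fields are optional here on the assumption that users will not always mention a year or a department.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Report:
    title: str
    year: int
    dept: str


@dataclass
class ParsedFilter:
    query: str
    year: Optional[int] = None  # None means "no constraint on year"
    dept: Optional[str] = None  # None means "no constraint on department"


def apply_filter(reports: list[Report], parsed: ParsedFilter) -> list[Report]:
    """Keep only the records that satisfy every constraint present in the filter."""
    matches = []
    for report in reports:
        if parsed.year is not None and report.year != parsed.year:
            continue
        if parsed.dept is not None and report.dept.lower() != parsed.dept.lower():
            continue
        matches.append(report)
    return matches


reports = [
    Report("Q1 campaign results", 2024, "Marketing"),
    Report("Hiring plan", 2024, "HR"),
    Report("Brand survey", 2023, "Marketing"),
]

matches = apply_filter(reports, ParsedFilter(query="reports", year=2024, dept="marketing"))
print([r.title for r in matches])
```

In a real system the surviving records would then be ranked by vector similarity against `parsed.query`; the hard filter only narrows the candidate set.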

Example 8: Evaluation of RAG with RAGAS

Problem: How do you know if your RAG system is actually better today than it was yesterday? Solution: Use the RAGAS framework to measure “Faithfulness” (is the answer in the context?) and “Relevance.”

from datasets import Dataset
from ragas import evaluate

from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)


def evaluate_rag_transaction(
    query: str,
    context: list[str],
    answer: str,
    ground_truth: str,
):
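    """
    Evaluate a single RAG transaction using RAGAS.

    Note: these metrics are scored by an LLM judge, so an LLM API key
    must be configured before running. The column names below follow
    the ragas 0.1.x dataset schema and may differ in other versions.
    """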

    dataset = Dataset.from_dict(
        {
            "question": [query],
            "answer": [answer],
            "contexts": [context],
            "ground_truth": [ground_truth],
        }
    )

    result = evaluate(
        dataset=dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall,
        ],
    )

    return result


# -----------------------------
# Example Usage
# -----------------------------

if __name__ == "__main__":

    query = "How do I reset my password?"

    context = [
        "Users can reset passwords from the settings page."
    ]

    answer = (
        "You can reset your password "
        "from the account settings page."
    )

    ground_truth = (
        "Password resets are available "
        "through account settings."
    )

    scores = evaluate_rag_transaction(
        query=query,
        context=context,
        answer=answer,
        ground_truth=ground_truth,
    )

    print(scores)

Why this is preferred: It provides Data-Driven Engineering. You can’t improve what you can’t measure. RAGAS allows you to benchmark your chunking, embedding, and reranking strategies objectively.
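
Before reaching for an LLM-judged framework, it can help to track a couple of plain retrieval metrics that need no model at all. The sketch below is a complement to RAGAS, not part of it: the function names, the one-relevant-document-per-query assumption, and the sample data are all illustrative.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[str], k: int) -> float:
    """Fraction of queries whose relevant document appears in the top k results."""
    if not retrieved:
        return 0.0
    hits = sum(1 for docs, target in zip(retrieved, relevant) if target in docs[:k])
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved: list[list[str]], relevant: list[str]) -> float:
    """Average of 1/rank of the relevant document (0 when it was not retrieved)."""
    if not retrieved:
        return 0.0
    total = 0.0
    for docs, target in zip(retrieved, relevant):
        if target in docs:
            total += 1.0 / (docs.index(target) + 1)
    return total / len(retrieved)


# Three test queries: ranked results from the retriever, and the ID that should be found.
retrieved = [["d1", "d2", "d3"], ["d5", "d4", "d6"], ["d7", "d8", "d9"]]
relevant = ["d1", "d4", "d0"]

print(f"hit@1 = {hit_rate_at_k(retrieved, relevant, k=1):.2f}")
print(f"hit@3 = {hit_rate_at_k(retrieved, relevant, k=3):.2f}")
print(f"MRR   = {mean_reciprocal_rank(retrieved, relevant):.2f}")
```

Here the first query finds its document at rank 1, the second at rank 2, and the third misses entirely, so MRR is (1 + 0.5 + 0) / 3 = 0.5. Rerunning the same labeled queries after each change to chunking or reranking gives a cheap regression check.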

Conclusion: The Knowledge Layer

Vector Databases and RAG have transformed LLMs from “static calculators” into “dynamic knowledge systems.” By mastering hybrid search, chunking, and reranking, you ensure that your AI is always working with the most accurate, secure, and relevant information available.

In the next part, we will look at the Big Shift: Programmatic Prompting (DSPy), where we learn how to automate the creation of these prompts and systems entirely.

References & Further Reading

Firecrawl (2026): Best Vector Databases: A Complete Comparison Guide.
Edlitera: Vector Databases for RAG: Understanding Pinecone, Weaviate, and Qdrant.
VectorDBBench: Open Source Benchmarks for Vector Databases.
Liu et al. (2024): Lost in the Middle: How Language Models Use Long Contexts.

Ivan’s Substack
