Building RAG (Retrieval-Augmented Generation) with open-s...
This guide outlines the implementation of a production-grade RAG pipeline, focusing on the transition from basic vector search to a high-precision retrieval system. We address the core challenges of chunking strategy, hybrid search, and reranking to ensure responses are grounded in provided data while maintaining low latency.
Define Chunking and Overlap Strategy
Select a chunking strategy based on your document structure. For standard technical documentation, a recursive character splitter with a chunk size of 512-1024 tokens and a 10-15% overlap is recommended to maintain semantic context across boundaries.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len,
    is_separator_regex=False,
)
chunks = text_splitter.split_text(raw_document_content)

⚠ Common Pitfalls
- Using fixed-length splitting that cuts through sentences or code blocks.
- Zero overlap, leading to loss of context at the edges of chunks.
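The role of overlap can be seen with a minimal sliding-window splitter. This is a simplified stand-in for the recursive splitter above, not its actual algorithm; here the "tokens" are just words:

```python
def sliding_window_chunks(tokens, chunk_size, overlap):
    """Split a token list into overlapping windows so adjacent
    chunks share `overlap` tokens of context at their boundary."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

words = "the quick brown fox jumps over the lazy dog".split()
chunks = sliding_window_chunks(words, chunk_size=4, overlap=1)
# Each chunk repeats the last word of the previous one, so a sentence
# fragment at a boundary is never stranded without context.
```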
Generate and Store Embeddings with Metadata
Convert chunks into vectors using an embedding model (e.g., text-embedding-3-small). Store these in your vector database alongside metadata like 'source_url', 'page_number', and 'last_updated'. Metadata is critical for post-retrieval filtering and source attribution.
import openai
def get_embedding(text, model="text-embedding-3-small"):
    return openai.embeddings.create(input=[text], model=model).data[0].embedding

# Batch upsert to vector DB
vectors = [
    {"id": f"vec_{i}", "values": get_embedding(chunk), "metadata": {"text": chunk, "source": "doc_1"}}
    for i, chunk in enumerate(chunks)
]
index.upsert(vectors)

⚠ Common Pitfalls
- Ignoring rate limits during bulk embedding generation.
- Storing large text chunks in the vector DB without an external storage reference, increasing DB costs.
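The rate-limit pitfall can be handled with a batching helper that retries with exponential backoff. This is a sketch: `embed_fn` is assumed to wrap your real embedding API call (e.g. a function that passes a list of strings to the OpenAI client and returns a list of vectors):

```python
import time

def embed_in_batches(texts, embed_fn, batch_size=100, max_retries=5, base_delay=1.0):
    """Embed texts in batches, retrying each failed batch with
    exponential backoff (base_delay, 2x base_delay, 4x, ...)."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_fn(batch))
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # give up after max_retries attempts
                time.sleep(base_delay * 2 ** attempt)
    return vectors
```

Batching also avoids one HTTP round trip per chunk, which matters when embedding thousands of documents.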
Implement Hybrid Search (Vector + Keyword)
Pure vector search often fails on exact tokens such as acronyms or product IDs. Implement hybrid search by combining cosine similarity over dense vectors with BM25 keyword scores, fused via reciprocal rank fusion (RRF), to improve retrieval recall.
# Illustrative hybrid query; exact parameters vary by database
# (e.g. Weaviate exposes an `alpha` weight, Pinecone takes a separate sparse vector)
results = index.query(
    vector=query_embedding,
    top_k=20,
    include_metadata=True,
    filter={"category": {"$eq": "documentation"}},
    hybrid_search={"alpha": 0.5},  # balance between keyword and vector
)

⚠ Common Pitfalls
- Setting alpha to 1.0 (pure vector) or 0.0 (pure keyword) and missing relevant context.
- Failing to normalize scores between different search algorithms.
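RRF itself sidesteps the score-normalization pitfall entirely, because it fuses on rank positions rather than raw scores. It is simple enough to sketch in a few lines (`k=60` is the commonly used smoothing constant):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc IDs into one.
    Each list in `rankings` is ordered best-first; a document's fused
    score is the sum of 1 / (k + rank) over every list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from the two retrievers:
vector_hits = ["doc3", "doc1", "doc7"]
bm25_hits = ["doc1", "doc9", "doc3"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
# doc1 wins: it ranks highly in both lists, even though neither
# retriever placed it first on its own.
```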
Integrate a Reranking Step
Retrieving the top 20 results via vector search provides high recall but low precision. Use a Cross-Encoder reranker (like Cohere Rerank) to score the relevance of the retrieved chunks against the query, passing only the top 5 most relevant chunks to the LLM.
import cohere
co = cohere.Client('YOUR_API_KEY')
results = co.rerank(
    query=user_query,
    documents=retrieved_chunks,
    top_n=5,
    model='rerank-english-v3.0',
)

⚠ Common Pitfalls
- Reranking too many documents (e.g., >100), which significantly increases total request latency.
- Using a reranker that was trained on a vastly different domain than your data.
Construct the Context-Injected Prompt
Feed the reranked chunks into the LLM prompt. Use a system message that explicitly instructs the model to use ONLY the provided context and to state 'I don't know' if the answer isn't present to prevent hallucinations.
System: You are a technical assistant. Use the following context to answer the question.
Context: {context_chunks}
Question: {user_query}
Rules:
1. Only use the provided context.
2. If the answer is not in the context, say 'I do not have enough information'.
3. Include source citations in [1], [2] format.

⚠ Common Pitfalls
- Exceeding the LLM's context window by including too many or too large chunks.
- Vague system prompts that allow the LLM to use its general training data instead of the retrieved context.
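Putting the template together, a minimal prompt builder might look like this. It is a sketch: the character budget (`max_context_chars`, an illustrative parameter) is a crude proxy for real token counting, which you would do with your model's tokenizer:

```python
def build_prompt(context_chunks, user_query, max_context_chars=8000):
    """Assemble a context-injected prompt with numbered citations,
    dropping trailing chunks that would exceed the context budget."""
    parts, used = [], 0
    for i, chunk in enumerate(context_chunks, start=1):
        entry = f"[{i}] {chunk}"
        if used + len(entry) > max_context_chars:
            break  # guard against exceeding the context window
        parts.append(entry)
        used += len(entry)
    context = "\n\n".join(parts)
    return (
        "You are a technical assistant. Use the following context to answer the question.\n"
        f"Context:\n{context}\n"
        f"Question: {user_query}\n"
        "Rules:\n"
        "1. Only use the provided context.\n"
        "2. If the answer is not in the context, say 'I do not have enough information'.\n"
        "3. Include source citations in [1], [2] format."
    )
```

Numbering the chunks as they are injected is what makes rule 3 enforceable: the model can cite `[2]` and you can map it back to a source via the chunk's metadata.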
What you built
A successful RAG implementation moves beyond simple semantic search by incorporating robust chunking, hybrid retrieval, and reranking. To maintain performance, monitor your 'hit rate' (how often the correct answer is in the retrieved chunks) and 'faithfulness' (how often the LLM answer is supported by the context).
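Hit rate is straightforward to measure with a small evaluation harness. In this sketch, `retrieve_fn` is a hypothetical wrapper around your retrieval pipeline, and the eval set pairs each query with the ID of the chunk known to contain its answer:

```python
def hit_rate(eval_set, retrieve_fn, top_k=5):
    """Fraction of queries whose gold chunk appears in the top-k results.
    eval_set: list of (query, gold_chunk_id) pairs.
    retrieve_fn(query, top_k) -> list of retrieved chunk IDs."""
    if not eval_set:
        return 0.0
    hits = sum(
        gold_id in retrieve_fn(query, top_k)
        for query, gold_id in eval_set
    )
    return hits / len(eval_set)
```

Tracking this number before and after each change (chunk size, alpha, reranker model) turns retrieval tuning from guesswork into measurement; faithfulness additionally requires an LLM-as-judge or manual review, since it compares the generated answer against the retrieved context.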