Building Hybrid search (keyword + vector) with Pinecone a...

This guide provides a structured approach to implementing RAG systems, focusing on production considerations for developers working with knowledge bases, enterprise data, and AI-powered search tools. Each step includes actionable implementation details and mitigation strategies for common challenges.

2-3 hours · 6 steps
1

Choose and configure vector database

Select a vector database that matches your scale and latency requirements. Configure indexes with appropriate dimensionality and metric type (e.g., cosine similarity).

setup_database.py
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key='YOUR_API_KEY')  # v3+ client; pinecone.init() is deprecated
pc.create_index(name='rag-index', dimension=768, metric='cosine',
                spec=ServerlessSpec(cloud='aws', region='us-east-1'))

⚠ Common Pitfalls

  • Using incorrect dimensionality for embedding model
  • Neglecting to set proper index metric type
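The dimensionality pitfall is easiest to catch with a sanity check before upserting. This is a minimal, dependency-free sketch; `INDEX_DIMENSION` and `check_dimension` are illustrative names, not Pinecone APIs:

```python
INDEX_DIMENSION = 768  # must equal the dimension configured on the index

def check_dimension(embedding, expected=INDEX_DIMENSION):
    """Fail fast with a clear message instead of an opaque upsert error."""
    if len(embedding) != expected:
        raise ValueError(
            f'embedding has {len(embedding)} dims, index expects {expected}'
        )
    return True
```

Running this on every batch before upsert costs almost nothing and turns a confusing server-side rejection into an immediate, descriptive error.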
2

Implement document preprocessing pipeline

Split documents into chunks of roughly 512-1024 tokens with 20-30% overlap. LangChain's RecursiveCharacterTextSplitter gives consistent results; note that it counts characters by default, so pass a token-based length_function if you need token-accurate chunk sizes.

document_splitter.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=128)  # ~25% overlap

⚠ Common Pitfalls

  • Using fixed chunk sizes without content analysis
  • Ignoring semantic boundaries in text splitting
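The overlap mechanics above can be sketched without any dependencies. `chunk_with_overlap` below is a hypothetical helper showing how a sliding-window overlap works on an already-tokenized document:

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=128):
    """Sliding window: each chunk shares `overlap` tokens with the previous one."""
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```

Each chunk repeats the tail of its predecessor, so a sentence that straddles a boundary still appears whole in at least one chunk, which is the point of overlap in the first place.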
3

Generate embeddings with cost controls

Use efficient embedding models (e.g., text-embedding-ada-002) and implement batch processing. Monitor API usage limits and implement rate limiting.

generate_embeddings.py
from openai import OpenAI
client = OpenAI()  # v1+ SDK; openai.Embedding.create() is deprecated
resp = client.embeddings.create(model='text-embedding-ada-002', input=texts)
embeddings = [d.embedding for d in resp.data]

⚠ Common Pitfalls

  • Not using context-aware embedding models
  • Failing to implement batch processing for large datasets
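Batching and crude rate limiting are provider-agnostic, so they can be sketched independently of any SDK. `embed_in_batches` below is a hypothetical helper that wraps whatever embedding call you use:

```python
import time

def embed_in_batches(texts, embed_fn, batch_size=100, delay_s=0.0):
    """Call embed_fn on fixed-size slices of texts; sleep between calls if asked."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[i:i + batch_size]))
        if delay_s:
            time.sleep(delay_s)
    return vectors
```

A fixed `delay_s` is the simplest rate limit; for production traffic you would typically replace it with exponential backoff on 429 responses.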
4

Implement hybrid search strategy

Combine keyword matching (BM25) with vector similarity search. Use Weaviate's hybrid search or Pinecone's sparse-dense (hybrid) queries for multi-criteria retrieval; metadata filters alone are exact-match only and do not score keyword relevance.

hybrid_search.py
query = 'system architecture'
# sparse_idx/sparse_vals: a BM25-style sparse encoding of the query
# (Pinecone metadata filters have no '$contains' operator)
results = index.query(vector=embedding, top_k=10, include_metadata=True,
                      sparse_vector={'indices': sparse_idx, 'values': sparse_vals})

⚠ Common Pitfalls

  • Over-relying on vector similarity without keyword filters
  • Ignoring case sensitivity in text filters
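When BM25 and vector search return separate ranked lists, their raw scores are not directly comparable. Reciprocal rank fusion (RRF) sidesteps that by combining ranks instead of scores; a minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked doc-id lists; k=60 is the commonly used damping constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks matter, RRF needs no score normalization and works even when one retriever's scores are cosine similarities and the other's are BM25 term weights.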
5

Integrate with LLM for answer generation

Use LangChain's RetrievalQA chain (the successor to the deprecated VectorDBQAChain) or LlamaIndex's QueryEngine to combine retrieved documents with LLM prompts. Apply strict prompt engineering to reduce hallucinations.

llm_integration.py
from langchain.chains import RetrievalQA
# RetrievalQA supersedes the deprecated VectorDBQAChain
chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())

⚠ Common Pitfalls

  • Not including retrieved documents in LLM prompt
  • Using default LLM parameters without fine-tuning
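The "retrieved documents not in the prompt" pitfall comes down to prompt construction. `build_grounded_prompt` below is an illustrative template, not a LangChain API; the exact instruction wording is an assumption you should tune for your use case:

```python
def build_grounded_prompt(question, docs):
    """Inline retrieved chunks and instruct the model to answer only from them."""
    context = '\n\n'.join(f'[{i + 1}] {doc}' for i, doc in enumerate(docs))
    return ('Answer the question using ONLY the context below. '
            'If the context is insufficient, say so.\n\n'
            f'Context:\n{context}\n\nQuestion: {question}\nAnswer:')
```

Numbering the chunks also lets you ask the model to cite sources as `[1]`, `[2]`, which makes hallucinated answers easier to spot in evaluation.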
6

Implement reranking for precision

Use a cross-encoder model (e.g., a BERT-based model fine-tuned on MS MARCO) to re-rank retrieved documents. Prioritize relevance over raw similarity scores.

rerank_documents.py
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = cross_encoder.predict([(query, doc) for doc in retrieved_docs])

⚠ Common Pitfalls

  • Using dense models without proper hardware acceleration
  • Not normalizing reranking scores for comparison
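Cross-encoder outputs are unbounded logits, so comparing them across queries or fusing them with other scores calls for normalization. Softmax is one reasonable choice (min-max scaling is another); a minimal sketch:

```python
import math

def softmax_normalize(scores):
    """Map raw logits to a 0-1 distribution; subtract max for numeric stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Subtracting the maximum before exponentiating avoids overflow on large logits without changing the result, a standard softmax trick.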

What you built

By following this implementation sequence, developers can build reliable RAG systems that balance accuracy, cost, and latency. Focus on iterative testing of chunk sizes, embedding strategies, and reranking approaches to optimize for specific use cases.