Building Hybrid Search (Keyword + Vector) with Pinecone
This guide provides a structured approach to implementing retrieval-augmented generation (RAG) systems, focusing on production considerations for developers working with knowledge bases, enterprise data, and AI-powered search tools. Each step includes actionable implementation details and mitigation strategies for common challenges.
Choose and configure vector database
Select a vector database that matches your scale and latency requirements. Configure indexes with appropriate dimensionality and metric type (e.g., cosine similarity).
import pinecone
pinecone.init(api_key='YOUR_API_KEY', environment='YOUR_ENVIRONMENT')  # legacy pinecone-client (pre-3.0) interface
# dimension must match the embedding model: text-embedding-ada-002 returns 1536-dimensional vectors
pinecone.create_index(name='rag-index', dimension=1536, metric='cosine')
⚠ Common Pitfalls
- Using incorrect dimensionality for the embedding model
- Neglecting to set the proper index metric type
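With the index created, document chunks can be upserted alongside their source text as metadata, so matches can be traced back to their documents at query time. A minimal sketch, assuming a hypothetical embed() helper that returns a 1536-dimensional vector and a chunks list of strings:

index = pinecone.Index('rag-index')
# each record is (id, vector, metadata); storing the raw text as metadata lets you reconstruct context later
vectors = [(f'chunk-{i}', embed(chunk), {'text': chunk}) for i, chunk in enumerate(chunks)]
index.upsert(vectors=vectors)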
Implement document preprocessing pipeline
Split documents into chunks of 512-1024 tokens with 20-30% overlap. Use LangChain's RecursiveCharacterTextSplitter for consistent results.
from langchain.text_splitter import RecursiveCharacterTextSplitter
# chunk_size and chunk_overlap count characters by default; 128/512 matches the ~25% overlap recommended above
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=128)
⚠ Common Pitfalls
- Using fixed chunk sizes without content analysis
- Ignoring semantic boundaries in text splitting
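To measure chunk sizes in tokens rather than characters (matching the 512-1024-token guidance above), LangChain can build the same splitter around a tiktoken encoding. A sketch, assuming the cl100k_base encoding used by OpenAI's recent models:

# token-based splitting: chunk_size and chunk_overlap are now counted in tokens
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base', chunk_size=512, chunk_overlap=128)
chunks = splitter.split_text(document_text)  # document_text: your raw document string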
Generate embeddings with cost controls
Use efficient embedding models (e.g., text-embedding-ada-002), process documents in batches, and monitor API usage against rate limits.
import openai
response = openai.Embedding.create(input=text, model='text-embedding-ada-002')  # openai<1.0 interface
embedding = response['data'][0]['embedding']  # the vector itself, not the whole 'data' list
⚠ Common Pitfalls
- Not using context-aware embedding models
- Failing to implement batch processing for large datasets
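Batching amortizes per-request overhead, since the endpoint accepts a list of inputs per call. A sketch with a fixed pause as a crude stand-in for proper rate limiting (the batch size and pause values are illustrative):

import time

def embed_batches(texts, batch_size=100, pause=1.0):
    vectors = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]  # one API call embeds the whole batch
        response = openai.Embedding.create(input=batch, model='text-embedding-ada-002')
        vectors.extend(item['embedding'] for item in response['data'])
        time.sleep(pause)  # crude throttle; swap in exponential backoff for production
    return vectors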
Implement hybrid search strategy
Combine keyword matching (BM25) with vector similarity search. Weaviate ships hybrid search natively; with Pinecone, approximate it through metadata filters or implement it fully with sparse-dense queries (see the sketch after the pitfalls below).
query = 'system architecture'
# $contains is not a valid Pinecone operator; $in matches against a 'keywords' metadata list assumed stored at upsert
results = index.query(vector=embedding, filter={'keywords': {'$in': query.split()}}, top_k=10)
⚠ Common Pitfalls
- Over-relying on vector similarity without keyword filters
- Ignoring case sensitivity in text filters
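For true hybrid scoring on Pinecone, each query (and each upserted document) carries both a dense vector and a sparse BM25 vector. A sketch using the separate pinecone-text package, assuming the index was created with the dotproduct metric (required for sparse-dense queries) and documents were upserted with matching sparse vectors:

from pinecone_text.sparse import BM25Encoder

bm25 = BM25Encoder()
bm25.fit(corpus)  # corpus: your chunk texts, used to learn term statistics
results = index.query(
    vector=embedding,                           # dense similarity signal
    sparse_vector=bm25.encode_queries(query),   # keyword (BM25) signal
    top_k=10,
    include_metadata=True,
)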
Integrate with LLM for answer generation
Use LangChain's RetrievalQA chain (the successor to the deprecated VectorDBQA) or LlamaIndex's QueryEngine to combine retrieved documents with LLM prompts. Apply strict prompt engineering to reduce hallucinations.
from langchain.chains import RetrievalQA  # successor to the deprecated VectorDBQA chain
# vectorstore: a LangChain vector store wrapping the Pinecone index
chain = RetrievalQA.from_chain_type(llm, retriever=vectorstore.as_retriever())
⚠ Common Pitfalls
- Not including retrieved documents in the LLM prompt
- Using default LLM parameters without fine-tuning
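Strict prompt engineering here can be as simple as pinning the model to the retrieved context and giving it an explicit way out. A sketch; the prompt wording is illustrative, not canonical:

from langchain.prompts import PromptTemplate

strict_prompt = PromptTemplate(
    input_variables=['context', 'question'],
    template=(
        'Answer the question using ONLY the context below. '
        'If the context does not contain the answer, reply "I don\'t know."\n\n'
        'Context:\n{context}\n\nQuestion: {question}\nAnswer:'
    ),
)
chain = RetrievalQA.from_chain_type(
    llm, chain_type='stuff', retriever=vectorstore.as_retriever(),
    chain_type_kwargs={'prompt': strict_prompt},  # the stuff chain accepts a custom prompt this way
)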
Implement reranking for precision
Use a cross-encoder model (e.g., a BERT-family model fine-tuned on MS MARCO) to re-rank retrieved documents, prioritizing query-document relevance over raw similarity scores.
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = cross_encoder.predict([(query, doc) for doc in retrieved_docs])
⚠ Common Pitfalls
- Using dense models without proper hardware acceleration
- Not normalizing reranking scores for comparison
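Putting the scores to work means sorting candidates by them and, if they must be combined with other signals, normalizing first. A minimal sketch using min-max normalization (one reasonable choice among several; the top-5 cutoff is illustrative):

# sort documents by cross-encoder relevance, highest first
ranked = sorted(zip(retrieved_docs, scores), key=lambda pair: pair[1], reverse=True)
top_docs = [doc for doc, _ in ranked[:5]]

# min-max normalize scores into [0, 1] for comparison across queries or models
lo, hi = min(scores), max(scores)
normalized = [(s - lo) / (hi - lo) for s in scores] if hi > lo else [1.0] * len(scores)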
What you built
By following this implementation sequence, developers can build reliable RAG systems that balance accuracy, cost, and latency. Focus on iterative testing of chunk sizes, embedding strategies, and reranking approaches to optimize for specific use cases.