Building AI-Powered Search with open-source tools

This guide outlines the implementation of a production-ready hybrid search pipeline. You will combine vector similarity (semantic understanding) with traditional keyword matching (BM25) and apply a cross-encoder reranking layer to ensure high-precision results for complex queries.

4-6 hours · 5 steps

Step 1: Document Chunking and Normalization

Break large documents into smaller, semantically coherent chunks. Aim for 300-500 tokens per chunk with a 10-15% overlap. This ensures that the embedding model captures specific contexts and allows the search result to point to a specific section of a document rather than a 50-page PDF.

ingestion.py
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Note: chunk_size and chunk_overlap are measured in characters by default
# (len); to hit the 300-500 token target above, pass a token-based
# length_function (e.g. one built on tiktoken).
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)

chunks = text_splitter.split_text(raw_document_content)

⚠ Common Pitfalls

  • Chunking too aggressively can lose the surrounding context needed for accurate embeddings.
  • Failing to normalize whitespace and remove boilerplate (e.g., navbars) leads to noisy vectors.
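The normalization pitfall above can be handled with a small cleanup pass before chunking. This is a minimal sketch: the boilerplate patterns are illustrative placeholders, and real sources will need their own rules.

```python
import re

def normalize_text(raw: str) -> str:
    """Collapse whitespace runs and strip common boilerplate before chunking.

    The line patterns below are hypothetical examples; tune them to the
    navbars and footers of your own document sources.
    """
    # Drop lines that look like navigation or boilerplate.
    lines = [ln for ln in raw.splitlines()
             if not re.match(r"^\s*(Home|Login|Share|Copyright ©)", ln)]
    text = "\n".join(lines)
    # Collapse runs of spaces/tabs, and cap blank-line runs at one.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Run this once per document, then feed the result to the splitter, so every chunk inherits the same cleanup.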
Step 2: Dual-Indexing (Vector and Keyword)

Index the same content into two different stores: a vector database for semantic similarity and a traditional inverted index for exact keyword matching. Store the original text and unique IDs in both to facilitate merging later.

index_pipeline.js
// Indexing to Pinecone (Vector)
await pineconeIndex.upsert([
  { id: docId, values: embeddingVector, metadata: { text: chunkText } }
]);

// Indexing to Meilisearch (Keyword)
await meilisearchIndex.addDocuments([
  { id: docId, content: chunkText }
]);

⚠ Common Pitfalls

  • Inconsistent IDs between the vector and keyword stores make result merging impossible.
  • Ignoring metadata filters (e.g., tenant_id) in the vector store can lead to data leakage between users.
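The first pitfall above is easiest to avoid by deriving one deterministic ID per chunk and reusing it in both stores. A sketch, with a hypothetical helper name:

```python
import hashlib

def make_chunk_id(doc_id: str, chunk_index: int) -> str:
    """Deterministic per-chunk ID, safe to reuse across both indices.

    Hashing keeps IDs short and uniform even for long source paths, and
    re-running ingestion yields the same IDs, so upserts overwrite
    existing entries instead of duplicating them.
    """
    return hashlib.sha256(f"{doc_id}:{chunk_index}".encode()).hexdigest()[:16]
```

Pass the same `make_chunk_id(...)` value as `id` to both the vector upsert and the keyword `addDocuments` call so the fusion step can match results by ID.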
Step 3: Executing Hybrid Search with Reciprocal Rank Fusion (RRF)

Run search queries against both indices simultaneously. Use Reciprocal Rank Fusion (RRF) to combine the ranked lists. RRF allows you to merge results with different scoring scales (cosine similarity vs. BM25) by focusing on their relative rank.

search_logic.py
def calculate_rrf(rank, k=60):
    # Ranks are 1-based; k=60 is the standard constant from the original
    # RRF paper and rarely needs tuning.
    return 1.0 / (k + rank)

# Example: Result A is rank 1 in Vector, rank 5 in Keyword
score_a = calculate_rrf(1) + calculate_rrf(5)

⚠ Common Pitfalls

  • Using a simple weighted average of scores instead of RRF, which often fails because vector scores and BM25 scores are not on the same scale.
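The snippet above scores a single item; merging the two full result lists follows the same formula. A minimal sketch, assuming both lists contain the shared chunk IDs from Step 2:

```python
from collections import defaultdict

def rrf_merge(vector_ids, keyword_ids, k=60):
    """Fuse two ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each list is best-first; an ID appearing in both lists accumulates
    score from both. Returns IDs sorted by fused score, best first.
    """
    scores = defaultdict(float)
    for ranked in (vector_ids, keyword_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks matter, this works unchanged whether the inputs were scored by cosine similarity or BM25.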
Step 4: Implementing a Cross-Encoder Reranking Layer

Take the top 20-50 results from your hybrid search and pass them through a Cross-Encoder model (like Cohere Rerank or BGE-Reranker). Unlike embeddings, Cross-Encoders process the query and the document chunk together, providing a much more accurate relevance score.

rerank.js
const response = await cohere.rerank({
  query: userQuery,
  documents: hybridResults.map(r => r.text),
  topN: 10, // Node SDK uses camelCase topN (the Python SDK uses top_n)
  model: 'rerank-english-v3.0'
});

⚠ Common Pitfalls

  • Reranking too many documents (e.g., >100) significantly increases latency and API costs.
  • Reranking only the top 5 results provides little benefit; the goal is to find the 'needle' that was ranked 20th by the initial search.
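The candidate-window tradeoff described by both pitfalls can be expressed as a thin wrapper around whichever reranker you call. A sketch: `score_pairs` is a placeholder for a real cross-encoder call (Cohere Rerank, BGE-Reranker, etc.), not a specific API.

```python
def rerank_top_candidates(query, candidates, score_pairs, window=30, top_n=10):
    """Rerank only a bounded window of hybrid-search candidates.

    candidates: text chunks, best-first from the fused RRF ranking.
    score_pairs: callable (query, texts) -> list of relevance scores;
    stands in for a cross-encoder. The window bounds latency and API
    cost while staying deep enough to surface the 'needle' at rank 20.
    """
    window_docs = candidates[:window]
    scores = score_pairs(query, window_docs)
    ranked = sorted(zip(scores, window_docs), reverse=True)
    return ranked[:top_n]
```

Keeping `window` in the 20-50 range recommended above usually balances recall against reranker cost.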
Step 5: Search UI and Confidence Thresholding

Build the UI to handle varying levels of result confidence. Use the reranker's score to decide whether to show a 'Featured Snippet' or a list of results. If the top score is below a specific threshold (e.g., 0.3), display a 'No high-confidence matches found' message.

⚠ Common Pitfalls

  • Displaying low-score semantic matches as 'exact matches' confuses users when the content is irrelevant.
  • Blocking the UI thread while waiting for the reranker; use skeleton loaders to maintain perceived performance.
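The thresholding decision described above fits in a few lines. A sketch: the 0.3 floor comes from the guide text, while the snippet threshold and mode names are assumed placeholders to tune against your own relevance data.

```python
def choose_display_mode(reranked, snippet_threshold=0.7, floor=0.3):
    """Pick a UI mode from reranker output (hypothetical mode names).

    reranked: list of (score, chunk) tuples, best first. Scores below
    the floor yield a 'no confident match' state instead of misleading
    low-relevance results.
    """
    if not reranked or reranked[0][0] < floor:
        return "no_confident_match"
    if reranked[0][0] >= snippet_threshold:
        return "featured_snippet"
    return "result_list"
```

The UI can then branch on the returned mode while the reranker call itself runs asynchronously behind a skeleton loader.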

What you built

By combining vector search for meaning, keyword search for precision, and a reranker for final validation, you create a search experience that handles both natural language questions and technical jargon. Monitor your RRF scores and reranker outputs to tune your chunking strategy over time.