Building Embeddings & Vector Search with Open-Source Tools
This guide covers the implementation of a production-ready semantic search system, focusing on the transition from traditional keyword search to a hybrid vector-based approach using pgvector and OpenAI's embedding models.
Select and Benchmark Embedding Model
Choose a model based on retrieval quality, dimensionality, and cost. Higher-dimensional models (e.g., text-embedding-3-large at 3072 dimensions) capture finer semantic distinctions but increase storage costs and query latency. For most use cases, text-embedding-3-small (1536 dimensions) provides an optimal balance.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_embedding(text, model="text-embedding-3-small"):
    # Newlines can degrade embedding quality for some models
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding
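Before committing to a model, it helps to spot-check that it separates related from unrelated text. The following is a minimal sketch using the get_embedding helper above; the sample strings are illustrative placeholders, not a real benchmark set:

import numpy as np

def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A related pair should score noticeably higher than an unrelated one
query = get_embedding("how do I reset my password?")
related = get_embedding("Steps to recover account credentials")
unrelated = get_embedding("Quarterly revenue grew 12% year over year")
print(cosine_sim(query, related), cosine_sim(query, unrelated))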
⚠ Common Pitfalls
- Ignoring token limits (8191 tokens for OpenAI embedding models); a guard is sketched below.
- Mixing different models in the same vector space, which yields meaningless similarity scores.
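To enforce the token limit up front, you can count tokens with tiktoken before calling the API. A minimal sketch, assuming the tiktoken package is installed; check_token_limit is an illustrative helper, not part of any SDK:

import tiktoken

enc = tiktoken.encoding_for_model("text-embedding-3-small")

def check_token_limit(text, limit=8191):
    # The embeddings API rejects inputs over the model's token limit
    n_tokens = len(enc.encode(text))
    if n_tokens > limit:
        raise ValueError(f"Input is {n_tokens} tokens; the limit is {limit}. Chunk it first.")
    return n_tokens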
Configure Database Schema with Vector Indices
Define a table to store both the original content and its vector representation. Use an HNSW (Hierarchical Navigable Small World) index: it delivers faster, higher-recall approximate nearest neighbor (ANN) searches than IVFFlat, at the cost of slower index builds, especially for datasets exceeding 100k rows.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    content text NOT NULL,
    metadata jsonb,
    embedding vector(1536) -- matches text-embedding-3-small dimensions
);

CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
⚠ Common Pitfalls
- Indexing before the initial data load; bulk inserts are much slower once the index exists, so create it after loading.
- Choosing the wrong distance function; cosine distance (vector_cosine_ops, the <=> operator) is standard for text embeddings, as the query sketch below shows.
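For reference before moving on, a plain nearest-neighbor query from Python might look like the sketch below. It assumes a psycopg2 connection conn and the get_embedding helper from earlier; hnsw.ef_search is pgvector's query-time knob trading recall for latency (default 40):

def search_nearest(conn, query_text, limit=5):
    # pgvector accepts vectors as text literals like '[0.1,0.2,...]'
    vec = "[" + ",".join(map(str, get_embedding(query_text))) + "]"
    with conn.cursor() as cur:
        # Higher ef_search improves recall at the cost of latency
        cur.execute("SET hnsw.ef_search = 100;")
        cur.execute(
            "SELECT id, content FROM documents ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec, limit),
        )
        return cur.fetchall()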
Implement Semantic Chunking Strategy
Large documents must be split into chunks to preserve local context and stay within model limits. Use a recursive character splitter with overlap (e.g., 500-character chunks with a 50-character overlap) so that context carries across chunk boundaries and splits land on natural separators rather than mid-sentence.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    # Tried in order: paragraph breaks, line breaks, sentences, words
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = text_splitter.split_text(source_document)  # source_document: raw document text
⚠ Common Pitfalls
- Chunk sizes that are too small (losing context) or too large (averaging out the vector's meaning).
Execute Batch Upsert Pipeline
To minimize network overhead and API latency, batch your embedding requests and database inserts. Process documents in batches of 50-100 to avoid hitting rate limits or database transaction timeouts.
from psycopg2.extras import execute_values

def batch_upsert(conn, texts, batch_size=50):
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # One embeddings request per batch instead of one API call per text
        resp = client.embeddings.create(input=batch, model="text-embedding-3-small")
        rows = [(t, "[" + ",".join(map(str, e.embedding)) + "]")
                for t, e in zip(batch, resp.data)]
        with conn.cursor() as cur:
            # Single multi-row INSERT per batch keeps round-trips and transactions short
            execute_values(cur, "INSERT INTO documents (content, embedding) VALUES %s",
                           rows, template="(%s, %s::vector)")
        conn.commit()
⚠ Common Pitfalls
- Failing to handle 429 rate-limit errors from embedding providers; a retry sketch follows below.
- Not storing the original text alongside the vector, which forces a second lookup at retrieval time.
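A common mitigation for 429s is exponential backoff with jitter around the batch call. A minimal sketch, assuming the v1 OpenAI Python SDK (where openai.RateLimitError is raised on a 429); embed_with_retry is an illustrative helper:

import random
import time

import openai

def embed_with_retry(batch, model="text-embedding-3-small", max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.embeddings.create(input=batch, model=model).data
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Exponential backoff with jitter spreads out retry bursts
            time.sleep(2 ** attempt + random.random())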
Implement Hybrid Search with RRF
Pure vector search can fail on specific keywords or acronyms. Implement hybrid search by combining full-text search (BM25, or Postgres's tsvector ranking) with vector similarity, merging the two result lists with Reciprocal Rank Fusion (RRF) to improve relevance.
WITH semantic_search AS (
    SELECT id, rank() OVER (ORDER BY embedding <=> :query_vector) AS rank
    FROM documents
    ORDER BY embedding <=> :query_vector
    LIMIT 20
),
keyword_search AS (
    SELECT id, rank() OVER (ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC) AS rank
    FROM documents, plainto_tsquery('english', :query_text) query
    WHERE to_tsvector('english', content) @@ query
    ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC
    LIMIT 20
)
SELECT
    COALESCE(s.id, k.id) AS id,
    -- Reciprocal Rank Fusion with k = 60 and a penalty rank of 100 for misses
    (1.0 / (60 + COALESCE(s.rank, 100))) + (1.0 / (60 + COALESCE(k.rank, 100))) AS score
FROM semantic_search s
FULL OUTER JOIN keyword_search k ON s.id = k.id
ORDER BY score DESC;
⚠ Common Pitfalls
- Over-relying on vector search for exact matches like product SKUs or serial numbers.
- Ignoring the '60' constant in RRF: smaller values weight top-ranked results more heavily, while larger values flatten each list's influence (see the sketch below).
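To make the constant's effect concrete, here is a minimal application-side sketch of Reciprocal Rank Fusion (rrf_fuse is an illustrative helper; each ranking maps document IDs to 1-based ranks):

def rrf_fuse(rankings, k=60):
    # rankings: list of {doc_id: rank} dicts, one per retriever, where rank 1 is best
    scores = {}
    for ranking in rankings:
        for doc_id, rank in ranking.items():
            # Each list contributes 1 / (k + rank); larger k damps the top ranks
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# 'b' ranks well in both lists, so it wins the fused ordering
print(rrf_fuse([{"a": 1, "b": 2}, {"b": 1, "c": 2}]))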
Setup Data Sync and Re-indexing Triggers
Ensure the vector store stays in sync with the primary data source. Use database triggers or an application-level middleware to queue an embedding update whenever the 'content' column of your source table is modified.
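At the application level, one way to implement this is to hash the content and re-embed only on real changes. A minimal sketch, assuming a content_hash column added to the documents table (not part of the schema above) plus the client and connection from earlier:

import hashlib

def sync_document(conn, doc_id, new_content):
    new_hash = hashlib.sha256(new_content.encode()).hexdigest()
    with conn.cursor() as cur:
        # content_hash is an assumed extra column for change detection
        cur.execute("SELECT content_hash FROM documents WHERE id = %s", (doc_id,))
        row = cur.fetchone()
        if row and row[0] == new_hash:
            return  # content unchanged; skip the embedding call
        emb = client.embeddings.create(
            input=[new_content], model="text-embedding-3-small").data[0].embedding
        vec = "[" + ",".join(map(str, emb)) + "]"
        cur.execute(
            "UPDATE documents SET content = %s, content_hash = %s, embedding = %s::vector "
            "WHERE id = %s",
            (new_content, new_hash, vec, doc_id))
    conn.commit()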
⚠ Common Pitfalls
- Updating embeddings on every small metadata change, wasting API credits.
- Out-of-sync indices leading to 'hallucinated' search results, where the vector matches but the stored text has changed.
What you built
Successful embedding implementation requires balancing model precision with retrieval latency. By using a hybrid search approach and HNSW indexing in pgvector, you can achieve sub-100ms response times while maintaining high relevance across both semantic and keyword-based queries.