Building Databases for AI Apps with Open-Source Tools
This guide details the implementation of a production-grade database architecture for AI applications using PostgreSQL with the pgvector extension and Drizzle ORM. It focuses on the transition from prototype-level vector storage to a scalable, type-safe infrastructure capable of handling hybrid relational and semantic queries in serverless environments.
Enable Vector Extensions and Schema Initialization
Before storing embeddings, the pgvector extension must be explicitly enabled in your database. This step also involves setting up the initial database schema to house both structured metadata and high-dimensional vectors.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
content TEXT NOT NULL,
metadata JSONB,
embedding vector(1536), -- 1536 for OpenAI text-embedding-3-small
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

⚠ Common Pitfalls
- Specifying the wrong vector dimension (e.g., using 1536 for a model that outputs 768) will cause insertion failures.
- Forgetting to enable the extension in the 'public' schema or the specific schema used by your application.
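If you run this setup from application code rather than a SQL file, it can be scripted as part of your deploy. Here is a minimal sketch, assuming the postgres-js driver and a DATABASE_URL environment variable:

import postgres from 'postgres';

const sql = postgres(process.env.DATABASE_URL!);

// Idempotent: safe to run on every deploy
await sql`CREATE EXTENSION IF NOT EXISTS vector`;

// Verify the extension actually registered before the app starts serving
const rows = await sql`SELECT extversion FROM pg_extension WHERE extname = 'vector'`;
if (rows.length === 0) throw new Error('pgvector is not installed on this instance');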
Define Type-Safe Schema with Drizzle ORM
Map the database structure to your application code using Drizzle ORM. This ensures that vector operations are type-checked and that your metadata fields are correctly handled as JSONB.
import { pgTable, uuid, text, jsonb, customType, timestamp } from 'drizzle-orm/pg-core';
// Custom type for pgvector
const vector = customType<{ data: number[]; driverData: string }>({
dataType() { return 'vector(1536)'; },
toDriver(value: number[]) { return `[${value.join(',')}]`; },
fromDriver(value: string) { return value.replace(/[\[\]]/g, '').split(',').map(Number); }
});
export const documents = pgTable('documents', {
id: uuid('id').primaryKey().defaultRandom(),
content: text('content').notNull(),
metadata: jsonb('metadata'),
embedding: vector('embedding'),
createdAt: timestamp('created_at').defaultNow(),
});

⚠ Common Pitfalls
- Drizzle does not have native vector type support in all versions; custom types are required to handle the cast between JS arrays and Postgres vector strings.
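With the schema in place, inserts accept plain number[] arrays and the custom type handles serialization. A sketch of a write path, assuming the db instance configured in the next step and a hypothetical embed() helper that wraps your embedding model:

import { db } from './db';
import { documents } from './schema';

// Hypothetical helper: calls your embedding API and returns 1536 floats
declare const embed: (text: string) => Promise<number[]>;

export async function addDocument(content: string, userId: string) {
  const embedding = await embed(content);
  await db.insert(documents).values({
    content,
    metadata: { user_id: userId },
    embedding, // toDriver() converts the array to a '[...]' vector literal
  });
}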
Configure Connection Pooling for Serverless
AI applications often run on serverless functions (Vercel, AWS Lambda), which can quickly exhaust Postgres connection limits. Use a connection pooler like PgBouncer, or a serverless-native driver such as @neondatabase/serverless.
import { drizzle } from 'drizzle-orm/postgres-js';
import postgres from 'postgres';
const connectionString = process.env.DATABASE_URL!;
// Disable prefetch as it is often incompatible with serverless poolers
const client = postgres(connectionString, { prepare: false });
export const db = drizzle(client);

⚠ Common Pitfalls
- Using the default 'prepare: true' setting with external poolers like PgBouncer in transaction mode causes 'prepared statement already exists' errors.
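On Neon specifically, the HTTP driver sidesteps TCP connections entirely, which makes pool exhaustion a non-issue; a sketch using drizzle-orm's neon-http adapter:

import { neon } from '@neondatabase/serverless';
import { drizzle } from 'drizzle-orm/neon-http';

// Each query is a stateless HTTP request, so there is no pool to exhaust
const client = neon(process.env.DATABASE_URL!);
export const db = drizzle(client);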
Implement HNSW Indexing for Performance
IVFFlat indexes must be trained on existing rows, so they only become effective once the table already holds enough data. HNSW (Hierarchical Navigable Small World) indexes are generally preferred for AI apps: they build incrementally, need no training step, and typically offer higher recall with better query performance.
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

⚠ Common Pitfalls
- Creating HNSW indexes on large datasets is memory-intensive; ensure your database instance has sufficient RAM (and maintenance_work_mem) relative to the index size.
- Choosing 'vector_l2_ops' when your embedding model is optimized for cosine similarity; the index is only used when the query's distance operator matches the index's operator class.
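Recall can also be tuned at query time without rebuilding the index: pgvector exposes hnsw.ef_search (default 40), which controls how many candidates each query examines. A sketch setting it for the current session via Drizzle:

import { sql } from 'drizzle-orm';
import { db } from './db';

// Higher ef_search = better recall, slower queries; applies to this session only
await db.execute(sql`SET hnsw.ef_search = 100`);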
Execute Hybrid Search Queries
Combine semantic search with relational filters. This pattern allows you to find relevant content while restricting results by user ID, date ranges, or categories stored in the metadata.
import { db } from './db';
import { documents } from './schema';
import { sql, and } from 'drizzle-orm';
async function searchDocuments(queryEmbedding: number[], userId: string) {
  // pgvector expects a '[1,2,3]'-style literal; cast the text parameter to vector
  const embedding = JSON.stringify(queryEmbedding);
  return await db.select({
    id: documents.id,
    content: documents.content,
    similarity: sql<number>`1 - (${documents.embedding} <=> ${embedding}::vector)`
  })
  .from(documents)
  .where(and(
    sql`${documents.metadata}->>'user_id' = ${userId}`,
    sql`${documents.embedding} <=> ${embedding}::vector < 0.5` // Distance threshold
  ))
  .orderBy(sql`${documents.embedding} <=> ${embedding}::vector`)
  .limit(5);
}

⚠ Common Pitfalls
- Stringifying large embedding arrays inside SQL templates can lead to query size limits; use parameterized inputs or binary formats where supported.
- Distance operators (<=> for cosine, <-> for L2) return distance, not similarity; subtract cosine distance from 1 to get cosine similarity.
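Calling the function is then straightforward. This usage sketch assumes a hypothetical embedQuery() helper that embeds the user's question with the same model used at write time:

// Hypothetical helper: must use the same embedding model as ingestion
declare const embedQuery: (text: string) => Promise<number[]>;

const queryEmbedding = await embedQuery('How do I rotate my API keys?');
const results = await searchDocuments(queryEmbedding, 'user_123');
// Rows arrive ordered by ascending distance, i.e. descending similarity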
What you built
You have implemented a robust database foundation for an AI application. By using PostgreSQL with pgvector, you avoid the complexity of managing a separate vector database while maintaining high-performance semantic search. For next steps, monitor index bloat and consider partitioning your tables if your document count exceeds 10 million records.
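As a starting point for that monitoring, you can track the index's on-disk footprint over time. A sketch assuming the auto-generated index name documents_embedding_idx (check pg_indexes for the actual name in your database):

import { sql } from 'drizzle-orm';
import { db } from './db';

// Compare index size across deploys; sustained growth after heavy updates
// suggests bloat and a candidate for REINDEX CONCURRENTLY
const result = await db.execute(
  sql`SELECT pg_size_pretty(pg_relation_size('documents_embedding_idx')) AS index_size`
);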