
Building a RAG System from Scratch with Next.js: Vectors, Chunking, and Real-World Retrieval
Learn how to build a production-grade Retrieval-Augmented Generation (RAG) system using Next.js App Router and TypeScript. This deep dive covers embedding models, chunking strategies, pgvector and Pinecone setup, semantic search, hybrid retrieval, and streaming LLM responses — everything you need to ship a real-world RAG app.
What Is RAG and Why Should Frontend Engineers Care?
Retrieval-Augmented Generation (RAG) is the architecture that closes the gap between a static large language model and your actual data. Instead of fine-tuning an expensive model on your documents, RAG lets you dynamically inject relevant context at inference time — retrieving the right chunks of text, stuffing them into the prompt, and letting the LLM reason on top of fresh, private information.
As a Next.js developer, you are in a uniquely powerful position: you already control the API layer, the database, and the UI. This tutorial walks you through building a production-grade RAG pipeline entirely within the Next.js App Router ecosystem — from ingesting documents to serving semantic search results to a chat interface.
The RAG Architecture at a Glance
Before writing a single line of code, understand the two distinct phases:
- Ingestion (offline): Load documents → chunk them → embed each chunk → store vectors in a vector database.
- Retrieval + Generation (online): Embed the user query → find the nearest chunks → build a prompt → stream the LLM response.
These phases are independent. You might ingest documents once a day via a cron job and serve thousands of queries per minute in real time. Keeping them decoupled is critical for scalability.
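To make the split between the two phases concrete, here is a toy in-memory sketch. It is an illustration only: toyEmbed is a hypothetical stand-in for a real embedding model (it just counts words from a tiny hand-picked vocabulary), and InMemoryStore stands in for the vector database.

```typescript
// Toy illustration of the two RAG phases. Nothing here calls a real model or database.
type StoredChunk = { text: string; vector: number[] };

// Hypothetical stand-in for an embedding model: counts words from a tiny fixed vocabulary.
const VOCAB = ['invoice', 'refund', 'password', 'login', 'shipping'];

function toyEmbed(text: string): number[] {
  const words = text.toLowerCase().split(/\W+/);
  return VOCAB.map((term) => words.filter((w) => w === term).length);
}

// Dot product is enough to rank these toy count vectors.
function dotProduct(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

class InMemoryStore {
  private chunks: StoredChunk[] = [];

  // Phase 1 (offline): embed each chunk and store it.
  ingest(texts: string[]): void {
    for (const text of texts) {
      this.chunks.push({ text, vector: toyEmbed(text) });
    }
  }

  // Phase 2 (online): embed the query and return the k best-scoring chunks.
  retrieve(query: string, k: number): string[] {
    const queryVector = toyEmbed(query);
    return this.chunks
      .map((c) => ({ text: c.text, score: dotProduct(queryVector, c.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, k)
      .map((c) => c.text);
  }
}

const store = new InMemoryStore();
store.ingest([
  'To reset your password, open the login page and click "Forgot password".',
  'Refund requests are processed within five days of receiving the invoice.',
  'Shipping takes three to five business days.',
]);
const top = store.retrieve('How do I get a refund for my invoice?', 1);
```

Note that ingest and retrieve share nothing except the stored vectors, which is why the two phases can run on different schedules and scale separately.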
Choosing Your Embedding Model
An embedding model converts text into a dense numerical vector that captures semantic meaning. Similar sentences end up close together in vector space; dissimilar ones are far apart. Your choice of model affects quality, cost, and latency.
- OpenAI text-embedding-3-small: 1536 dimensions, cheap (~$0.02/1M tokens), great baseline. Use this unless you have a specific reason not to.
- OpenAI text-embedding-3-large: 3072 dimensions, higher accuracy for long-tail queries, at roughly 6.5× the cost (~$0.13/1M tokens).
- Cohere embed-english-v3.0: strong English retrieval (use embed-multilingual-v3.0 for multilingual content), returns float or compact int8 vectors.
- Local models via Ollama (nomic-embed-text): zero egress cost, runs on-prem, ideal for sensitive data.
For this tutorial we will use OpenAI's text-embedding-3-small. Install the SDK:
npm install openai @ai-sdk/openai ai
Chunking Strategies: The Overlooked Bottleneck
Chunking is where most RAG implementations quietly fail. If chunks are too large, you pay more per embedding and retrieve noisy context. If they are too small, you lose surrounding context that the LLM needs to answer accurately.
Fixed-Size Chunking with Overlap
The simplest strategy: split every N tokens, with an M-token overlap between consecutive chunks, so that a sentence cut at one chunk boundary usually still appears intact in the neighboring chunk.
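// lib/chunker.ts
export interface Chunk {
  text: string;
  index: number;
  metadata: Record<string, unknown>;
}
export function chunkByTokens(
  text: string,
  chunkSize = 512,
  overlap = 64
): Chunk[] {
  // Without this guard, overlap >= chunkSize would make the loop below never advance.
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError('Expected chunkSize > 0 and 0 <= overlap < chunkSize');
  }
  // Approximate: 1 token ≈ 4 characters for English text
  const charChunk = chunkSize * 4;
  const charOverlap = overlap * 4;
  const chunks: Chunk[] = [];
  let start = 0;
  let index = 0;
  while (start < text.length) {
    const end = Math.min(start + charChunk, text.length);
    chunks.push({ text: text.slice(start, end), index: index++, metadata: {} });
    // Stop once the end of the text is reached; otherwise the last chunk would be
    // a tail that the previous chunk already contains in full.
    if (end === text.length) break;
    start += charChunk - charOverlap;
  }
  return chunks;
}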
Recursive Character Splitting
A smarter approach: try to split on paragraph breaks first (\n\n), then line breaks (\n), then spaces between words, and fall back to individual characters only as a last resort. This preserves semantic units far better than raw character counts. The LangChain RecursiveCharacterTextSplitter implements this logic, or you can roll your own in ~30 lines.
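A hand-rolled version might look like the sketch below. It is not LangChain's implementation: it measures size in characters rather than tokens, adds no overlap, and the name recursiveSplit is made up for this example.

```typescript
// Sketch of recursive character splitting. Sizes are in characters, not tokens.
const SEPARATORS = ['\n\n', '\n', ' ', ''];

function recursiveSplit(
  text: string,
  maxChars: number,
  separators: string[] = SEPARATORS
): string[] {
  if (maxChars < 1) throw new RangeError('maxChars must be at least 1');
  if (text.length <= maxChars) return text.length > 0 ? [text] : [];

  const [separator, ...rest] = separators;
  // Last resort (empty separator, or none left): hard-cut every maxChars characters.
  if (separator === undefined || separator === '') {
    const pieces: string[] = [];
    for (let i = 0; i < text.length; i += maxChars) {
      pieces.push(text.slice(i, i + maxChars));
    }
    return pieces;
  }

  const chunks: string[] = [];
  let current = '';
  for (const part of text.split(separator)) {
    const candidate = current === '' ? part : current + separator + part;
    if (candidate.length <= maxChars) {
      current = candidate; // keep merging small parts into one chunk
      continue;
    }
    if (current !== '') chunks.push(current);
    if (part.length <= maxChars) {
      current = part;
    } else {
      // This part alone is too big: split it again with the next, finer separator.
      chunks.push(...recursiveSplit(part, maxChars, rest));
      current = '';
    }
  }
  if (current !== '') chunks.push(current);
  return chunks;
}
```

With a 50-character limit, three short paragraphs separated by blank lines come back as three chunks, one per paragraph, while a single 25-character word with a 10-character limit falls through every separator and is hard-cut into pieces of 10, 10, and 5.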
Document-Aware Chunking
For structured content (Markdown, HTML, code), use document-aware splitters that understand headers and code blocks. Keep an entire function body together rather than splitting it mid-loop. Libraries like llm-chunk or @langchain/textsplitters offer Markdown-aware splitters out of the box.
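As a rough illustration of what "document-aware" means for Markdown, the sketch below splits on headings but ignores lines that merely look like headings inside fenced code blocks. It does not use either library mentioned above; splitMarkdownByHeading and the Section type are hypothetical names, and only triple-backtick fences are recognized.

```typescript
type Section = { heading: string; body: string };

// Built with repeat() so this example can itself sit inside a fenced block.
const FENCE = '`'.repeat(3);

function splitMarkdownByHeading(markdown: string): Section[] {
  const sections: Section[] = [];
  let heading = '';
  let bodyLines: string[] = [];
  let inFence = false;

  const flush = (): void => {
    const body = bodyLines.join('\n').trim();
    if (heading !== '' || body !== '') sections.push({ heading, body });
    bodyLines = [];
  };

  for (const line of markdown.split('\n')) {
    // Toggle on fence lines so "# comment" lines inside code are not treated as headings.
    if (line.trimStart().startsWith(FENCE)) inFence = !inFence;

    const isHeading = !inFence && /^#{1,6}\s/.test(line);
    if (isHeading) {
      flush(); // close the previous section before starting a new one
      heading = line.replace(/^#{1,6}\s+/, '').trim();
    } else {
      bodyLines.push(line);
    }
  }
  flush();
  return sections;
}
```

Each section can then be passed through a size-based splitter if it is still too long, with the heading kept in the chunk metadata so the retrieved text carries its context.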
Rule of thumb: Start with 512-token chunks and 64-token overlap. Measure retrieval quality with a test set before optimizing further. Premature chunking optimization is a real trap.
Setting Up the Vector Database
You have two primary choices: a managed cloud service or a self-hosted Postgres extension.
Option A: pgvector (Self-Hosted / Supabase)
If you already run Postgres — or use Supabase — pgvector is the zero-friction choice. Enable it with a single migration:
-- migrations/001_enable_vector.sql
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  content text NOT NULL,
  embedding vector(1536), -- matches text-embedding-3-small
  metadata jsonb DEFAULT '{}'
);
CREATE INDEX ON documents
  USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);
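The ivfflat index is an inverted file: it clusters vectors into lists, stores them uncompressed ("flat"), and at query time searches only the lists nearest to the query. Because the clusters are computed when the index is built, create it after the table holds representative data; the pgvector documentation suggests lists = rows / 1000 as a starting point for up to 1M rows. The alternative, hnsw (Hierarchical Navigable Small World), generally gives a better speed-recall tradeoff at query time, at the cost of a longer build and more memory, and it can be created on an empty table. If you choose hnsw, create it instead of the ivfflat index:
CREATE INDEX ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);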
Option B: Pinecone (Managed)
Pinecone requires zero database management. Create an index in the dashboard (or via API), choose your dimension count, and start upserting vectors immediately. It scales to billions of vectors without you touching a single server.
npm install @pinecone-database/pinecone
// lib/pinecone.ts
import { Pinecone } from '@pinecone-database/pinecone';
export const pinecone = new Pinecone({
  apiKey: process.env.PINECONE_API_KEY!,
});
export const index = pinecone.index(process.env.PINECONE_INDEX_NAME!);
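The rest of this tutorial uses pgvector, so here is a hedged sketch of the equivalent upsert and query steps against the index object above. To keep it runnable without an API key, it is written against a small VectorIndex interface that models only the two calls used; that interface reflects my understanding of the Pinecone TypeScript client (upsert taking an array of records, query returning matches), so verify it against the SDK version you install. The helper names and the metadata layout are assumptions.

```typescript
// Minimal structural type covering only the two calls used below (an assumption, see above).
type VectorRecord = {
  id: string;
  values: number[];
  metadata?: Record<string, string | number | boolean>;
};
type QueryMatch = { id: string; score?: number; metadata?: Record<string, unknown> };
interface VectorIndex {
  upsert(records: VectorRecord[]): Promise<unknown>;
  query(options: {
    vector: number[];
    topK: number;
    includeMetadata?: boolean;
  }): Promise<{ matches?: QueryMatch[] }>;
}

// Store chunk vectors, keeping the chunk text in metadata so it comes back with each match.
async function upsertChunks(
  index: VectorIndex,
  docId: string,
  texts: string[],
  vectors: number[][],
  batchSize = 100
): Promise<number> {
  if (texts.length !== vectors.length) {
    throw new Error('texts and vectors must have the same length');
  }
  const records: VectorRecord[] = texts.map((text, i) => ({
    id: `${docId}#${i}`, // deterministic ID, so re-ingesting a document reuses the same IDs
    values: vectors[i],
    metadata: { docId, chunkIndex: i, text },
  }));
  for (let i = 0; i < records.length; i += batchSize) {
    await index.upsert(records.slice(i, i + batchSize));
  }
  return records.length;
}

// Return the topK nearest chunks for an already-embedded query.
async function searchChunks(index: VectorIndex, queryVector: number[], topK = 5) {
  const response = await index.query({ vector: queryVector, topK, includeMetadata: true });
  return (response.matches ?? []).map((m) => ({
    id: m.id,
    score: m.score ?? 0,
    text: String(m.metadata?.text ?? ''),
  }));
}
```

If the interface matches your SDK version, the index exported from lib/pinecone.ts can be passed in directly; otherwise a thin adapter around it is enough.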
The Ingestion Pipeline: A Next.js Route Handler
Wire everything together in a single API route that accepts a document, chunks it, embeds it, and stores the vectors.
// app/api/ingest/route.ts
import { NextRequest, NextResponse } from 'next/server';
import OpenAI from 'openai';
import { chunkByTokens } from '@/lib/chunker';
import { db } from '@/lib/db'; // your Postgres client (e.g. drizzle/kysely)
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
export async function POST(req: NextRequest) {
  const { content, metadata } = await req.json();
  // Reject empty or non-string input before spending an embeddings call on it
  if (typeof content !== 'string' || content.trim() === '') {
    return NextResponse.json({ error: 'content must be a non-empty string' }, { status: 400 });
  }
  // 1. Chunk the document
  const chunks = chunkByTokens(content, 512, 64);
  // 2. Embed all chunks in a single batched request
  //    (assumes the document yields no more chunks than the API accepts per request)
  const embeddingResponse = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunks.map((c) => c.text),
  });
  // 3. Persist to pgvector
  const rows = chunks.map((chunk, i) => ({
    content: chunk.text,
    embedding: embeddingResponse.data[i].embedding,
    metadata: { ...metadata, chunkIndex: chunk.index },
  }));
  // tx.execute stands in for your client's parameterized query call; adapt it to the API you use
  await db.transaction(async (tx) => {
    for (const row of rows) {
      await tx.execute(
        `INSERT INTO documents (content, embedding, metadata)
         VALUES ($1, $2::vector, $3)`,
        [row.content, JSON.stringify(row.embedding), JSON.stringify(row.metadata)]
      );
    }
  });
  return NextResponse.json({ chunksIngested: rows.length });
}
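Batching is critical. The OpenAI embeddings API accepts up to 2048 inputs per request, so sending 100 chunks in one request replaces 100 network round trips with one and counts as a single request against your requests-per-minute limit. A document that produces more than 2048 chunks has to be split across several requests.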
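Because of the 2048-input limit, a large document needs more than one request. One way to handle that is sketched below; toBatches and embedAll are hypothetical helpers, and embedBatch is a callback you would implement as a thin wrapper around openai.embeddings.create.

```typescript
// Split an array into consecutive batches of at most maxBatchSize items.
function toBatches<T>(items: T[], maxBatchSize: number): T[][] {
  if (!Number.isInteger(maxBatchSize) || maxBatchSize < 1) {
    throw new RangeError('maxBatchSize must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += maxBatchSize) {
    batches.push(items.slice(i, i + maxBatchSize));
  }
  return batches;
}

// Embed any number of texts by sending one request per batch, in order.
// embedBatch must return one vector per input, in the same order as the inputs.
async function embedAll(
  texts: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>,
  maxBatchSize = 2048
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(texts, maxBatchSize)) {
    const batchVectors = await embedBatch(batch);
    if (batchVectors.length !== batch.length) {
      throw new Error('embedBatch returned a different number of vectors than inputs');
    }
    vectors.push(...batchVectors);
  }
  return vectors;
}
```

The batches are sent sequentially on purpose: it keeps the output order trivially aligned with the input and avoids bursting through rate limits. Running a few batches in parallel is a reasonable optimization once you know your limits.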
The Retrieval Pipeline: Semantic Search
At query time, embed the user's question and find the nearest document chunks by cosine similarity.
// lib/retrieve.ts
import OpenAI from 'openai';
import { db } from '@/lib/db';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
export async function retrieveRelevantChunks(
  query: string,
  topK = 5
): Promise<{ content: string; metadata: Record<string, unknown> }[]> {
  // 1. Embed the query
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query,
  });
  const queryVector = data[0].embedding;
  // 2. pgvector cosine similarity search
  const result = await db.execute<{ content: string; metadata: unknown }>(
    `SELECT content, metadata,
            1 - (embedding <=> $1::vector) AS similarity
     FROM documents
     ORDER BY embedding <=> $1::vector
     LIMIT $2`,
    [JSON.stringify(queryVector), topK]
  );
  return result.rows.map((r) => ({
    content: r.content,
    metadata: r.metadata as Record<string, unknown>,
  }));
}
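The <=> operator is pgvector's cosine distance operator. It returns a value between 0 (same direction) and 2 (opposite direction), so 1 minus the distance is the cosine similarity. The ORDER BY clause only needs ascending distance, so the subtraction matters only for the similarity column you return. Ordering by the raw distance expression is also what lets pgvector use the ivfflat or hnsw index.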
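To see where the 0-to-2 range comes from, here is the same distance computed by hand. This is a plain TypeScript illustration of the formula, not pgvector's code.

```typescript
// Cosine distance = 1 - cosine similarity = 1 - (a · b) / (|a| * |b|)
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('vectors must have the same number of dimensions');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero-length vector has no direction, so the result is NaN in that case.
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const sameDirection = cosineDistance([1, 0], [1, 0]);  // 0
const orthogonal = cosineDistance([1, 0], [0, 1]);     // 1
const opposite = cosineDistance([1, 0], [-1, 0]);      // 2
```

Only direction matters: cosineDistance([1, 0], [5, 0]) is also 0, which is why cosine distance works regardless of vector length.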
Streaming the RAG Response
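Now connect retrieval to generation. Use the Vercel AI SDK's streamText for a first-class streaming experience in Next.js. These examples use the API names from AI SDK 4.x (toDataStreamResponse on the server, useChat imported from 'ai/react' on the client); other major versions name some of these differently, so match the code to the version you have installed.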
// app/api/chat/route.ts
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { NextRequest } from 'next/server';
import { retrieveRelevantChunks } from '@/lib/retrieve';
export const runtime = 'nodejs';
export async function POST(req: NextRequest) {
  const { messages } = await req.json();
  // Guard against an empty or malformed history before reading the last message
  if (!Array.isArray(messages) || messages.length === 0) {
    return new Response('messages must be a non-empty array', { status: 400 });
  }
  const lastMessage = messages[messages.length - 1].content as string;
  // 1. Retrieve relevant context
  const chunks = await retrieveRelevantChunks(lastMessage, 5);
  const context = chunks.map((c) => c.content).join('\n\n---\n\n');
  // 2. Inject context into the system prompt
  const systemPrompt = `You are a helpful assistant. Answer the user's question
based ONLY on the following context. If the context does not contain enough
information to answer, say so honestly.
Context:
${context}`;
  // 3. Stream the response
  const result = streamText({
    model: openai('gpt-4o-mini'),
    system: systemPrompt,
    messages,
  });
  return result.toDataStreamResponse();
}
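The route above joins every retrieved chunk into the prompt as-is. A slightly more defensive variant, sketched below, numbers each chunk, labels it with its source, and stops adding chunks once a character budget is used up. buildContext is a hypothetical helper, the 8000-character default is an arbitrary placeholder, and it assumes the chunk metadata carries a source field.

```typescript
type RetrievedChunk = { content: string; metadata: Record<string, unknown> };

// Build the context string for the system prompt from best-first retrieval results.
function buildContext(chunks: RetrievedChunk[], maxChars = 8000): string {
  const blocks: string[] = [];
  let used = 0;
  for (let i = 0; i < chunks.length; i++) {
    const rawSource = chunks[i].metadata.source;
    const source = typeof rawSource === 'string' ? rawSource : 'unknown';
    const block = `[${i + 1}] (source: ${source})\n${chunks[i].content.trim()}`;
    // Chunks arrive best-first, so once the budget is exhausted the tail is dropped.
    if (used + block.length > maxChars) break;
    blocks.push(block);
    used += block.length;
  }
  return blocks.join('\n\n---\n\n');
}
```

Numbering the chunks also makes it possible to ask the model to cite the blocks it used, which helps when debugging wrong answers.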
On the client side, the Vercel AI SDK's useChat hook handles streaming automatically:
// app/chat/page.tsx
'use client';
import { useChat } from 'ai/react';
export default function ChatPage() {
  const { messages, input, handleInputChange, handleSubmit, isLoading } = useChat({
    api: '/api/chat',
  });
  return (
    <div className="flex flex-col h-screen max-w-2xl mx-auto p-4">
      <div className="flex-1 overflow-y-auto space-y-4">
        {messages.map((m) => (
          <div key={m.id} className={m.role === 'user' ? 'text-right' : 'text-left'}>
            <span className="inline-block bg-muted rounded-lg px-4 py-2">
              {m.content}
            </span>
          </div>
        ))}
      </div>
      <form onSubmit={handleSubmit} className="flex gap-2 mt-4">
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Ask anything..."
          className="flex-1 border rounded-lg px-4 py-2"
        />
        <button type="submit" disabled={isLoading}>Send</button>
      </form>
    </div>
  );
}
Advanced Retrieval Techniques
Hybrid Search (Keyword + Semantic)
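Pure vector search struggles with exact matches such as product codes, names, and IDs. Combine it with full-text search and merge the two result lists with Reciprocal Rank Fusion (RRF), which scores each document by summing 1 / (k + rank) over the lists it appears in (k = 60 is the conventional constant):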
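-- Hybrid search with RRF ($1 = query vector, $2 = query text)
-- Each branch keeps only its own top 20, so rows that do not match the
-- keyword query never receive a keyword score.
WITH semantic AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS rank
  FROM documents
  ORDER BY embedding <=> $1::vector
  LIMIT 20
),
keyword AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank(to_tsvector('english', content), query) DESC) AS rank
  FROM documents, plainto_tsquery('english', $2) query
  WHERE to_tsvector('english', content) @@ query
  ORDER BY ts_rank(to_tsvector('english', content), query) DESC
  LIMIT 20
)
SELECT d.id, d.content,
       COALESCE(1.0 / (60 + semantic.rank), 0.0)
     + COALESCE(1.0 / (60 + keyword.rank), 0.0) AS rrf_score
FROM semantic
FULL OUTER JOIN keyword USING (id)
JOIN documents d USING (id)
ORDER BY rrf_score DESC
LIMIT 5;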
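If the keyword and vector results come from separate queries or services, the same fusion can be done in application code. The sketch below is a generic RRF merge over ranked lists of IDs; reciprocalRankFusion is a name chosen for this example.

```typescript
// Merge several ranked ID lists (best first) into one list using Reciprocal Rank Fusion.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, position) => {
      const rank = position + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score
  );
}

const keywordHits = ['a', 'b', 'c'];
const vectorHits = ['b', 'd', 'a'];
const fused = reciprocalRankFusion([keywordHits, vectorHits]);
// Order of fused IDs: b, a, d, c
```

Documents b and a appear in both lists and therefore outrank d and c, which each appear in only one. RRF uses ranks only, so the raw ts_rank and cosine scores never need to be put on a common scale.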
Re-ranking
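Retrieve a larger candidate set (for example, the top 20) with fast vector search, then re-rank it using a more accurate cross-encoder model (e.g., Cohere Rerank or a local BAAI/bge-reranker-base). A cross-encoder reads the query and each candidate together, which is too slow to run over the whole corpus but affordable for 20 documents. This two-stage approach keeps the speed of ANN search while letting the more precise model decide the final order.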
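The second stage can be sketched independently of any particular reranker. In the example below, Scorer is a hypothetical function type that you would implement by calling your reranking model or API; it must return one relevance score per document, in input order.

```typescript
type Candidate = { id: string; content: string };
// Hypothetical adapter around a cross-encoder or rerank API: one score per document.
type Scorer = (query: string, documents: string[]) => Promise<number[]>;

// Stage 2 of two-stage retrieval: reorder the candidates by the scorer's judgment.
async function rerank(
  query: string,
  candidates: Candidate[],
  score: Scorer,
  topN = 5
): Promise<(Candidate & { score: number })[]> {
  if (candidates.length === 0) return [];
  const scores = await score(query, candidates.map((c) => c.content));
  if (scores.length !== candidates.length) {
    throw new Error('scorer must return exactly one score per candidate');
  }
  return candidates
    .map((c, i) => ({ ...c, score: scores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN);
}
```

A typical call would pass the top 20 rows from the vector search as candidates and keep the best 5 for the prompt.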
Metadata Filtering
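Always store metadata (source URL, document type, date, author) alongside your vectors, and filter on it in the same query to keep results from irrelevant sources out. One caveat with pgvector: when an approximate index is used, the WHERE clause is applied after the index scan, so a very selective filter can return fewer than LIMIT rows. For selective filters, an ordinary index on the filtered field (so Postgres can run an exact search over the matching rows) or a partial vector index is often the better fit:
SELECT content FROM documents
WHERE metadata->>'source' = 'docs.myapp.com'
  AND (metadata->>'updatedAt')::date > NOW() - INTERVAL '30 days'
ORDER BY embedding <=> $1::vector
LIMIT 5;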
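When the filters come from user input, build the query with bound parameters rather than string concatenation. The sketch below generates the SQL text and the parameter array for equality filters on metadata fields; buildFilteredSearch is a hypothetical helper, and it only produces the query, leaving execution to whatever Postgres client you use.

```typescript
type SqlQuery = { text: string; values: unknown[] };

// Build a parameterized vector search with optional equality filters on metadata fields.
function buildFilteredSearch(
  queryVector: number[],
  filters: Record<string, string>,
  topK = 5
): SqlQuery {
  const values: unknown[] = [JSON.stringify(queryVector)]; // $1 is always the query vector
  const conditions: string[] = [];
  for (const [key, value] of Object.entries(filters)) {
    values.push(key, value);
    // Both the JSON key and the expected value are bound parameters.
    conditions.push(`metadata->>$${values.length - 1}::text = $${values.length}`);
  }
  values.push(topK);
  const lines = [
    'SELECT content, metadata FROM documents',
    conditions.length > 0 ? `WHERE ${conditions.join(' AND ')}` : '',
    'ORDER BY embedding <=> $1::vector',
    `LIMIT $${values.length}`,
  ];
  return { text: lines.filter((line) => line !== '').join('\n'), values };
}
```

For example, buildFilteredSearch(vector, { source: 'docs.myapp.com' }, 3) yields a WHERE clause of metadata->>$2::text = $3 with 'source' and 'docs.myapp.com' as the bound values and the limit as $4.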
Production Checklist
- Rate limiting: Wrap your ingestion endpoint with a job queue (BullMQ, Inngest) to avoid hammering the embeddings API.
- Embedding cache: Cache embeddings for identical strings in Redis. Documents rarely change; re-embedding on every request wastes money.
- Index maintenance: Run VACUUM ANALYZE documents periodically, and rebuild the ivfflat index after bulk ingestion, because its cluster centers are fixed at build time. An hnsw index is updated incrementally and normally does not need a rebuild.
- Observability: Log query, retrieved chunks, and final LLM answer to a traces table. Use LangSmith or Langfuse to spot retrieval failures.
- Chunking evaluation: Build a small golden dataset of query→expected-chunk pairs and measure recall@5 before shipping to production.
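The recall@5 measurement from the last item takes only a few lines. In this sketch, each EvalCase pairs the chunk you expect with the IDs your retriever actually returned, best first; the type and function names are made up for the example, and it assumes exactly one expected chunk per query.

```typescript
type EvalCase = {
  query: string;
  expectedChunkId: string;
  retrievedIds: string[]; // what the retriever returned, best first
};

// recall@k: fraction of queries whose expected chunk appears in the top k results.
function recallAtK(cases: EvalCase[], k = 5): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter((c) =>
    c.retrievedIds.slice(0, k).includes(c.expectedChunkId)
  ).length;
  return hits / cases.length;
}
```

Re-run the same golden set after every change to chunk size, overlap, or embedding model, so each change is judged by whether the number moves.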
Conclusion
RAG is not magic — it is disciplined engineering. The quality of your system depends far more on how you chunk and retrieve than on which LLM you call at the end. Start simple: fixed-size chunks, text-embedding-3-small, pgvector on Supabase. Ship it. Measure retrieval quality against real queries. Then layer in hybrid search, re-ranking, and metadata filtering as your use case demands.
Next.js App Router gives you the perfect foundation: server components for data-heavy ingestion UIs, Route Handlers for streaming API endpoints, and first-class TypeScript throughout. The stack is approachable, the pieces are composable, and — with the patterns above — you can go from zero to a production RAG system in a weekend.