RAG Patterns & Best Practices
Advanced patterns, techniques, and best practices for building production-ready RAG applications with HazelJS.
RAG Architecture Patterns
Basic RAG Pattern
The simplest RAG implementation: retrieve relevant documents and use them as context.
import { RAGPipeline, OpenAIEmbeddings, MemoryVectorStore } from '@hazeljs/rag';
import { AIService } from '@hazeljs/ai';
async function basicRAG(query: string) {
// 1. Setup
const embeddings = new OpenAIEmbeddings({
apiKey: process.env.OPENAI_API_KEY,
});
const vectorStore = new MemoryVectorStore(embeddings);
await vectorStore.initialize();
const rag = new RAGPipeline({
vectorStore,
embeddingProvider: embeddings,
topK: 3,
});
// 2. Retrieve
const results = await rag.query(query);
// 3. Format context
const context = results.sources
.map(s => s.content)
.join('\n\n');
// 4. Generate with LLM
const aiService = new AIService();
const response = await aiService.executeTask({
name: 'rag-answer',
provider: 'openai',
model: 'gpt-4-turbo-preview',
prompt: `Context:\n${context}\n\nQuestion: ${query}\n\nAnswer:`,
outputType: 'string',
}, {});
return {
answer: response.data,
sources: results.sources,
};
}
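A quick usage sketch of the function above:
const { answer, sources } = await basicRAG('What is HazelJS?');
console.log(answer);
console.log(`Grounded in ${sources.length} sources`);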
Hybrid RAG Pattern
Combines vector search with keyword search for better retrieval.
import { HybridSearchRetrieval } from '@hazeljs/rag';
async function hybridRAG(query: string) {
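  // vectorStore and documents are assumed from the Basic RAG setup above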
const hybridSearch = new HybridSearchRetrieval(vectorStore, {
vectorWeight: 0.7,
keywordWeight: 0.3,
});
await hybridSearch.indexDocuments(documents);
const results = await hybridSearch.search(query, { topK: 5 });
// Continue with LLM generation...
}
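Under the hood, hybrid retrieval blends the two signals per document. A minimal fusion sketch, assuming both scores are normalized to [0, 1] (the package's exact formula may differ):
// Weighted score fusion (illustrative only)
function fuseScores(
  vectorScore: number,
  keywordScore: number,
  vectorWeight = 0.7,
  keywordWeight = 0.3
): number {
  return vectorWeight * vectorScore + keywordWeight * keywordScore;
}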
Multi-Query RAG Pattern
Generates multiple query variations for comprehensive retrieval.
import { MultiQueryRetrieval } from '@hazeljs/rag';
async function multiQueryRAG(query: string) {
const multiQuery = new MultiQueryRetrieval(vectorStore, {
numQueries: 3,
llmProvider: 'openai',
apiKey: process.env.OPENAI_API_KEY,
});
const results = await multiQuery.retrieve(query, { topK: 10 });
// Results are already deduped and ranked
// Continue with LLM generation...
}
Conversational RAG Pattern
Maintains conversation history for context-aware responses.
interface ConversationMessage {
role: 'user' | 'assistant';
content: string;
}
class ConversationalRAG {
private conversations = new Map<string, ConversationMessage[]>();
async chat(sessionId: string, message: string) {
// 1. Get conversation history
const history = this.conversations.get(sessionId) || [];
// 2. Retrieve relevant documents
const results = await rag.query(message);
const context = results.sources.map(s => s.content).join('\n\n');
// 3. Format with history
const historyText = history
.map(m => `${m.role}: ${m.content}`)
.join('\n');
// 4. Generate response
const response = await aiService.executeTask({
name: 'conversational-rag',
provider: 'openai',
model: 'gpt-4-turbo-preview',
prompt: `
Conversation History:
${historyText}
Context:
${context}
User: ${message}
Assistant:
`,
outputType: 'string',
}, {});
// 5. Update history
history.push(
{ role: 'user', content: message },
{ role: 'assistant', content: response.data }
);
this.conversations.set(sessionId, history);
return {
answer: response.data,
sources: results.sources,
};
}
}
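Usage sketch: the session ID scopes each conversation's history.
const chatbot = new ConversationalRAG();
const first = await chatbot.chat('session-42', 'What is HazelJS?');
// The follow-up reuses the stored history for the same session
const followUp = await chatbot.chat('session-42', 'How does its RAG package work?');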
Stateful RAG with Database Persistence
Use the Agent Runtime with database state management for production-grade conversational RAG:
import { Agent, Tool, AgentRuntime } from '@hazeljs/agent';
import { DatabaseStateManager } from '@hazeljs/agent/state';
import { RAGPipeline } from '@hazeljs/rag';
import { PrismaClient } from '@prisma/client';
@Agent({ name: 'conversational-rag-agent' })
class ConversationalRAGAgent {
constructor(private rag: RAGPipeline) {}
@Tool({ description: 'Answer questions using RAG with conversation context' })
async answer(query: string): Promise<any> {
// Agent runtime automatically provides conversation history
// from database state manager
const results = await this.rag.query(query);
return {
answer: results.answer,
sources: results.sources,
confidence: results.confidence,
};
}
}
// Setup with database persistence
const prisma = new PrismaClient();
const stateManager = new DatabaseStateManager({
client: prisma,
softDelete: true, // Keep deleted contexts for audit
autoArchive: true, // Archive old conversations
archiveThresholdDays: 30, // Archive after 30 days
});
const runtime = new AgentRuntime({
stateManager,
// ... other config
});
// Execute with persistent conversation state
const result = await runtime.execute('conversational-rag-agent', 'What is HazelJS?', {
sessionId: 'user-123',
userId: 'user-abc',
enableMemory: true, // Automatic conversation history
enableRAG: true, // Enable RAG integration
});
// Continue conversation - history is automatically loaded from database
const followUp = await runtime.resume(result.executionId, 'Tell me more about its features');
// Query all conversations for a session
const sessionContexts = await stateManager.getSessionContexts('user-123');
console.log(`Found ${sessionContexts.length} conversations in this session`);
Database State Manager Features:
- Automatic State Persistence: All conversation history, steps, and context saved to database
- Session Management: Track multiple conversations per user/session
- Pause/Resume: Long-running RAG queries can be paused and resumed
- Soft Deletes: Keep audit trail of deleted conversations
- Auto-Archiving: Automatically archive old conversations
- Working Memory: Store temporary context and variables
- Entity Tracking: Track entities mentioned in conversations
- Full Audit Trail: Complete history of all agent steps and decisions
Benefits:
- Production-Ready: Durable state management for production applications
- Scalable: Database-backed state works across multiple instances
- Queryable: SQL queries for analytics and monitoring
- Recoverable: Resume conversations after crashes or restarts
- Compliant: Full audit trail for compliance requirements
Document Chunking Strategies
Fixed-Size Chunking
Simple but effective for uniform content.
import { RecursiveTextSplitter } from '@hazeljs/rag';
const splitter = new RecursiveTextSplitter({
chunkSize: 1000,
chunkOverlap: 200,
});
const chunks = splitter.splitText(longDocument);
Best For:
- Uniform content (articles, documentation)
- Simple implementation
- Predictable chunk sizes
Semantic Chunking
Split by meaning rather than size (future enhancement).
// Future implementation
const semanticSplitter = new SemanticTextSplitter({
embeddingProvider: embeddings,
similarityThreshold: 0.8,
});
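Until the built-in splitter lands, a hand-rolled approximation can start a new chunk whenever similarity between adjacent sentence embeddings drops below the threshold. A minimal sketch, assuming the embeddings.embed API used elsewhere on this page:
async function semanticSplit(text: string, threshold = 0.8): Promise<string[]> {
  const sentences = text.split(/(?<=[.!?])\s+/);
  const vectors = await Promise.all(sentences.map(s => embeddings.embed(s)));
  const chunks: string[] = [];
  let current = [sentences[0]];
  for (let i = 1; i < sentences.length; i++) {
    // Start a new chunk when the topic appears to shift
    if (cosineSimilarity(vectors[i - 1], vectors[i]) < threshold) {
      chunks.push(current.join(' '));
      current = [];
    }
    current.push(sentences[i]);
  }
  chunks.push(current.join(' '));
  return chunks;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}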
Best For:
- Preserving context
- Complex documents
- Better retrieval quality
Document Structure-Aware Chunking
Split by document structure (headings, paragraphs).
function splitByStructure(markdown: string) {
  // Split on level-2 headings, keeping each heading with its section
  const sections = markdown
    .split(/^(?=##\s)/gm)
    .filter(section => section.trim().length > 0);
  return sections.map(section => ({
    content: section,
    metadata: {
      heading: section.split('\n')[0].replace(/^##\s+/, ''),
      type: 'section',
    },
  }));
}
Best For:
- Structured documents (Markdown, HTML)
- Maintaining hierarchy
- Better context preservation
Metadata Strategies
Rich Metadata
Add comprehensive metadata for better filtering and ranking.
await vectorStore.addDocuments([
{
content: 'Document content',
metadata: {
// Source information
source: 'documentation',
url: 'https://example.com/doc',
// Temporal information
createdAt: '2024-01-01',
updatedAt: '2024-01-15',
// Classification
category: 'technical',
tags: ['typescript', 'framework', 'rag'],
// Quality signals
author: 'John Doe',
reviewStatus: 'approved',
// Custom fields
importance: 'high',
audience: 'developers',
},
},
]);
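Rich metadata pays off at query time. A filtered search sketch (filter operator support varies by vector store):
const approved = await vectorStore.search('framework setup', {
  filter: {
    category: 'technical',
    reviewStatus: 'approved',
  },
});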
Hierarchical Metadata
Organize documents in hierarchies.
await vectorStore.addDocuments([
{
content: 'Chapter content',
metadata: {
book: 'HazelJS Guide',
chapter: 'RAG Package',
section: 'Vector Stores',
subsection: 'Pinecone',
hierarchy: ['book', 'chapter', 'section', 'subsection'],
},
},
]);
// Search within hierarchy
const results = await vectorStore.search('pinecone setup', {
filter: {
book: 'HazelJS Guide',
chapter: 'RAG Package',
},
});
Temporal Metadata
Track document freshness and relevance.
function addTemporalMetadata(doc: Document) {
return {
...doc,
metadata: {
...doc.metadata,
indexedAt: new Date().toISOString(),
expiresAt: new Date(Date.now() + 30 * 24 * 60 * 60 * 1000).toISOString(),
version: '1.0',
},
};
}
// Filter by freshness
const recentResults = await vectorStore.search('query', {
filter: {
indexedAt: { $gte: '2024-01-01' }, // operator syntax varies by vector store
},
});
Query Optimization
Query Expansion
Expand user queries with synonyms and related terms.
async function expandQuery(query: string): Promise<string[]> {
const aiService = new AIService();
const expansion = await aiService.executeTask({
name: 'query-expansion',
provider: 'openai',
model: 'gpt-3.5-turbo',
prompt: `
Generate 3 alternative phrasings for this query: "${query}"
Return as JSON array of strings.
`,
outputType: 'json',
}, {});
return [query, ...expansion.data];
}
// Use expanded queries
const queries = await expandQuery('vector database setup');
const allResults = await Promise.all(
queries.map(q => vectorStore.search(q))
);
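allResults holds one result array per query variant. A merge sketch that flattens, dedupes by ID, and keeps the best score per document (assuming SearchResult exposes id and score, as used throughout this page):
const best = new Map<string, SearchResult>();
for (const result of allResults.flat()) {
  const seen = best.get(result.id);
  if (!seen || result.score > seen.score) {
    best.set(result.id, result);
  }
}
const merged = [...best.values()]
  .sort((a, b) => b.score - a.score)
  .slice(0, 10);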
Query Rewriting
Rewrite queries for better retrieval.
async function rewriteQuery(query: string): Promise<string> {
const aiService = new AIService();
const rewritten = await aiService.executeTask({
name: 'query-rewrite',
provider: 'openai',
model: 'gpt-3.5-turbo',
prompt: `
Rewrite this query to be more specific and searchable: "${query}"
Focus on key technical terms and concepts.
`,
outputType: 'string',
}, {});
return rewritten.data;
}
Query Classification
Route queries to appropriate retrieval strategies.
async function classifyQuery(query: string) {
// Simple classification
const hasQuestionWords = /^(what|how|why|when|where|who)/i.test(query);
const hasTechnicalTerms = /\b(api|code|function|class|error)\b/i.test(query);
if (hasQuestionWords) {
return 'semantic'; // Use vector search
} else if (hasTechnicalTerms) {
return 'hybrid'; // Use hybrid search
} else {
return 'keyword'; // Use BM25
}
}
async function smartSearch(query: string) {
const strategy = await classifyQuery(query);
switch (strategy) {
case 'semantic':
return vectorStore.search(query);
case 'hybrid':
return hybridSearch.search(query);
case 'keyword':
return bm25.search(query);
default:
return vectorStore.search(query);
}
}
Response Generation
Citation-Aware Generation
Include source citations in responses.
async function generateWithCitations(query: string) {
const results = await rag.query(query);
const contextWithCitations = results.sources
.map((source, idx) => `[${idx + 1}] ${source.content}`)
.join('\n\n');
const response = await aiService.executeTask({
name: 'cited-answer',
provider: 'openai',
model: 'gpt-4-turbo-preview',
prompt: `
Context (with citations):
${contextWithCitations}
Question: ${query}
Provide an answer and cite sources using [1], [2], etc.
`,
outputType: 'string',
}, {});
return {
answer: response.data,
sources: results.sources.map((s, idx) => ({
citation: `[${idx + 1}]`,
content: s.content,
metadata: s.metadata,
})),
};
}
Confidence Scoring
Score answer confidence based on retrieval quality.
function calculateConfidence(results: SearchResult[]): number {
if (results.length === 0) return 0;
// Average similarity score
const avgScore = results.reduce((sum, r) => sum + r.score, 0) / results.length;
// Score distribution (lower variance = higher confidence)
const variance = results.reduce(
(sum, r) => sum + Math.pow(r.score - avgScore, 2),
0
) / results.length;
// Combine metrics
const confidence = avgScore * (1 - Math.min(variance, 0.5));
return Math.round(confidence * 100);
}
async function answerWithConfidence(query: string) {
const results = await rag.query(query);
const confidence = calculateConfidence(results.sources);
if (confidence < 50) {
return {
answer: "I don't have enough information to answer confidently.",
confidence,
sources: results.sources,
};
}
// Generate answer...
}
Performance Optimization
Caching Strategy
Cache embeddings and search results for better performance.
In-Memory Caching (Development)
Simple in-memory caching for development:
class CachedRAG {
private embeddingCache = new Map<string, number[]>();
private searchCache = new Map<string, SearchResult[]>();
async getEmbedding(text: string): Promise<number[]> {
if (this.embeddingCache.has(text)) {
return this.embeddingCache.get(text)!;
}
const embedding = await embeddings.embed(text);
this.embeddingCache.set(text, embedding);
return embedding;
}
async search(query: string): Promise<SearchResult[]> {
const cacheKey = `${query}:${Date.now() / 60000 | 0}`; // 1-minute buckets; stale entries are never evicted (development only)
if (this.searchCache.has(cacheKey)) {
return this.searchCache.get(cacheKey)!;
}
const results = await vectorStore.search(query);
this.searchCache.set(cacheKey, results);
return results;
}
}
Redis Caching (Production)
Use Redis for distributed caching across multiple instances:
import { CacheService } from '@hazeljs/cache';
import { RAGPipeline } from '@hazeljs/rag';
class ProductionRAGService {
private cache: CacheService;
private rag: RAGPipeline;
constructor() {
// Initialize Redis cache
this.cache = new CacheService({
strategy: 'redis',
redis: {
host: process.env.REDIS_HOST || 'localhost',
port: parseInt(process.env.REDIS_PORT || '6379'),
password: process.env.REDIS_PASSWORD,
},
ttl: 3600, // 1 hour default
});
this.rag = new RAGPipeline({
vectorStore,
embeddingProvider: embeddings,
});
}
async search(query: string, options?: { topK?: number }): Promise<SearchResult[]> {
const cacheKey = `rag:search:${query}:${options?.topK || 5}`;
// Try cache first
const cached = await this.cache.get<SearchResult[]>(cacheKey);
if (cached) {
console.log('Cache hit for query:', query);
return cached;
}
// Perform search
console.log('Cache miss, performing search:', query);
const results = await this.rag.search(query, options);
// Cache results with tags for invalidation
await this.cache.setWithTags(
cacheKey,
results,
3600, // 1 hour TTL
['rag-searches', `topK:${options?.topK || 5}`]
);
return results;
}
async addDocuments(documents: Document[]): Promise<void> {
// Add documents to vector store
await this.rag.addDocuments(documents);
// Invalidate all search caches since index changed
await this.cache.invalidateTags(['rag-searches']);
console.log('Invalidated all RAG search caches');
}
async getCachedStats(): Promise<any> {
return await this.cache.getStats();
}
}
// Usage
const ragService = new ProductionRAGService();
// First search - cache miss
const results1 = await ragService.search('What is HazelJS?');
// Second search - cache hit (fast!)
const results2 = await ragService.search('What is HazelJS?');
// Add new documents - invalidates cache
await ragService.addDocuments([
{ content: 'New documentation', metadata: {} }
]);
// Next search - cache miss (cache was invalidated)
const results3 = await ragService.search('What is HazelJS?');
// Check cache performance
const stats = await ragService.getCachedStats();
console.log('Cache hit rate:', stats.hitRate);
Multi-Tier Caching (Optimal Performance)
Combine memory and Redis for best performance:
import { CacheService } from '@hazeljs/cache';
import { RAGPipeline } from '@hazeljs/rag';
class OptimizedRAGService {
private cache: CacheService;
private rag: RAGPipeline; // assumed initialized as in ProductionRAGService above
constructor() {
// Multi-tier: L1 (memory) + L2 (Redis)
this.cache = new CacheService({
strategy: 'multi-tier',
redis: {
host: process.env.REDIS_HOST || 'localhost',
port: parseInt(process.env.REDIS_PORT || '6379'),
},
ttl: 3600,
});
}
async search(query: string): Promise<SearchResult[]> {
return await this.cache.getOrSet(
`rag:search:${query}`,
async () => {
// Only called on cache miss
return await this.rag.search(query);
},
3600,
['rag-searches']
);
}
}
Benefits of Redis Caching:
- Distributed: Shared cache across multiple server instances
- Persistent: Cache survives server restarts
- Scalable: Handle high traffic with Redis cluster
- Fast: Sub-millisecond response times for cached queries
- Tag-based invalidation: Invalidate related caches together
Batch Processing
Process documents in batches for efficiency.
async function indexLargeDataset(documents: Document[]) {
const batchSize = 100;
for (let i = 0; i < documents.length; i += batchSize) {
const batch = documents.slice(i, i + batchSize);
console.log(`Processing batch ${i / batchSize + 1}...`);
await vectorStore.addDocuments(batch);
// Rate limiting
await new Promise(resolve => setTimeout(resolve, 1000));
}
}
Parallel Retrieval
Retrieve from multiple sources in parallel.
async function parallelRetrieval(query: string) {
const [vectorResults, keywordResults, cachedResults] = await Promise.all([
vectorStore.search(query),
bm25.search(query, 10),
getCachedResults(query),
]);
// Merge and deduplicate
const allResults = [...vectorResults, ...keywordResults, ...cachedResults];
const uniqueResults = deduplicateById(allResults);
return uniqueResults.sort((a, b) => b.score - a.score).slice(0, 10);
}
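deduplicateById is not shown above; one possible implementation that keeps the highest-scoring entry per ID:
function deduplicateById<T extends { id: string; score: number }>(results: T[]): T[] {
  const byId = new Map<string, T>();
  for (const result of results) {
    const existing = byId.get(result.id);
    if (!existing || result.score > existing.score) {
      byId.set(result.id, result);
    }
  }
  return [...byId.values()];
}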
Error Handling
Graceful Degradation
Fall back to simpler strategies on failure.
async function robustSearch(query: string) {
  try {
    // Try hybrid search first
    return await hybridSearch.search(query);
  } catch (hybridError) {
    console.warn('Hybrid search failed, falling back to vector search');
    try {
      return await vectorStore.search(query);
    } catch (vectorError) {
      console.error('Vector search failed, using cached results');
      return getCachedResults(query);
    }
  }
}
Retry Logic
Implement exponential backoff for transient failures.
async function searchWithRetry(
query: string,
maxRetries = 3
): Promise<SearchResult[]> {
for (let attempt = 0; attempt < maxRetries; attempt++) {
try {
return await vectorStore.search(query);
} catch (error) {
if (attempt === maxRetries - 1) throw error;
const delay = Math.pow(2, attempt) * 1000;
console.log(`Retry ${attempt + 1} after ${delay}ms`);
await new Promise(resolve => setTimeout(resolve, delay));
}
}
throw new Error('Max retries exceeded');
}
Monitoring and Observability
Metrics Collection
Track key RAG metrics.
class RAGMetrics {
private metrics = {
queries: 0,
avgLatency: 0,
avgRelevance: 0,
cacheHits: 0,
errors: 0,
};
async trackQuery(fn: () => Promise<any>) {
const start = Date.now();
this.metrics.queries++;
try {
const result = await fn();
const latency = Date.now() - start;
this.metrics.avgLatency =
(this.metrics.avgLatency * (this.metrics.queries - 1) + latency) /
this.metrics.queries;
return result;
} catch (error) {
this.metrics.errors++;
throw error;
}
}
getMetrics() {
return { ...this.metrics };
}
}
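Usage sketch: wrap each retrieval call so latency and error counts accumulate.
const metrics = new RAGMetrics();
const results = await metrics.trackQuery(() => vectorStore.search('What is HazelJS?'));
console.log(metrics.getMetrics());
// e.g. { queries: 1, avgLatency: 37, avgRelevance: 0, cacheHits: 0, errors: 0 }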
Logging
Structured logging for debugging.
function logRAGOperation(operation: string, data: any) {
console.log(JSON.stringify({
timestamp: new Date().toISOString(),
operation,
...data,
}));
}
async function searchWithLogging(query: string) {
logRAGOperation('search_start', { query });
const start = Date.now();
const results = await vectorStore.search(query);
const duration = Date.now() - start;
logRAGOperation('search_complete', {
query,
resultCount: results.length,
duration,
topScore: results[0]?.score,
});
return results;
}
Testing Strategies
Unit Testing
Test individual components.
import { describe, it, expect } from '@jest/globals';
describe('RAG Pipeline', () => {
it('should retrieve relevant documents', async () => {
const rag = new RAGPipeline({
vectorStore: new MemoryVectorStore(embeddings),
embeddingProvider: embeddings,
});
await rag.addDocuments([
{ content: 'TypeScript is a typed superset of JavaScript' },
]);
const results = await rag.query('What is TypeScript?');
expect(results.sources).toHaveLength(1);
expect(results.sources[0].score).toBeGreaterThan(0.7);
});
});
Integration Testing
Test end-to-end RAG flow.
describe('RAG Integration', () => {
it('should answer questions correctly', async () => {
const result = await basicRAG('What is HazelJS?');
expect(result.answer).toContain('framework');
expect(result.sources.length).toBeGreaterThan(0);
});
});
Evaluation Metrics
Measure RAG quality.
interface EvaluationResult {
precision: number;
recall: number;
f1Score: number;
}
function evaluateRetrieval(
retrieved: SearchResult[],
relevant: string[]
): EvaluationResult {
const retrievedIds = new Set(retrieved.map(r => r.id));
const relevantIds = new Set(relevant);
const truePositives = [...retrievedIds].filter(id => relevantIds.has(id)).length;
const precision = retrieved.length > 0 ? truePositives / retrieved.length : 0;
const recall = relevant.length > 0 ? truePositives / relevant.length : 0;
const f1Score = precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;
return { precision, recall, f1Score };
}
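To evaluate a whole pipeline, run it over a labeled test set and average the scores. A sketch (the test-case shape is an assumption):
interface TestCase {
  query: string;
  relevantIds: string[]; // ground-truth document ids
}

async function evaluatePipeline(testCases: TestCase[]): Promise<EvaluationResult> {
  const scores = await Promise.all(
    testCases.map(async tc => {
      const retrieved = await vectorStore.search(tc.query);
      return evaluateRetrieval(retrieved, tc.relevantIds);
    })
  );
  const avg = (key: keyof EvaluationResult) =>
    scores.reduce((sum, s) => sum + s[key], 0) / scores.length;
  return { precision: avg('precision'), recall: avg('recall'), f1Score: avg('f1Score') };
}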
Production Checklist
Before Deployment
- Choose appropriate vector store for scale
- Implement error handling and retries
- Add monitoring and logging
- Set up caching strategy
- Configure rate limiting
- Test with production-like data
- Optimize chunk size and overlap
- Implement metadata filtering
- Add confidence scoring
- Set up backup and recovery
Performance Targets
- Search Latency: p95 < 500ms (see the helper sketch after this list)
- Indexing Throughput: > 100 docs/second
- Cache Hit Rate: > 70%
- Error Rate: < 1%
- Relevance Score: > 0.7 average
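A small helper for checking the latency target from recorded sample latencies (nearest-rank percentile, sketch):
function percentile(latenciesMs: number[], p: number): number {
  if (latenciesMs.length === 0) return 0;
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}

// e.g. percentile(latencies, 95) < 500 satisfies the search latency target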
What's Next?
- Explore RAG Package for implementation details
- Learn about Vector Stores for database selection
- Check out AI Package for LLM integration