Database Setup - Supabase with pgvector¶
This document contains the SQL schema and functions required to set up the RAG (Retrieval-Augmented Generation) system using Supabase with pgvector for vector similarity search.
Overview¶
The RAG system uses three main tables:
- searches: Logs user search queries and activity
- documents: Stores policy documents with metadata
- chunks: Stores text chunks with vector embeddings for similarity search
Prerequisites¶
- Enable the
pgvectorextension in your Supabase project - Run the SQL commands below in the Supabase SQL editor
1. Enable pgvector Extension¶
-- Enable the pgvector extension for vector similarity search
CREATE EXTENSION IF NOT EXISTS vector;
2. Searches Table¶
The searches table logs user search queries for analytics and history tracking.
-- Create searches table for query logging
CREATE TABLE searches (
search_id TEXT PRIMARY KEY,
project_id TEXT NOT NULL,
search_query TEXT NOT NULL,
user_id TEXT NOT NULL,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
-- Create indexes for better query performance
CREATE INDEX idx_searches_project_id ON searches(project_id);
CREATE INDEX idx_searches_user_id ON searches(user_id);
CREATE INDEX idx_searches_created_at ON searches(created_at);
3. Documents Table¶
The documents table stores the policy documents collected from search results.
-- Create documents table
CREATE TABLE documents (
id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
project_id TEXT NOT NULL,
external_id TEXT,
title TEXT NOT NULL,
authors TEXT[],
abstract TEXT,
content TEXT,
doi TEXT,
overton_url TEXT,
source_country TEXT,
source_type TEXT,
published_date TIMESTAMP,
confidence DECIMAL,
relevance_reason TEXT,
is_relevant BOOLEAN DEFAULT true,
top_line TEXT,
metadata JSONB DEFAULT '{}',
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
-- Create indexes for better query performance
CREATE INDEX idx_documents_project_id ON documents(project_id);
CREATE INDEX idx_documents_external_id ON documents(external_id);
CREATE INDEX idx_documents_created_at ON documents(created_at);
4. Chunks Table¶
The chunks table stores document chunks with vector embeddings for semantic search.
-- Create chunks table with vector embeddings
CREATE TABLE chunks (
id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
project_id TEXT NOT NULL,
content TEXT NOT NULL,
chunk_type TEXT NOT NULL DEFAULT 'summary',
chunk_index INTEGER DEFAULT 0,
embedding VECTOR(1536), -- OpenAI ada-002 embedding dimension
token_count INTEGER,
created_at TIMESTAMP DEFAULT NOW()
);
-- Create indexes for vector similarity search
CREATE INDEX idx_chunks_project_id ON chunks(project_id);
CREATE INDEX idx_chunks_document_id ON chunks(document_id);
CREATE INDEX idx_chunks_embedding ON chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
5. Vector Similarity Search Function¶
This PostgreSQL function performs efficient vector similarity search with project filtering.
-- Create function for vector similarity search
CREATE OR REPLACE FUNCTION match_chunks(
query_embedding VECTOR(1536),
match_threshold FLOAT DEFAULT 0.8,
match_count INT DEFAULT 5,
project_filter TEXT DEFAULT NULL
)
RETURNS TABLE (
id UUID,
document_id UUID,
content TEXT,
chunk_type TEXT,
similarity FLOAT,
document_title TEXT,
project_id TEXT
)
LANGUAGE plpgsql
AS $$
BEGIN
RETURN QUERY
SELECT
c.id,
c.document_id,
c.content,
c.chunk_type,
1 - (c.embedding <=> query_embedding) AS similarity,
d.title AS document_title,
c.project_id
FROM chunks c
LEFT JOIN documents d ON c.document_id = d.id
WHERE
(project_filter IS NULL OR c.project_id = project_filter)
AND (1 - (c.embedding <=> query_embedding)) > match_threshold
ORDER BY c.embedding <=> query_embedding
LIMIT match_count;
END;
$$;
6. Row Level Security (Optional)¶
If you want to enable Row Level Security (RLS) for multi-tenant applications:
-- Enable RLS on both tables
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
-- Create policies (example for project-based access)
-- Note: Adjust these policies based on your authentication setup
-- Documents policy
CREATE POLICY "Users can access documents in their projects" ON documents
FOR ALL USING (
project_id IN (
SELECT project_id FROM user_projects
WHERE user_id = auth.uid()
)
);
-- Chunks policy
CREATE POLICY "Users can access chunks in their projects" ON chunks
FOR ALL USING (
project_id IN (
SELECT project_id FROM user_projects
WHERE user_id = auth.uid()
)
);
7. Environment Variables¶
Ensure these environment variables are set in your backend:
# Supabase Configuration
SUPABASE_URL=your_supabase_project_url
SUPABASE_KEY=your_supabase_anon_key
# OpenAI for embeddings
OPENAI_API_KEY=your_openai_api_key
Usage in Application¶
The Python application uses these tables through the VectorizationService:
Storing Documents¶
# Store search results with vectorization
await vectorization_service.store_search_results(papers, project_id)
Logging Search Queries¶
# Log user search activity
search_id = await logging_service.log_search(
project_id="test_project",
search_query="refined user query",
user_id="clerk_user_id"
)
Searching Similar Content¶
# Find relevant documents for user query
relevant_docs = await vectorization_service.search_similar_content(
query="user question",
project_id="test_project",
match_threshold=0.7,
match_count=5
)
Getting Search History¶
# Retrieve recent searches for a project
history = await logging_service.get_search_history(
project_id="test_project",
limit=10
)
Performance Considerations¶
- Vector Index: The
ivfflatindex on embeddings provides approximate nearest neighbor search - Lists Parameter: Set to 100 for small datasets, increase for larger ones
- Embedding Dimensions: Fixed at 1536 for OpenAI's
text-embedding-ada-002model - Similarity Threshold: Adjust
match_thresholdbased on your quality requirements
Data Flow¶
- Search → Overton API returns policy documents
- Screening → AI filters for relevance
- Vectorization → Generate embeddings for document summaries
- Storage → Store in
documentsandchunkstables - RAG Query → Use
match_chunks()function for similarity search - Response → Generate cited answers with document links
This setup provides a robust foundation for semantic search over policy documents with proper citation and source linking capabilities.