Modern Data Management 2¶

New to vector similarity search?

Read the two bridge lessons first: Vector Similarity Search: First Intuition and Vector Search Indexes: Why ANN Exists.

Why this matters¶

Modern data management is no longer only about storing rows and writing SQL by hand. AI systems now sit next to databases in two important ways:

LLMs can help generate, explain, debug, and validate database queries.
Vector databases store embeddings so applications can search by meaning, not just exact keywords.

This lesson connects the database lessons with the later LLM lessons. The central idea is:

Databases are becoming part of AI systems, and AI systems increasingly depend on databases.

Mental model¶

There are two big patterns in this lecture.

First, AI as a database interface:

human question -> LLM -> SQL or NoSQL query -> database -> result -> explanation

Second, database as AI memory:

raw content -> embeddings -> vector database -> similarity search -> retrieved context -> LLM answer

The first pattern helps people ask better questions of structured data. The second pattern helps LLMs use external information without retraining.

Core ideas¶

Text-to-SQL works much better when the model receives schema context.
LLM-generated SQL should be validated before execution.
Read-only database access is the safe default for AI query assistants.
Embeddings are numerical vectors that represent semantic meaning.
Vector databases store embeddings plus the original data or metadata behind them.
Similarity search finds vectors close to a query vector.
Exact nearest-neighbor search is too expensive at large scale.
Approximate Nearest Neighbor algorithms trade a little accuracy for much faster search.
HNSW, IVF, and PQ are common ideas behind scalable vector search.
Vector databases enable semantic search, recommendations, and Retrieval-Augmented Generation.

Walkthrough¶

GenAI as a SQL copilot¶

General-purpose LLMs are often good at SQL-related tasks:

writing SQL from natural language
explaining existing SQL
translating between SQL dialects
debugging syntax errors
suggesting query optimizations

But the quality depends heavily on context. A prompt like:

Find the top 5 customers by revenue.

is underspecified. The model needs to know the schema:

table names
column names
relationships
SQL dialect
sometimes sample rows

With schema context, the model can produce much more accurate SQL.

Live SQL execution needs safety gates¶

An LLM connected to a real database is powerful, but risky. The lecture's safe pattern is:

User asks a natural-language question.
LLM generates SQL.
System validates the SQL.
Query runs through a read-only connection.
Result is returned.
LLM explains the result.

The validation step should reject dangerous statements:

INSERT
UPDATE
DELETE
DROP
ALTER
TRUNCATE

It should also restrict which tables can be queried. For a study guide, the key rule is:

Never let generated SQL execute unchecked against an important database.

Text-to-SQL tools¶

The lecture mentions several approaches and tools, including SQL-specific models, Vanna, DataLine, Wren, MindsDB, DB-GPT, and LangChain SQL chains or agents.

Do not memorize the tool list as permanent. Tools change quickly. Remember the pattern:

provide schema context
generate a query
validate it
run it safely
explain the result
recover from errors when possible

Embeddings: meaning as vectors¶

LLMs and vector databases need numerical representations of content. An embedding maps text, images, audio, or other data into a vector.

Example:

"database schema" -> [0.12, -0.44, 0.08, ...]

The vector values are not meant to be read directly by humans. They are useful because similar meanings tend to produce nearby vectors.

For text, the rough pipeline is:

text -> tokens -> neural encoder -> embedding vector -> optional normalization

For images, audio, graphs, or tabular data, the preprocessing and encoder differ, but the goal is the same: produce a fixed-length vector that supports comparison.

What a vector database stores¶

A vector database does not only store vectors. It usually stores:

the embedding vector
the original content or a pointer to it
metadata such as source, title, timestamp, category, or permissions

That matters because a search result should not just say:

vector 483920 is close to your query

It should return the underlying document, image, product, or passage that vector represents.

Similarity search¶

Vector search answers:

Which stored vectors are closest to this query vector?

This enables:

semantic text search
image similarity search
recommendation systems
"similar products"
document retrieval for RAG

Exact nearest-neighbor search compares the query vector against every stored vector. That is expensive when there are millions or billions of vectors.

This is why vector databases use Approximate Nearest Neighbor search.

Why approximate search is needed¶

High-dimensional search is hard.

In low dimensions, distances are intuitive. In high dimensions, distance becomes less discriminative: nearest and farthest points can start looking less meaningfully different. Indexes that work well for low-dimensional data also become less effective.

Approximate Nearest Neighbor algorithms accept this trade-off:

slightly less exact result -> much faster search

For many AI applications, that trade-off is acceptable. A search result that is very close and returned in milliseconds is often more useful than the mathematically perfect nearest neighbor returned too late.

HNSW: search through a small-world graph¶

HNSW stands for Hierarchical Navigable Small World.

Mental model:

Build a graph where vectors are nodes, nearby vectors are connected, and search moves through the graph toward the query.

HNSW uses layers:

upper layers are sparse and allow large jumps
lower layers are denser and refine the search

Search starts high, moves greedily toward something closer to the query, then descends layer by layer.

This is inspired by skip lists: use coarse shortcuts first, then refine.

HNSW is popular because it gives strong recall and fast search for many dynamic vector datasets.

IVF: search only promising clusters¶

IVF stands for Inverted File Index.

Mental model:

Cluster the vector space, then search only the most relevant clusters.

The rough procedure:

Use clustering, often k-means, to create centroids.
Assign each vector to its nearest centroid.
For a query, find the nearest centroids.
Search inside those clusters.

This reduces the search space. Instead of comparing against every vector, the system compares against candidates from selected clusters.

IVF can be memory-efficient and works especially well when the dataset is relatively static.

PQ: compress vectors¶

PQ stands for Product Quantization.

Mental model:

Split each vector into segments, replace each segment with a compact code, and estimate distances using those codes.

Why do this? Memory.

If every vector has hundreds or thousands of floating-point numbers, storing billions of vectors gets expensive. PQ compresses vectors so large-scale search becomes more practical.

The trade-off is accuracy. PQ estimates distances from compressed representations, so the answer is approximate.

RAG: vector databases as external memory¶

RAG stands for Retrieval-Augmented Generation.

The problem:

an LLM has fixed training knowledge
retraining constantly is expensive
prompts can only contain limited context

RAG solves this by retrieving relevant external information at question time.

user question
-> embed the question
-> search vector database
-> retrieve relevant passages
-> put passages into the prompt
-> LLM generates answer using retrieved context

This is why vector databases are important AI infrastructure. They let an LLM use current or private knowledge without baking all of it into model weights.

Explained code examples¶

Text-to-SQL safety sketch¶

FORBIDDEN = {"insert", "update", "delete", "drop", "alter", "truncate"}


def is_read_only_sql(sql):
    lowered = sql.lower()
    return lowered.strip().startswith("select") and not any(
        word in lowered for word in FORBIDDEN
    )

This is only a teaching sketch, not a full SQL validator. Real systems should parse SQL properly, restrict database permissions, use read-only credentials, and log generated queries.

The lesson idea is:

LLM output is not trusted just because it looks like SQL.

Vector search workflow¶

documents = [
    "PCA reduces dimensions by preserving variance.",
    "MongoDB stores JSON-like documents.",
    "HNSW is a graph-based ANN algorithm.",
]

# Pseudocode: exact API depends on the embedding model and vector database.
vectors = embedding_model.embed(documents)
vector_db.add(vectors=vectors, payloads=documents)

query_vector = embedding_model.embed(["How does approximate vector search work?"])
matches = vector_db.search(query_vector, top_k=3)

What this teaches:

raw content is embedded
embeddings are stored with payloads
the query is embedded too
search returns nearby stored vectors and their payloads

RAG workflow¶

question = "Why do vector databases use ANN?"

query_vector = embedding_model.embed([question])
passages = vector_db.search(query_vector, top_k=5)

prompt = build_prompt(question=question, context=passages)
answer = llm.generate(prompt)

The LLM does not search the database by itself. The application retrieves context and supplies it to the model.

Common traps¶

Text-to-SQL means no one needs to understand SQL.

You still need SQL literacy to validate generated queries, recognize wrong joins, and understand results.

If an LLM writes a query, it is safe to run.

Generated SQL needs validation, permissions, and read-only execution by default.

A vector database stores only vectors.

It also needs payloads or references to the original data, otherwise search results are not useful.

Similarity search is keyword search.

Keyword search matches terms. Vector search matches semantic closeness in embedding space.

Approximate means bad.

Approximate search is a speed/accuracy trade-off. It can be the right choice at scale.

HNSW, IVF, and PQ are interchangeable names.

They solve vector search differently: graph navigation, cluster narrowing, and compression.

RAG updates the LLM's weights.

RAG usually does not change model weights. It retrieves external context and adds it to the prompt.

Check yourself¶

Why does Text-to-SQL improve when the LLM receives schema context?

The model needs table names, column names, relationships, and dialect details to generate valid and relevant SQL.

What is the safest default database connection for LLM-generated SQL?

A read-only connection with validation that rejects dangerous statements and restricts allowed tables.

What is an embedding?

A numerical vector representation of content such as text, images, or audio, designed so similar content has nearby vectors.

Why is exact nearest-neighbor search expensive at large scale?

It compares the query vector against every stored vector, which becomes too slow for millions or billions of vectors.

What trade-off does ANN make?

It gives up guaranteed exact nearest neighbors in exchange for much faster search.

How does HNSW search?

It navigates a layered small-world graph, using sparse upper layers for fast movement and dense lower layers for refinement.

How does IVF reduce search cost?

It clusters vectors and searches only the nearest candidate clusters instead of the whole dataset.

What is the role of a vector database in RAG?

It retrieves semantically relevant external content that is added to the LLM prompt as context.

Source anchors¶

Source file: notebooks/Module2/04-Modern Data Management 2.pdf
Key source concepts: Text-to-SQL, SQL safety validation, LangChain SQL chains/agents, embeddings, vector databases, similarity search, Approximate Nearest Neighbor search, HNSW, NSW, skip lists, IVF, PQ, semantic search, recommendation systems, RAG, vector database systems
Source pages: 3-11 for GenAI and databases, 13-20 for embeddings and vector database fundamentals, 21-42 for ANN algorithms, 43-47 for applications and RAG, 56-57 for summary