Reasoning, RAG, and Agents¶

Why this matters¶

The previous lessons showed how to build and finetune language models. This lesson steps back and asks a practical question:

How do we make an LLM useful for harder real-world tasks?

Three ideas matter:

reasoning -> solve multi-step problems
RAG       -> answer with external knowledge
agents    -> plan actions and use tools

These are not replacements for pretraining or finetuning. They are ways to turn a language model into a more capable application.

Mental model¶

Think of an LLM as the language engine inside a larger system.

LLM alone:
prompt -> model -> answer

RAG system:
prompt -> retrieve relevant context -> model -> grounded answer

agent system:
goal -> plan -> use tools -> observe results -> revise plan -> answer or act

The more complex the task, the more the surrounding system matters.

Core ideas¶

A custom-built LLM can classify, generate, summarize, explain, and follow instructions after suitable training or finetuning.
Reasoning tasks require multiple intermediate steps, not just fact recall.
Inference-time scaling can improve answers by spending more compute during answering, without changing model weights.
Reinforcement learning can improve reasoning by updating model weights from reward signals.
Distillation transfers behavior from a larger model into a smaller one using supervised examples.
RAG gives the model external context at prompt time.
RAG is often the right approach for company-specific or frequently changing knowledge.
A RAG pipeline depends on chunking, embeddings, retrieval, ranking, and prompt construction.
Agents combine an LLM with tools, memory, and a processing loop.
Tool calling means the model selects a named tool and supplies structured arguments.
Agentic RAG lets a system retrieve from multiple sources, retry retrieval, use APIs, and evaluate intermediate results.
Frameworks such as LangChain and LlamaIndex provide building blocks, but the architecture still needs evaluation and governance.

Walkthrough¶

From language models to reasoning models¶

The course has already covered several ways to shape a model:

pretraining             -> general language ability
classification finetune -> fixed-label prediction
instruction finetune    -> task-following behavior
preference tuning       -> responses closer to human preferences

Reasoning models go further. They are expected to chain several intermediate steps to reach an answer.

Simple fact question:

What is the capital of France?

Reasoning question:

A train leaves at 09:15 and the trip takes 2 hours 40 minutes.
If the meeting starts 25 minutes after arrival, when does it start?

The second question requires intermediate work:

09:15 + 2:40 = 11:55
11:55 + 0:25 = 12:20
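
The arithmetic above can be checked with a few lines of Python. This is only a sanity check of the worked example, not a picture of how a model reasons. The calendar date is an arbitrary assumption, needed because `datetime` arithmetic requires a full date.

```python
from datetime import datetime, timedelta

# Departure time from the example. The date is arbitrary (an assumption).
departure = datetime(2024, 1, 1, 9, 15)

# Step 1: add the trip duration to get the arrival time.
arrival = departure + timedelta(hours=2, minutes=40)

# Step 2: add the gap between arrival and the meeting.
meeting_start = arrival + timedelta(minutes=25)

print(arrival.strftime("%H:%M"))        # 11:55
print(meeting_start.strftime("%H:%M"))  # 12:20
```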

The model may show those steps, hide them, or carry them out as internal computation, depending on the implementation.

Three ways to improve reasoning¶

The slides highlight three broad approaches.

Inference-time compute scaling improves answers at generation time:

same model weights + more answering effort -> often better answer

Examples:

ask for step-by-step reasoning
sample several candidate answers and choose among them
let the model revise or verify its first answer
run search over possible solution paths

This costs more time and compute per answer, but does not retrain the model.
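
One of the examples above, sampling several candidates and choosing among them, can be sketched as a majority vote. This is a minimal sketch: `fake_sample` is a hypothetical stand-in that replays canned answers, where a real system would call an LLM with sampling enabled.

```python
from collections import Counter
from typing import Callable

def self_consistency(sample: Callable[[str], str], question: str, n: int = 5) -> str:
    """Draw n candidate answers and return the most frequent one."""
    candidates = [sample(question) for _ in range(n)]
    return Counter(candidates).most_common(1)[0][0]

# Hypothetical sampler: replays fixed answers instead of calling a model.
_canned = iter(["12:20", "12:10", "12:20", "12:20", "12:30"])

def fake_sample(question: str) -> str:
    return next(_canned)

best = self_consistency(fake_sample, "When does the meeting start?")
print(best)  # 12:20
```

The model weights never change here. The only extra cost is the `n` sampling calls per question.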

Reinforcement learning changes the model:

model tries actions -> receives reward -> weights update

Rewards can be broad, such as user preference, or narrow and verifiable, such as correct math answers or passing code tests.
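
A narrow, verifiable reward can be as simple as an exact-match check. The sketch below assumes the model puts its final answer on the last non-empty line, which is a convention chosen for illustration. It computes only the reward; the weight update itself is not shown.

```python
def exact_match_reward(model_output: str, reference: str) -> float:
    """Return 1.0 if the last non-empty line equals the reference, else 0.0."""
    lines = [line.strip() for line in model_output.splitlines() if line.strip()]
    final_answer = lines[-1] if lines else ""
    return 1.0 if final_answer == reference.strip() else 0.0

good = "09:15 + 2:40 = 11:55\n11:55 + 0:25 = 12:20\n12:20"
bad = "The meeting starts at noon."

print(exact_match_reward(good, "12:20"))  # 1.0
print(exact_match_reward(bad, "12:20"))   # 0.0
```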

Distillation uses a stronger model to train a smaller one:

large teacher model -> creates high-quality examples
smaller student model -> supervised finetuning on those examples

In LLM work, this often means generating instruction and reasoning examples with a larger model, then using them as supervised training data.
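
The data-generation half of distillation can be sketched as follows. `fake_teacher` is a hypothetical placeholder for a call to a larger model, and the `instruction`/`response` field names are an assumed record layout, not a required format.

```python
import json
from typing import Callable

def build_sft_records(prompts: list[str], teacher: Callable[[str], str]) -> list[dict]:
    """Pair each prompt with the teacher's response as a supervised example."""
    return [{"instruction": p, "response": teacher(p)} for p in prompts]

# Hypothetical teacher: a real one would call a larger model.
def fake_teacher(prompt: str) -> str:
    return f"Let's work through it step by step. Question: {prompt}"

records = build_sft_records(["When does the meeting start?"], fake_teacher)

# One JSON object per line is a common layout for finetuning data.
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```

The student would then be finetuned on these records like on any other instruction dataset.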

Why RAG exists¶

An LLM's parameters are not a live database.

If you ask a model about:

internal company policies
this week's support tickets
a private PDF handbook
current project documentation
a customer-specific contract

the model may not know the answer from training. Even worse, it may answer confidently anyway.

Retrieval-Augmented Generation, or RAG, handles this by adding relevant external information to the prompt.

question -> retrieve relevant documents -> add them as context -> generate answer

The model does not need to memorize every document. It needs to use retrieved context well.

The RAG pipeline¶

A typical RAG pipeline has an indexing phase and a query phase.

Indexing phase:

documents -> chunks -> embeddings -> vector index/database

Query phase:

user query -> query embedding -> retrieve chunks -> rank chunks -> prompt LLM -> answer

A tiny example:

Document:
"Refunds are allowed within 30 days if the product is unused."

User question:
"Can I return an unused item after three weeks?"

Retrieved context:
"Refunds are allowed within 30 days if the product is unused."

Answer:
"Yes. Three weeks is within 30 days, assuming the product is unused."

The value is not that the model became smarter. The value is that the model received the right evidence.

Chunking¶

Chunking means splitting documents into pieces before embedding them.

Bad chunking can break retrieval.

Too small:

"Refunds are allowed"
"within 30 days"
"if the product is unused"

The pieces may lose context.

Too large:

whole 200-page policy manual

Retrieval may bring in too much irrelevant text.

Better chunk:

"Refunds are allowed within 30 days if the product is unused."

Chunking choices depend on the document type:

paragraphs for prose
sections for handbooks
rows or records for structured data
table-aware chunks for tabular PDFs
page plus surrounding heading for slide decks
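
For prose, paragraph-based chunking can be sketched as below. This is a minimal sketch that assumes paragraphs are separated by blank lines. The `max_chars` limit is an illustrative parameter, and a single paragraph longer than the limit is kept whole rather than cut mid-sentence.

```python
def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Fits, or this is an oversized paragraph starting a new chunk.
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks

policy = (
    "Refunds are allowed within 30 days if the product is unused.\n\n"
    "Shipping takes 3 to 5 business days.\n\n"
    "Gift cards cannot be refunded."
)
for chunk in chunk_paragraphs(policy, max_chars=80):
    print(repr(chunk))
```

With `max_chars=80`, the refund sentence stays intact as its own chunk and the two shorter paragraphs are packed together.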

Embeddings and retrieval¶

An embedding model turns text into vectors:

"refund within 30 days" -> [0.12, -0.04, 0.88, ...]

Similar meanings should land near each other in vector space.

At query time:

query -> query vector -> nearest chunks

Retrieval can also use keyword methods such as BM25. Many real systems combine semantic search and keyword search because each catches different failures.
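
Nearest-chunk retrieval can be sketched with cosine similarity. The `embed` function below is a toy stand-in (a bag-of-words count vector), not a real embedding model. It only matches shared words, so it behaves more like keyword search; a learned embedding model would also place "return" near "refund".

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: word counts as a sparse vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Refunds are allowed within 30 days if the product is unused.",
    "Shipping takes 3 to 5 business days.",
    "Gift cards cannot be refunded.",
]
top = retrieve("Can I return an unused item after three weeks?", chunks)
print(top[0])  # Refunds are allowed within 30 days if the product is unused.
```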

Ranking and context assembly¶

Retrieving chunks is not enough. The system must decide what to send to the model.

candidate chunks -> ranking -> top chunks -> prompt context

The prompt often contains:

system instruction
retrieved context
user question
answering rules

Example answering rule:

Answer only from the provided context. If the context is insufficient, say so.

That rule does not guarantee perfect behavior, but it makes the desired behavior explicit.
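
The prompt parts listed above can be assembled with a small helper. The layout, the numbered `[1]` context labels, and the system instruction text are illustrative assumptions. Only the answering rule is taken from this lesson.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble system instruction, context, question, and answering rule."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "You are a support assistant.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer only from the provided context. "
        "If the context is insufficient, say so."
    )

prompt = build_prompt(
    "Can I return an unused item after three weeks?",
    ["Refunds are allowed within 30 days if the product is unused."],
)
print(prompt)
```

Numbering the chunks makes it easier to ask the model to point at the evidence it used.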

RAG evaluation¶

RAG can fail in several places:

chunks are badly formed
embeddings do not represent the question well
retrieval misses the right document
ranking puts weak chunks above strong chunks
the LLM misreads the context
the answer is correct but too slow or expensive

Evaluate the pieces separately.

Retrieval evaluation asks:

Did the system retrieve the documents that contain the answer?

Generation evaluation asks:

Given the retrieved context, did the model answer correctly and ground the answer in the evidence?

Operational evaluation asks:

Is the system fast, cheap, secure, and reliable enough?
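
Retrieval evaluation can start with a simple metric such as recall at k over a small labeled set. The document IDs and the two test cases below are made up for illustration.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

# Hypothetical labeled cases: what was retrieved vs. what holds the answer.
eval_cases = [
    {"retrieved": ["refund-policy", "shipping", "gift-cards"], "relevant": {"refund-policy"}},
    {"retrieved": ["shipping", "gift-cards", "warranty"], "relevant": {"warranty"}},
]

scores = [recall_at_k(c["retrieved"], c["relevant"], k=2) for c in eval_cases]
mean_recall = sum(scores) / len(scores)

print(scores)       # [1.0, 0.0]
print(mean_recall)  # 0.5
```

The second case shows a ranking failure: the right document was retrieved, but it sat below the cutoff.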

Multimodal documents¶

Real documents are not always plain text. They can contain:

tables
diagrams
scanned images
forms
charts
screenshots
mixed PDF layouts

A simple text-only parser may miss important information. More serious RAG systems often need OCR, table extraction, layout-aware parsing, image captioning, or multimodal models.

From RAG to agentic RAG¶

Basic RAG usually follows a fixed path:

retrieve once -> answer once

Agentic RAG gives the system more control:

decide source -> retrieve -> inspect result -> retrieve again if needed -> use tool -> answer

For example, a support assistant might:

1. search the product manual
2. notice the manual does not mention the user's error code
3. query the issue tracker
4. call an API to check service status
5. combine the evidence into a final answer

That is more flexible than a single retrieval call, but also harder to test and govern.
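
The support-assistant flow can be sketched as a loop that tries sources until the evidence looks sufficient. Every source function here is a hypothetical stub, and the fixed source order and the "two pieces of evidence" rule stand in for decisions an LLM would make in a real agentic system.

```python
from typing import Callable, Optional

# Hypothetical sources: each returns evidence text, or None if nothing is found.
def search_manual(query: str) -> Optional[str]:
    return None  # the manual does not mention the error code

def search_issue_tracker(query: str) -> Optional[str]:
    return "E42 is a known login bug." if "E42" in query else None

def check_service_status(query: str) -> Optional[str]:
    return "All services operational."

def gather_evidence(query, sources, is_sufficient: Callable[[list], bool]) -> list:
    """Try each source in turn, inspect the result, stop once evidence suffices."""
    evidence = []
    for name, source in sources:
        result = source(query)
        if result:  # keep only sources that returned something
            evidence.append((name, result))
        if is_sufficient(evidence):
            break
    return evidence

sources = [
    ("manual", search_manual),
    ("issue_tracker", search_issue_tracker),
    ("status_api", check_service_status),
]
evidence = gather_evidence("Error E42 on login", sources, lambda ev: len(ev) >= 2)
print([name for name, _ in evidence])  # ['issue_tracker', 'status_api']
```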

What an agent is¶

An LLM agent is usually:

LLM + tools + memory + processing loop

The LLM provides language understanding and planning. Tools provide actions.

Tools can include:

web search
database queries
calculators
code execution
email sending
file reading
ticket creation
APIs

The processing loop looks like:

receive goal
plan next step
call a tool
observe result
update plan
repeat until done
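
The loop above can be sketched in a few lines. `scripted_policy` is a hypothetical stand-in for the LLM's planning step, and the one-operation `calculator` is a deliberately tiny tool. The step limit is an assumption added so the loop always terminates.

```python
def calculator(expression: str) -> str:
    """Tiny tool: adds two integers written as 'a + b'."""
    left, right = expression.split("+")
    return str(int(left) + int(right))

TOOLS = {"calculator": calculator}

def scripted_policy(goal: str, observations: list[str]) -> dict:
    """Hypothetical planner: a real agent would ask an LLM for the next step."""
    if not observations:
        return {"action": "call_tool", "tool": "calculator", "input": "15 + 27"}
    return {"action": "finish", "answer": f"The result is {observations[-1]}."}

def run_agent(goal: str, policy, tools: dict, max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):
        step = policy(goal, observations)                # plan next step
        if step["action"] == "finish":
            return step["answer"]
        result = tools[step["tool"]](step["input"])      # call a tool
        observations.append(result)                      # observe result
    return "Stopped: step limit reached."

print(run_agent("Add 15 and 27", scripted_policy, TOOLS))  # The result is 42.
```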

Tool calling¶

Tool calling works because the model is given a tool description.

Example:

Tool name: search_orders
Description: Find customer orders by customer ID.
Parameters:
  customer_id: string

The model can then produce a structured call:

{
  "tool": "search_orders",
  "arguments": {
    "customer_id": "C1024"
  }
}

The application executes the tool and returns the result to the model.

The model should not be trusted to execute arbitrary actions directly. The application layer controls which tools exist, what arguments are allowed, and whether a human approval step is needed.
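
The application-side checks can be sketched as an allowlist plus argument validation. The registry layout and the in-memory order table are illustrative assumptions. `search_orders` and its `customer_id` parameter come from the example above.

```python
import json

def search_orders(customer_id: str) -> list[dict]:
    """Hypothetical tool backed by a tiny in-memory table."""
    orders = {"C1024": [{"order_id": "O-1", "status": "shipped"}]}
    return orders.get(customer_id, [])

# Allowlist: tool name -> (function, expected argument names and types).
TOOL_REGISTRY = {"search_orders": (search_orders, {"customer_id": str})}

def execute_tool_call(raw_call: str):
    """Validate a model-produced tool call, then run it."""
    call = json.loads(raw_call)
    name = call.get("tool")
    if name not in TOOL_REGISTRY:
        raise ValueError(f"Unknown tool: {name!r}")
    func, schema = TOOL_REGISTRY[name]
    args = call.get("arguments", {})
    if not isinstance(args, dict) or set(args) != set(schema):
        raise ValueError(f"Arguments must be exactly: {sorted(schema)}")
    for key, expected_type in schema.items():
        if not isinstance(args[key], expected_type):
            raise ValueError(f"Argument {key!r} must be {expected_type.__name__}")
    return func(**args)

raw = '{"tool": "search_orders", "arguments": {"customer_id": "C1024"}}'
print(execute_tool_call(raw))  # [{'order_id': 'O-1', 'status': 'shipped'}]
```

Anything the model names that is not in the registry is rejected before any code runs.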

Memory and human input¶

Agents may use memory to store useful information across steps or sessions:

short-term memory -> current task state
long-term memory  -> durable facts, preferences, previous outcomes

Human-in-the-loop means the agent pauses when a human decision is needed:

draft email -> ask human to approve -> send only after approval

This is especially important for expensive, irreversible, private, or risky actions.
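
The approval step can be sketched as a gate in front of the action. The `approve` and `send` callables are assumptions: a real system might show the draft in a UI and call an email API, while this example uses lambdas and a list.

```python
from typing import Callable

def send_with_approval(
    draft: str,
    approve: Callable[[str], bool],
    send: Callable[[str], None],
) -> bool:
    """Send the draft only if the human approver accepts it."""
    if not approve(draft):
        return False  # nothing is sent without approval
    send(draft)
    return True

outbox: list[str] = []
sent = send_with_approval("Hi, your refund was approved.", lambda d: True, outbox.append)
blocked = send_with_approval("Delete all records.", lambda d: False, outbox.append)

print(sent, blocked)  # True False
print(outbox)         # ['Hi, your refund was approved.']
```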

Agent protocols and frameworks¶

The slides mention two emerging protocol ideas:

MCP (Model Context Protocol): a standard way for LLM applications to access tools and resources
A2A (Agent2Agent): a standard for agent-to-agent communication

The big idea is interoperability. Instead of hardwiring every tool integration into every agent, protocols define common interfaces.

Frameworks provide implementation building blocks:

LangChain
LangGraph
LangFlow
AutoGen
Semantic Kernel
LlamaIndex
AutoGPT
CrewAI
PydanticAI
Spring AI
Haystack

Frameworks help with orchestration, but they do not remove the need to understand your data, permissions, evaluation, and failure modes.

LangChain and LlamaIndex¶

LangChain is a general LLM application framework. Its common building blocks include:

model interfaces
prompt templates
document loaders
text splitters
vector stores
chains
agents
tracing and evaluation tools

LlamaIndex is more focused on connecting private data to LLMs for retrieval and query applications.

Rule of thumb:

general workflow or agent orchestration -> LangChain or LangGraph
document indexing and retrieval Q&A     -> LlamaIndex

In practice, either can be used for many tasks. The right choice depends on the system shape and team preference.

Common traps¶

Do not assume reasoning is just longer output

A long answer can still be wrong. Reasoning quality depends on whether the intermediate steps are valid and lead to the answer.

Do not use RAG as a substitute for clean data

Retrieval quality depends on the source documents, chunking, metadata, parsing, and indexing. Messy inputs create messy answers.

Do not assume the retrieved chunks are correct

Retrieval can return outdated, irrelevant, or conflicting context. Ranking and source validation matter.

Do not send unlimited context to the model

More context can increase cost, latency, and confusion. Select the most relevant evidence deliberately.

Do not treat agents as magic automation

Agents are loops around an LLM. They need tool permissions, error handling, logging, evaluation, and human approval for risky actions.

Do not let tool calls bypass security

The application should validate tool arguments, restrict access, and require approval for irreversible or sensitive operations.

Do not choose a framework before knowing the workflow

LangChain, LlamaIndex, and similar tools are useful, but the system design should come from the data sources, task, evaluation needs, and deployment constraints.

Check yourself¶

What is inference-time compute scaling?

It is spending more computation during answering, such as sampling, reasoning, checking, or search, without changing the model weights.

How is reinforcement learning different from inference-time scaling?

Reinforcement learning updates model weights using reward signals. Inference-time scaling leaves weights unchanged and spends more compute while answering.

What problem does RAG solve?

It gives the model relevant external context at prompt time, which is useful for private, current, or domain-specific knowledge.

What are the main steps in a basic RAG pipeline?

Chunk documents, embed chunks, store embeddings, embed the query, retrieve relevant chunks, rank them, add top chunks to the prompt, and generate an answer.

Why can chunking make or break RAG?

If chunks are too small they lose context; if they are too large they add noise. Good chunks preserve answerable units of meaning.

What makes agentic RAG different from simple RAG?

Agentic RAG can choose tools or sources, retrieve more than once, evaluate intermediate results, and revise its plan before answering.

What is tool calling?

Tool calling is when an LLM chooses a named external function and supplies structured arguments so the application can execute it.

Why is human-in-the-loop useful for agents?

It lets the system pause for approval or judgment before risky, expensive, private, or irreversible actions.

Source anchors¶

notebooks/Module2/19-Reasoning, RAG, and Agents.pdf
study-guide/drafts/19-reasoning-rag-and-agents.md