Reasoning, RAG, and Agents
Why this matters
The previous lessons showed how to build and finetune language models. This lesson steps back and asks a practical question: how do you turn a language model into a more capable application?
Three ideas matter:
reasoning -> solve multi-step problems
RAG -> answer with external knowledge
agents -> plan actions and use tools
These are not replacements for pretraining or finetuning. They are ways to turn a language model into a more capable application.
Mental model
Think of an LLM as the language engine inside a larger system.
LLM alone:
prompt -> model -> answer
RAG system:
prompt -> retrieve relevant context -> model -> grounded answer
agent system:
goal -> plan -> use tools -> observe results -> revise plan -> answer or act
The more complex the task, the more the surrounding system matters.
Core ideas
- A custom-built LLM can classify, generate, summarize, explain, and follow instructions after suitable training or finetuning.
- Reasoning tasks require multiple intermediate steps, not just fact recall.
- Inference-time scaling improves answers by spending more compute during answering without changing model weights.
- Reinforcement learning improves reasoning by updating model weights from reward signals.
- Distillation transfers behavior from a larger model into a smaller one using supervised examples.
- RAG gives the model external context at prompt time.
- RAG is often the right approach for company-specific or frequently changing knowledge.
- A RAG pipeline depends on chunking, embeddings, retrieval, ranking, and prompt construction.
- Agents combine an LLM with tools, memory, and a processing loop.
- Tool calling means the model selects a named tool and supplies structured arguments.
- Agentic RAG lets a system retrieve from multiple sources, retry retrieval, use APIs, and evaluate intermediate results.
- Frameworks such as LangChain and LlamaIndex provide building blocks, but the architecture still needs evaluation and governance.
Walkthrough
From language models to reasoning models
The course has already covered several ways to shape a model:
pretraining -> general language ability
classification finetune -> fixed-label prediction
instruction finetune -> task-following behavior
preference tuning -> responses closer to human preferences
Reasoning models go further. They are expected to combine steps.
Simple fact question:
Reasoning question:
A train leaves at 09:15 and the trip takes 2 hours 40 minutes.
If the meeting starts 25 minutes after arrival, when does it start?
The second question requires intermediate work:
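As a sketch, the intermediate work for the train question can be written as explicit steps. The calendar date below is an arbitrary placeholder; only the clock arithmetic matters.

```python
from datetime import datetime, timedelta

# Step 1: departure time (the date is a placeholder, only the clock time matters).
departure = datetime(2024, 1, 1, 9, 15)

# Step 2: add the trip duration to get the arrival time.
arrival = departure + timedelta(hours=2, minutes=40)

# Step 3: add the gap between arrival and the start of the meeting.
meeting_start = arrival + timedelta(minutes=25)

print(arrival.strftime("%H:%M"))        # 11:55
print(meeting_start.strftime("%H:%M"))  # 12:20
```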
The model may show those steps, hide them, or use internal intermediate computation depending on the implementation.
Three ways to improve reasoning
The slides highlight three broad approaches.
Inference-time compute scaling improves answers at generation time.
Examples:
- ask for step-by-step reasoning
- sample several candidate answers and choose among them
- let the model revise or verify its first answer
- run search over possible solution paths
This costs more time and compute per answer, but does not retrain the model.
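A minimal sketch of one of these options, sampling several candidate answers and choosing by majority vote. The function `sample_answer` is a hypothetical stand-in for a real model call, and its answer distribution is invented for illustration.

```python
import random
from collections import Counter


def sample_answer(question: str, rng: random.Random) -> str:
    """Hypothetical stand-in for one sampled model answer.

    A real system would call an LLM with sampling enabled. Here we simulate a
    model that usually answers correctly and sometimes slips.
    """
    return rng.choices(["12:20", "12:10", "11:55"], weights=[6, 2, 2])[0]


def pick_majority(candidates: list[str]) -> str:
    """Return the most frequent candidate answer."""
    return Counter(candidates).most_common(1)[0][0]


def answer_with_voting(question: str, n_samples: int = 9, seed: int = 0) -> str:
    """Spend more compute per question: sample n answers, then vote."""
    rng = random.Random(seed)
    candidates = [sample_answer(question, rng) for _ in range(n_samples)]
    return pick_majority(candidates)


print(pick_majority(["12:20", "12:10", "12:20"]))  # 12:20
```

The model weights never change here. The only cost is that each question triggers `n_samples` generations instead of one.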
Reinforcement learning changes the model itself: its weights are updated from reward signals.
Rewards can be broad, such as user preference, or narrow and verifiable, such as correct math answers or passing code tests.
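The narrow, verifiable kind of reward can be sketched as plain functions. This covers only the reward computation, not the weight update, and both function names are illustrative assumptions.

```python
from typing import Callable


def exact_answer_reward(model_output: str, expected: str) -> float:
    """Reward 1.0 when the last non-empty line equals the expected answer."""
    lines = [line.strip() for line in model_output.splitlines() if line.strip()]
    return 1.0 if lines and lines[-1] == expected.strip() else 0.0


def unit_test_reward(candidate: Callable[[int, int], int],
                     cases: list[tuple[int, int, int]]) -> float:
    """Reward = fraction of (a, b, expected) cases the candidate passes."""
    passed = 0
    for a, b, expected in cases:
        try:
            if candidate(a, b) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate earns no credit for that case
    return passed / len(cases) if cases else 0.0


print(exact_answer_reward("09:15 + 2:40 = 11:55\n12:20", "12:20"))  # 1.0
print(unit_test_reward(lambda a, b: a + b, [(1, 2, 3), (2, 2, 4)]))  # 1.0
```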
Distillation uses a stronger model to train a smaller one:
large teacher model -> creates high-quality examples
smaller student model -> supervised finetuning on those examples
In LLM work, this often means generating instruction and reasoning examples with a larger model, then using them as supervised training data.
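A minimal sketch of the data-building half of distillation. The function `teacher_generate` is a hypothetical stand-in for a call to a larger model, and the length filter is an assumed placeholder for a real quality check.

```python
import json
from typing import Callable


def teacher_generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to a large teacher model."""
    return f"Step-by-step answer for: {prompt}"


def build_sft_examples(prompts: list[str],
                       teacher: Callable[[str], str],
                       min_chars: int = 10) -> list[dict]:
    """Collect teacher outputs as supervised (prompt, response) pairs."""
    examples = []
    for prompt in prompts:
        response = teacher(prompt).strip()
        if len(response) >= min_chars:  # crude quality filter, an assumption
            examples.append({"prompt": prompt, "response": response})
    return examples


examples = build_sft_examples(["When does the meeting start?"], teacher_generate)
# One JSON object per line is a common layout for finetuning data.
jsonl = "\n".join(json.dumps(example) for example in examples)
print(jsonl)
```

The student model would then be finetuned on these pairs with ordinary supervised training.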
Why RAG exists
An LLM's parameters are not a live database.
If you ask a model about:
- internal company policies
- this week's support tickets
- a private PDF handbook
- current project documentation
- a customer-specific contract
the model may not know the answer from training. Even worse, it may answer confidently anyway.
Retrieval-Augmented Generation, or RAG, handles this by adding relevant external information to the prompt.
The model does not need to memorize every document. It needs to use retrieved context well.
The RAG pipeline
A typical RAG pipeline has an indexing phase and a query phase.
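Indexing phase:
documents -> chunk -> embed chunks -> store embeddings
Query phase:
question -> embed query -> retrieve relevant chunks -> rank -> add top chunks to prompt -> generate answer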
A tiny example:
Document:
"Refunds are allowed within 30 days if the product is unused."
User question:
"Can I return an unused item after three weeks?"
Retrieved context:
"Refunds are allowed within 30 days if the product is unused."
Answer:
"Yes. Three weeks is within 30 days, assuming the product is unused."
The value is not that the model became smarter. The value is that the model received the right evidence.
Chunking
Chunking means splitting documents into pieces before embedding them.
Bad chunking can break retrieval.
Too small:
The pieces may lose context.
Too large:
Retrieval may bring too much irrelevant text.
Better chunk:
Chunking choices depend on the document type:
- paragraphs for prose
- sections for handbooks
- rows or records for structured data
- table-aware chunks for tabular PDFs
- page plus surrounding heading for slide decks
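A minimal sketch of paragraph-based chunking for prose, assuming paragraphs are separated by blank lines. The name `chunk_paragraphs` and the `max_chars` limit are illustrative choices, and the shipping sentence in the sample text is invented.

```python
def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Group consecutive paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for paragraph in paragraphs:
        candidate = "\n\n".join(current + [paragraph])
        if current and len(candidate) > max_chars:
            chunks.append("\n\n".join(current))  # close the current chunk
            current = [paragraph]
        else:
            current.append(paragraph)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


text = (
    "Refund policy\n\n"
    "Refunds are allowed within 30 days if the product is unused.\n\n"
    "Shipping\n\n"
    "Shipping takes 3 to 5 business days."  # hypothetical sample text
)
chunks = chunk_paragraphs(text, max_chars=80)
print(len(chunks))  # 2
```

With this limit each heading stays attached to the paragraph it introduces, so each chunk remains an answerable unit.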
Embeddings and retrieval
An embedding model turns text into vectors:
Similar meanings should land near each other in vector space.
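At query time, the system embeds the question, compares it with the stored chunk vectors, and returns the closest chunks.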
Retrieval can also use keyword methods such as BM25. Many real systems combine semantic search and keyword search because each catches different failures.
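A minimal sketch of vector retrieval with cosine similarity. The `embed` function is a toy word-count embedding used only so the example runs without a model; a real system would use a trained embedding model. The second and third chunks are invented sample data.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (an assumption, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Score every chunk against the query and return the k best."""
    query_vector = embed(query)
    scored = [(cosine(query_vector, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


chunks = [
    "Refunds are allowed within 30 days if the product is unused.",
    "Shipping takes 3 to 5 business days.",  # hypothetical
    "Support is available on weekdays.",     # hypothetical
]
top = retrieve("Can I return an unused item after three weeks?", chunks, k=1)
print(top[0][1])  # Refunds are allowed within 30 days if the product is unused.
```

The toy embedding matches only on the shared word "unused". It cannot tell that "return" and "refunds" are related, which is the gap a semantic embedding model is meant to close.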
Ranking and context assembly
Retrieving chunks is not enough. The system must decide what to send to the model.
The prompt often contains:
Example answering rule:
That rule does not guarantee perfect behavior, but it makes the desired behavior explicit.
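A minimal sketch of context assembly. The wording of the answering rule and the prompt layout below are assumed examples, not taken from the source material.

```python
def build_prompt(question: str, ranked_chunks: list[str], max_chunks: int = 3) -> str:
    """Assemble an answering rule, the top chunks, and the question into one prompt."""
    rule = (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you do not know."
    )
    # Number the chunks so an answer can point back to its evidence.
    context = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(ranked_chunks[:max_chunks], start=1)
    )
    return f"{rule}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


prompt = build_prompt(
    "Can I return an unused item after three weeks?",
    ["Refunds are allowed within 30 days if the product is unused."],
)
print(prompt)
```

Capping the number of chunks with `max_chunks` is one simple way to keep cost and latency bounded.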
RAG evaluation
RAG can fail in several places:
- chunks are badly formed
- embeddings do not represent the question well
- retrieval misses the right document
- ranking puts weak chunks above strong chunks
- the LLM misreads the context
- the answer is correct but too slow or expensive
Evaluate the pieces separately.
Retrieval evaluation asks:
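One common retrieval check is whether a relevant chunk appears among the top k results. A minimal sketch, assuming a small hand-labeled set of questions; the chunk IDs and labels are invented for illustration.

```python
def hit_rate_at_k(retrieved: dict[str, list[str]],
                  relevant: dict[str, set[str]],
                  k: int) -> float:
    """Fraction of questions with at least one relevant chunk in the top k."""
    hits = 0
    for question, ranked_ids in retrieved.items():
        if relevant[question] & set(ranked_ids[:k]):
            hits += 1
    return hits / len(retrieved) if retrieved else 0.0


# Hypothetical retrieval output: question -> chunk IDs in ranked order.
retrieved = {
    "refund window?": ["policy-3", "shipping-1", "policy-1"],
    "support hours?": ["shipping-2", "policy-1", "support-4"],
}
# Hypothetical labels: question -> chunk IDs that actually answer it.
relevant = {
    "refund window?": {"policy-3"},
    "support hours?": {"support-4"},
}
print(hit_rate_at_k(retrieved, relevant, k=1))  # 0.5
print(hit_rate_at_k(retrieved, relevant, k=3))  # 1.0
```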
Generation evaluation asks:
Operational evaluation asks:
Multimodal documents
Real documents are not always plain text. They can contain:
- tables
- diagrams
- scanned images
- forms
- charts
- screenshots
- mixed PDF layouts
A simple text-only parser may miss important information. More serious RAG systems often need OCR, table extraction, layout-aware parsing, image captioning, or multimodal models.
From RAG to agentic RAG
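Basic RAG usually follows a fixed path:
prompt -> retrieve relevant context -> model -> grounded answer
Agentic RAG gives the system more control: it can choose tools or sources, retrieve more than once, evaluate intermediate results, and revise its plan before answering.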
For example, a support assistant might:
1. search the product manual
2. notice the manual does not mention the user's error code
3. query the issue tracker
4. call an API to check service status
5. combine the evidence into a final answer
That is more flexible than a single retrieval call, but also harder to test and govern.
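The first steps of that support example can be sketched as a fallback across sources. The in-memory `MANUAL` and `ISSUES` lists and the error code are hypothetical, and the source order is fixed here for simplicity, whereas an agentic system would let the model choose the next source.

```python
from typing import Callable

# Hypothetical in-memory sources standing in for a manual and an issue tracker.
MANUAL = ["To reset the device, hold the power button for ten seconds."]
ISSUES = ["Error E42 appears when a firmware update is interrupted."]


def search_manual(query: str) -> list[str]:
    return [text for text in MANUAL if query.lower() in text.lower()]


def search_issue_tracker(query: str) -> list[str]:
    return [text for text in ISSUES if query.lower() in text.lower()]


def gather_evidence(query: str,
                    sources: list[tuple[str, Callable[[str], list[str]]]]
                    ) -> tuple[list[str], list[str]]:
    """Try each source in turn and stop at the first one that returns evidence."""
    trace = []
    for name, search in sources:
        trace.append(name)  # keep a log of which sources were consulted
        evidence = search(query)
        if evidence:  # evaluation step: did this source mention the query?
            return evidence, trace
    return [], trace


evidence, trace = gather_evidence(
    "E42", [("manual", search_manual), ("issue_tracker", search_issue_tracker)]
)
print(trace)     # ['manual', 'issue_tracker']
print(evidence)  # ['Error E42 appears when a firmware update is interrupted.']
```

The `trace` list is a small nod to testability: recording which sources were consulted makes the extra flexibility easier to audit.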
What an agent is
An LLM agent is usually an LLM combined with tools, memory, and a processing loop.
The LLM provides language understanding and planning. Tools provide actions.
Tools can include:
- web search
- database queries
- calculators
- code execution
- email sending
- file reading
- ticket creation
- APIs
The processing loop looks like:
goal -> plan -> use tools -> observe results -> revise plan -> answer or act
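As a sketch, the loop can be written in a few lines. The function `fake_model` is a scripted, hypothetical stand-in for the LLM's decision step, and `calculator` is a toy tool that only handles `a + b`.

```python
def calculator(expression: str) -> str:
    """Toy tool: adds two integers written as 'a + b'."""
    left, right = expression.split("+")
    return str(int(left) + int(right))


TOOLS = {"calculator": calculator}


def fake_model(goal: str, observations: list[str]) -> dict:
    """Hypothetical stand-in for an LLM choosing the next step."""
    if not observations:
        return {"action": "tool", "tool": "calculator", "input": "2 + 3"}
    return {"action": "finish", "answer": f"The result is {observations[-1]}."}


def run_agent(goal: str, max_steps: int = 5) -> str:
    """Decide, act, observe, and repeat until the model finishes or the budget ends."""
    observations: list[str] = []  # short-term memory for this task
    for _ in range(max_steps):
        decision = fake_model(goal, observations)
        if decision["action"] == "finish":
            return decision["answer"]
        tool = TOOLS[decision["tool"]]              # only registered tools can run
        observations.append(tool(decision["input"]))  # observe the tool result
    return "Stopped: step limit reached."


print(run_agent("Add 2 and 3"))  # The result is 5.
```

The `max_steps` cap matters in practice: without it, a model that never decides to finish would loop indefinitely.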
Tool calling
Tool calling works because the model is given a tool description.
Example:
Tool name: search_orders
Description: Find customer orders by customer ID.
Parameters:
customer_id: string
The model can then produce a structured call:
The application executes the tool and returns the result to the model.
The model should not be trusted to execute arbitrary actions directly. The application layer controls which tools exist, what arguments are allowed, and whether a human approval step is needed.
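A minimal sketch of the application layer for the `search_orders` tool described above. The JSON shape of the call (`name` plus `arguments`), the `TOOLS` registry, and the sample order data are assumptions for illustration; real tool-calling APIs define their own formats.

```python
import json


def search_orders(customer_id: str) -> list[dict]:
    """Find customer orders by customer ID (toy in-memory data)."""
    orders = {"C123": [{"order_id": "A1", "status": "shipped"}]}
    return orders.get(customer_id, [])


# The application, not the model, decides which tools exist and what they accept.
TOOLS = {
    "search_orders": {"function": search_orders, "parameters": {"customer_id": str}},
}


def execute_tool_call(raw_call: str):
    """Validate a model-produced tool call, then run it."""
    call = json.loads(raw_call)
    spec = TOOLS.get(call.get("name"))
    if spec is None:
        raise ValueError(f"unknown tool: {call.get('name')!r}")
    arguments = call.get("arguments", {})
    if set(arguments) != set(spec["parameters"]):
        raise ValueError("missing or unexpected arguments")
    for key, expected_type in spec["parameters"].items():
        if not isinstance(arguments[key], expected_type):
            raise ValueError(f"argument {key!r} must be {expected_type.__name__}")
    return spec["function"](**arguments)


model_output = '{"name": "search_orders", "arguments": {"customer_id": "C123"}}'
print(execute_tool_call(model_output))  # [{'order_id': 'A1', 'status': 'shipped'}]
```

The model only produces text. Every check and the actual function call happen in application code.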
Memory and human input
Agents may use memory to store useful information across steps or sessions:
short-term memory -> current task state
long-term memory -> durable facts, preferences, previous outcomes
Human-in-the-loop means the agent pauses when a human decision is needed.
This is especially important for expensive, irreversible, private, or risky actions.
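A minimal sketch of an approval gate. The tool names in `RISKY_TOOLS` and the `approve` callback are hypothetical; in a real application `approve` would prompt a person rather than return a scripted value.

```python
from typing import Callable

# Hypothetical tool names that should never run without approval.
RISKY_TOOLS = {"send_email", "issue_refund"}


def run_tool(tool_name: str,
             arguments: dict,
             execute: Callable[[str, dict], str],
             approve: Callable[[str, dict], bool]) -> str:
    """Execute a tool call, pausing for human approval on risky tools."""
    if tool_name in RISKY_TOOLS and not approve(tool_name, arguments):
        return "cancelled: approval not granted"
    return execute(tool_name, arguments)


def fake_execute(tool_name: str, arguments: dict) -> str:
    """Stand-in for real tool execution."""
    return f"executed {tool_name}"


def deny(tool_name: str, arguments: dict) -> bool:
    """Scripted reviewer that rejects every request."""
    return False


print(run_tool("issue_refund", {"amount": 40}, fake_execute, deny))
# cancelled: approval not granted
print(run_tool("read_file", {"path": "faq.txt"}, fake_execute, deny))
# executed read_file
```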
Agent protocols and frameworks
The slides mention two emerging protocol ideas:
- MCP (Model Context Protocol): a standard way for LLM applications to access tools and resources
- A2A (Agent2Agent): a standard for agent-to-agent communication
The big idea is interoperability. Instead of hardwiring every tool integration into every agent, protocols define common interfaces.
Frameworks provide implementation building blocks:
- LangChain
- LangGraph
- LangFlow
- AutoGen
- Semantic Kernel
- LlamaIndex
- AutoGPT
- CrewAI
- PydanticAI
- Spring AI
- Haystack
Frameworks help with orchestration, but they do not remove the need to understand your data, permissions, evaluation, and failure modes.
LangChain and LlamaIndex
LangChain is a general LLM application framework. Its common building blocks include:
- model interfaces
- prompt templates
- document loaders
- text splitters
- vector stores
- chains
- agents
- tracing and evaluation tools
LlamaIndex is more focused on connecting private data to LLMs for retrieval and query applications.
Rule of thumb:
general workflow or agent orchestration -> LangChain or LangGraph
document indexing and retrieval Q&A -> LlamaIndex
In practice, either can be used for many tasks. The right choice depends on the system shape and team preference.
Common traps
Do not assume reasoning is just longer output
A long answer can still be wrong. Reasoning quality depends on whether the intermediate steps are valid and lead to the answer.
Do not use RAG as a substitute for clean data
Retrieval quality depends on the source documents, chunking, metadata, parsing, and indexing. Messy inputs create messy answers.
Do not assume the retrieved chunks are correct
Retrieval can return outdated, irrelevant, or conflicting context. Ranking and source validation matter.
Do not send unlimited context to the model
More context can increase cost, latency, and confusion. Select the most relevant evidence deliberately.
Do not treat agents as magic automation
Agents are loops around an LLM. They need tool permissions, error handling, logging, evaluation, and human approval for risky actions.
Do not let tool calls bypass security
The application should validate tool arguments, restrict access, and require approval for irreversible or sensitive operations.
Do not choose a framework before knowing the workflow
LangChain, LlamaIndex, and similar tools are useful, but the system design should come from the data sources, task, evaluation needs, and deployment constraints.
Check yourself
What is inference-time compute scaling?
It is spending more computation during answering, such as sampling, reasoning, checking, or search, without changing the model weights.
How is reinforcement learning different from inference-time scaling?
Reinforcement learning updates model weights using reward signals. Inference-time scaling leaves weights unchanged and spends more compute while answering.
What problem does RAG solve?
It gives the model relevant external context at prompt time, which is useful for private, current, or domain-specific knowledge.
What are the main steps in a basic RAG pipeline?
Chunk documents, embed chunks, store embeddings, embed the query, retrieve relevant chunks, rank them, add top chunks to the prompt, and generate an answer.
Why can chunking make or break RAG?
If chunks are too small they lose context; if they are too large they add noise. Good chunks preserve answerable units of meaning.
What makes agentic RAG different from simple RAG?
Agentic RAG can choose tools or sources, retrieve more than once, evaluate intermediate results, and revise its plan before answering.
What is tool calling?
Tool calling is when an LLM chooses a named external function and supplies structured arguments so the application can execute it.
Why is human-in-the-loop useful for agents?
It lets the system pause for approval or judgment before risky, expensive, private, or irreversible actions.
Source anchors
notebooks/Module2/19-Reasoning, RAG, and Agents.pdf
study-guide/drafts/19-reasoning-rag-and-agents.md