SQL vs Pandas¶

Why this matters¶

SQL and Pandas often solve the same visible problem: "get the rows I need, combine tables, summarize the result." The important skill is not memorizing two APIs. The important skill is knowing where the work should happen.

Use SQL when the data already lives in a database, the task is mostly filtering, joining, aggregating, or the dataset is too large to comfortably hold in memory. Use Pandas when the data is already in Python and the task needs flexible cleaning, reshaping, exploration, plotting, or model preparation.

The notebook uses movie data to compare both styles. This lesson keeps that idea, but uses a smaller teaching example so the concepts are easier to see on a phone.

Mental model¶

Think of SQL as a request to a database engine:

"Database, find this subset, join these tables, group the result, and send me the answer."

Think of Pandas as direct manipulation of data already loaded into Python:

"I have these DataFrames in memory. I will transform them step by step."

That difference matters. SQL describes the result you want and lets the database plan the work. Pandas gives you a rich Python object, but every step happens in your Python process and usually in memory.

Core ideas¶

A SQL table and a Pandas DataFrame both represent rectangular data, but they are not the same thing.
SQL is usually stronger for querying persistent relational data.
Pandas is usually stronger for exploratory analysis, cleaning, reshaping, and model preparation.
Many operations exist in both worlds: filtering, selecting columns, joining, grouping, and aggregating.
Scale changes the decision. A Pandas workflow that is pleasant on a sample can fail when the full dataset does not fit in memory.
Moving data between SQL and Pandas is normal: query in SQL, analyze or model in Pandas.

Walkthrough¶

The tiny movie dataset¶

Imagine three tables.

movies

movie_id	title	genre
1	Toy Story	Animation
2	Matrix	Sci-Fi
3	Casablanca	Drama

users

user_id	gender	age
10	F	25
11	M	34
12	F	41

ratings

user_id	movie_id	rating
10	1	5
11	1	4
12	2	5
10	3	3

In SQL, these are database tables. In Pandas, they are DataFrames. The shape is similar, but the execution model is different.

What SQL is good at¶

SQL is designed around relational questions:

Which rows match a condition?
Which columns do I need?
How do records from different tables connect?
What summary do I get per group?

The database can use indexes, query planning, and disk-backed storage. You usually do not pull all rows into Python just to discard most of them.

What Pandas is good at¶

Pandas is designed around analysis inside Python:

Inspect a sample quickly.
Clean messy values.
Create derived columns.
Reshape data.
Feed data into plotting or machine-learning libraries.

Pandas is flexible, but that flexibility has a cost. Data must fit into memory comfortably, and chained operations are not optimized like a database query plan.

A practical rule¶

Start with this question:

Where is the data, and where should the work happen?

If the data is in a database and you need a filtered, joined, aggregated result, do that work in SQL first. If the result is now a manageable analysis table, bring it into Pandas.

If the data is already in a small CSV or notebook DataFrame and you are exploring, cleaning, or preparing features, Pandas is usually the faster way to think.

Explained code examples¶

The examples below use matching SQL and Pandas operations. The goal is to compare the thinking, not memorize every possible syntax variant.

Setup in Pandas¶

import pandas as pd

movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Toy Story", "Matrix", "Casablanca"],
    "genre": ["Animation", "Sci-Fi", "Drama"],
})

ratings = pd.DataFrame({
    "user_id": [10, 11, 12, 10],
    "movie_id": [1, 1, 2, 3],
    "rating": [5, 4, 5, 3],
})

This creates the same kind of rectangular data that SQL tables store. The difference is location: these DataFrames live in Python memory.

Filter rows¶

SQL:

SELECT *
FROM ratings
WHERE rating = 5;

Pandas:

ratings[ratings["rating"] == 5]

Both ask for rating rows where rating equals 5. In SQL, the database evaluates the condition. In Pandas, Python filters the in-memory DataFrame.

Select columns¶

SQL:

SELECT title, genre
FROM movies;

Pandas:

movies[["title", "genre"]]

Both keep only the title and genre columns. SQL names the desired output columns in the SELECT clause. Pandas indexes the DataFrame with a list of column names.

Join tables¶

SQL:

SELECT ratings.user_id, movies.title, ratings.rating
FROM ratings
JOIN movies
  ON ratings.movie_id = movies.movie_id;

Pandas:

ratings.merge(movies, on="movie_id")[["user_id", "title", "rating"]]

Both connect ratings to movie titles through movie_id. This is the core relational move: one table contains events or measurements, and another table explains what the IDs mean.

Group and aggregate¶

SQL:

SELECT movies.genre, AVG(ratings.rating) AS avg_rating
FROM ratings
JOIN movies
  ON ratings.movie_id = movies.movie_id
GROUP BY movies.genre;

Pandas:

ratings.merge(movies, on="movie_id").groupby("genre")["rating"].mean()

Both calculate average rating per genre. SQL expresses the grouping with GROUP BY; Pandas builds a merged DataFrame, groups by genre, selects rating, and takes the mean.

Move from SQL to Pandas¶

import sqlite3

connection = sqlite3.connect("Data/moviedata.db")

query = """
SELECT movies.title, movies.genre, ratings.rating
FROM ratings
JOIN movies
  ON ratings.movie_id = movies.movie_id
WHERE ratings.rating >= 4;
"""

analysis_df = pd.read_sql_query(query, connection)

This is often the best hybrid workflow. SQL reduces the database to the analysis table you need. Pandas then receives a smaller, cleaner DataFrame for exploration, plotting, or model preparation.

Common traps¶

Pandas is just SQL in Python.

Pandas can express many SQL-like operations, but it is not a database engine. It does not automatically give you database storage, indexes, transactions, or query planning.

If I know Pandas, I do not need SQL.

Many real datasets live in databases. Pulling entire tables into Pandas can be slow, expensive, or impossible. SQL lets you reduce the data before Python sees it.

If SQL can do it, Pandas should not.

Pandas is often the better tool once the data is already in memory and the task becomes exploratory, messy, or model-oriented.

Equivalent output means equivalent cost.

A SQL query and a Pandas chain can produce the same rows while using very different amounts of memory and compute.

More syntax variants mean better understanding.

For study, focus on the operation: filter, select, join, group, aggregate, move data. Syntax variants are secondary.

Check yourself¶

Your data is in PostgreSQL and you only need rows from last month. SQL or Pandas first?

SQL first. Filter in the database so Python receives only the relevant rows.

You have a small CSV loaded in a notebook and want to try several cleaning rules. SQL or Pandas?

Pandas. The data is already in Python, and cleaning/exploration is where Pandas is comfortable.

What is the Pandas equivalent of a SQL join?

Usually DataFrame.merge(...), with a shared key such as movie_id.

Why can a Pandas workflow fail when a SQL workflow succeeds?

Pandas usually works in memory inside one Python process. A database can use disk-backed storage, indexes, and query planning.

What is a good hybrid workflow?

Use SQL to filter, join, and aggregate large database tables; then load the smaller result into Pandas for analysis or modeling.

What is the main decision question in this lesson?

Where is the data, and where should the work happen?

Source anchors¶

Source file: notebooks/Module2/02a-SQL vs Pandas.ipynb
Source dataset: notebooks/Module2/Data/MoviesDB
Key source concepts: DataFrames versus database tables, SQLite connection, moving data between SQLite and Pandas, selection, filtering, joins, aggregation, choosing SQL or Pandas by task
Source image: study-guide/docs/assets/extracted/SQLvsPandas.jpg