Skip to content

Interfacing Python and MongoDB

Why this matters

MySQL stores data in tables. MongoDB stores data as documents. That changes how you model data, query data, and move data into Python.

This lesson teaches the practical workflow:

connect to MongoDB, choose a database and collection, insert documents, query documents, project fields, sort/limit results, and convert selected documents to Pandas when analysis needs it.

The notebook uses movie JSON data and an Airbnb dataset. The important part is not every cleaning cell. The important part is learning how document-shaped data flows between MongoDB, PyMongo, and Pandas.

Mental model

MongoDB is a document database.

MongoDB server
  database
    collection
      document
        field

A document is similar to a JSON object:

{
    "title": "MongoDB and Python",
    "tags": ["mongodb", "database", "NoSQL"],
    "viewers": 104
}

Unlike a relational row, a document can naturally contain nested objects and arrays. That is MongoDB's main comfort zone: flexible, nested, semi-structured data.

Core ideas

  • MongoDB stores JSON-like documents in collections.
  • A collection is roughly comparable to a table, but documents in the same collection do not need identical fields.
  • PyMongo is the common Python driver for MongoDB.
  • MongoClient connects Python to the MongoDB server.
  • insert_one inserts one document; insert_many inserts many documents.
  • find returns a cursor, not a plain list.
  • A query is usually a Python dictionary.
  • A projection selects which fields to include or exclude.
  • Operators such as $gt, $exists, and $all express richer conditions.
  • Convert MongoDB results to Pandas only after selecting a manageable, analysis-ready subset.

Walkthrough

Documents vs relational rows

In a relational database, related information is often split across tables.

In MongoDB, related information can be embedded inside one document.

listing = {
    "name": "Small apartment near the center",
    "address": {
        "country": "Spain",
        "market": "Barcelona"
    },
    "amenities": ["Wifi", "Kitchen", "Washer"],
    "review_scores": {
        "review_scores_rating": 92
    }
}

This shape would be awkward as one flat SQL row. In MongoDB, nested fields and arrays are normal.

MongoDB document example

Connect with PyMongo

Install the driver:

python3 -m pip install pymongo

Then connect:

import os
import pymongo

connection_string = os.environ.get(
    "MONGODB_URI",
    "mongodb://localhost:27017/",
)

client = pymongo.MongoClient(connection_string, serverSelectionTimeoutMS=3000)
client.admin.command("ping")

ping is a quick connection check. If it fails, check whether MongoDB is running, whether the host/port are correct, and whether credentials are valid.

The source notebook uses a classroom Docker connection string. Treat those credentials as local demo credentials, not a habit for real projects.

Choose database and collection

db = client["sampledb"]
collection = db["samplecollection"]

MongoDB creates databases and collections lazily. They usually appear once data is inserted.

Insert one document

record = {
    "title": "MongoDB and Python",
    "description": "A document database example",
    "tags": ["mongodb", "database", "NoSQL"],
    "viewers": 104,
}

result = collection.insert_one(record)
print(result.inserted_id)

If you do not provide _id, MongoDB adds one. _id is the document's unique identifier, similar in purpose to a primary key.

Insert many JSON documents

import json

with open("Data/movies.json", encoding="utf-8") as file:
    movies = json.load(file)

if isinstance(movies, list):
    collection.insert_many(movies)
else:
    collection.insert_one(movies)

This matches the notebook's movie example. JSON maps naturally into MongoDB documents.

Find documents

Query objects are dictionaries.

query = {"year": 1900}

cursor = collection.find(query)

for movie in cursor:
    print(movie)

Important: find returns a cursor. It does not immediately give you a printed list. You iterate over it, limit it, sort it, or convert it when appropriate.

Count, limit, and sort

count = collection.count_documents({"year": 1900})
for movie in collection.find({"year": 1900}).limit(5):
    print(movie)
for movie in collection.find().sort("title", 1).limit(10):
    print(movie["title"])

Sort direction 1 means ascending. Sort direction -1 means descending.

Use query operators

MongoDB query operators are written inside dictionaries.

query = {"title": {"$gt": "S"}}

This finds titles alphabetically greater than "S".

For arrays:

query = {"amenities": {"$all": ["Wifi", "TV"]}}

This finds documents whose amenities array contains both "Wifi" and "TV".

For nested fields, use dot notation:

query = {
    "address.country": {"$in": ["Italy", "Spain"]},
    "review_scores.review_scores_rating": {"$gte": 90},
}

Dot notation lets you query inside embedded documents.

Use projection to control output

The second argument to find is a projection: which fields to include or exclude.

projection = {
    "_id": 0,
    "listing_url": 1,
    "name": 1,
    "price": 1,
    "address.country": 1,
}

results = collection.find({}, projection).limit(10)

This is the MongoDB version of saying:

Do not send me every field. Send only the fields I need.

That matters because documents can be large and nested.

Convert selected documents to Pandas

import pandas as pd

cursor = collection.find(
    {"year": 1900},
    {"_id": 0, "title": 1, "year": 1, "genres": 1},
)

df_movies = pd.DataFrame(list(cursor))

This is useful, but use it carefully. list(cursor) loads all selected documents into memory. Filter and project first, then convert.

For deeply nested documents, Pandas may create columns that still contain dictionaries or lists. You may need explicit flattening:

df["country"] = df["address"].apply(lambda value: value.get("country"))

That flattening work is a major part of the Airbnb section in the notebook.

Explained code examples

Safe connection helper

import os
import pymongo


def connect_to_mongodb():
    uri = os.environ.get("MONGODB_URI", "mongodb://localhost:27017/")
    return pymongo.MongoClient(uri, serverSelectionTimeoutMS=3000)

What this teaches:

  • keep the connection string outside source code when it contains credentials
  • set a timeout so connection problems fail quickly
  • return a client that can access multiple databases

Query Airbnb-style nested fields

query = {
    "amenities": {"$all": ["Wifi", "Washer"]},
    "address.country": {"$in": ["Italy", "Spain"]},
    "review_scores.review_scores_rating": {"$gte": 90},
}

projection = {
    "_id": 0,
    "name": 1,
    "listing_url": 1,
    "address.country": 1,
    "review_scores.review_scores_rating": 1,
    "price": 1,
}

results = db.listingsAndReviews.find(query, projection).limit(10)

What this teaches:

  • arrays can be queried with $all
  • nested fields can be queried with dot notation
  • comparison operators such as $gte work inside query dictionaries
  • projection keeps the result readable

MongoDB to Pandas, but only after reduction

cursor = db.listingsAndReviews.find(query, projection).limit(100)
df = pd.DataFrame(list(cursor))

This is the bridge to analysis. MongoDB handles the document query. Pandas handles the tabular analysis after you have a reasonably sized result.

Common traps

MongoDB is just SQL with different syntax.

MongoDB is document-oriented. Think in documents, embedded fields, arrays, and collections rather than normalized tables first.

Schema-less means no structure.

MongoDB is flexible, but your application still depends on expected fields and shapes. Flexible schema is not a license for messy data.

find returns the data.

find returns a cursor. You need to iterate over it, limit it, or convert it.

Convert the whole collection to Pandas.

Filter and project first. Large collections and nested documents can overwhelm memory or create awkward DataFrames.

Projection is optional detail.

Projection is how you keep document queries readable and efficient. Ask for only the fields you need.

Arrays and nested objects behave like flat columns.

They do not. Use array operators and dot notation, then flatten deliberately if you need a DataFrame.

Document databases never have relationships.

They can reference related data, but the modeling style usually prefers embedding data that is read together.

Check yourself

What is the MongoDB equivalent of a table?

A collection. It contains documents.

What is a MongoDB document?

A JSON-like record made of fields, values, nested objects, and arrays.

What does PyMongo's MongoClient represent?

A client connection object that lets Python access MongoDB databases and collections.

What does find return?

A cursor over matching documents, not an immediate plain list.

Why use projection in find?

To include only the fields needed for the task and avoid transferring bulky documents.

How do you query a nested field such as country inside address?

Use dot notation, for example "address.country": "Spain".

When should MongoDB results become a Pandas DataFrame?

After the MongoDB query has filtered and projected the data to a manageable analysis-ready subset.

What is the main modeling difference from MySQL?

MySQL usually normalizes data across related tables; MongoDB often embeds related data inside flexible documents.

Source anchors

  • Source file: notebooks/Module2/03-Interfacing Python and MongoDB.ipynb
  • Source datasets: notebooks/Module2/Data/movies.json, notebooks/Module2/Data/listingsAndReviews.json
  • Key source concepts: MongoDB document database, PyMongo, connection strings, databases, collections, documents, _id, insert_one, insert_many, find, cursors, count_documents, limit, sort, query operators, projection, nested fields, arrays, MongoDB to Pandas conversion
  • Source images: study-guide/docs/assets/extracted/mongoexample.jpg, study-guide/docs/assets/extracted/mongodb_find.jpg