Introduction to PyTorch

Why this matters

The previous lesson implemented a neural network by hand with NumPy. PyTorch gives you the same core workflow with better tools:

tensors -> computation graph -> automatic gradients -> optimizer updates

This matters because later lessons build toward language models. Those models are too large and too complex to train comfortably with hand-written NumPy backpropagation.

Mental model

PyTorch is three things at once:

[Figure: PyTorch overview]

  • a tensor library, like NumPy with GPU support
  • an automatic differentiation engine, called autograd
  • a neural-network toolkit, with layers, losses, optimizers, datasets, and training utilities

The main shift from NumPy is that PyTorch can remember tensor operations and compute gradients automatically.

Core ideas

  • A tensor is a numeric array: scalar, vector, matrix, or higher-dimensional block.
  • PyTorch tensors can live on CPU or GPU.
  • Tensor shape and dtype must match the operation you want to run.
  • requires_grad=True tells PyTorch to track operations for gradient computation.
  • loss.backward() computes gradients for trainable parameters.
  • optimizer.step() updates parameters using those gradients.
  • optimizer.zero_grad() clears old gradients before the next update.
  • torch.nn.Module is the standard way to define reusable models.
  • Dataset returns individual examples; DataLoader batches, shuffles, and loads them.
  • During inference, use model.eval() and torch.no_grad() or torch.inference_mode().
  • Save learned weights with state_dict, not by relying on a live Python object.

Walkthrough

Installing and checking PyTorch

The notebook begins with installation and GPU checks.

import torch

torch.cuda.is_available()

If this returns True, PyTorch can see a CUDA-capable GPU. If it returns False, the code can still run on CPU, just slower for larger models.

In practice, use the official PyTorch install selector for your operating system and CUDA version. The exact install command changes depending on your machine.

Tensors

Tensors generalize arrays:

[Figure: Tensor ranks]

0D tensor -> scalar
1D tensor -> vector
2D tensor -> matrix
3D+ tensor -> stacked numeric blocks

Create tensors from Python lists or NumPy arrays:

import numpy as np
import torch

a = [1, 2, 3]
b = np.array([4, 5, 6], dtype=np.int32)

t_a = torch.tensor(a)
t_b = torch.from_numpy(b)

Create common tensors directly:

ones = torch.ones(2, 3)
random_values = torch.rand(2, 3)
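
To see what these constructors actually return, it can help to inspect shape and dtype right away. The sketch below reuses the variables from above and also illustrates one behavior worth knowing: torch.from_numpy shares memory with the source array, while torch.tensor makes a copy.

```python
import numpy as np
import torch

a = [1, 2, 3]
b = np.array([4, 5, 6], dtype=np.int32)

t_a = torch.tensor(a)      # copies the list; Python ints become int64
t_b = torch.from_numpy(b)  # shares memory with b and keeps its int32 dtype

print(t_a.shape, t_a.dtype)
print(t_b.shape, t_b.dtype)

# Because t_b shares memory with b, changing b changes t_b as well.
b[0] = 99
print(t_b)

ones = torch.ones(2, 3)
print(ones.shape, ones.dtype)  # floating-point by default (float32 unless reconfigured)
```

If you need an independent copy of a NumPy array, torch.tensor(b) is the safer choice.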

For images and deep-learning batches, a common convention is:

[batch, channels, height, width]

For text models, you will often see:

[batch, context_length, embedding_dim]
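
As a rough illustration of both conventions, the sketch below builds one image-like batch and one text-like batch. The sizes, the vocabulary size, and the use of nn.Embedding are assumptions chosen for the example, not values from the notebook.

```python
import torch
import torch.nn as nn

batch_size, context_length, embedding_dim = 4, 8, 16
vocab_size = 100  # hypothetical vocabulary size

# Image-like batch: [batch, channels, height, width]
image_batch = torch.rand(batch_size, 3, 32, 32)
print(image_batch.shape)

# Text-like batch: integer token ids of shape [batch, context_length] ...
token_ids = torch.randint(0, vocab_size, (batch_size, context_length))

# ... become [batch, context_length, embedding_dim] after an embedding lookup.
embedding = nn.Embedding(vocab_size, embedding_dim)
text_batch = embedding(token_ids)
print(text_batch.shape)
```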

Shape and dtype operations

Many PyTorch errors come down to shape or dtype mismatches. The notebook introduces the essential tools:

t = torch.rand(3, 5)
t_transposed = torch.transpose(t, 0, 1)
t = torch.zeros(30)
t_reshaped = torch.reshape(t, (5, 6))
t = torch.zeros(1, 2, 1, 4, 1)
t_squeezed = torch.squeeze(t)
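
Printing the shape before and after each call makes the effect of these operations concrete. The sketch below repeats the calls above; the final line, which squeezes a single dimension, is an addition for illustration.

```python
import torch

t = torch.rand(3, 5)
t_transposed = torch.transpose(t, 0, 1)
print(t.shape, "->", t_transposed.shape)   # [3, 5] -> [5, 3]

t = torch.zeros(30)
t_reshaped = torch.reshape(t, (5, 6))
print(t.shape, "->", t_reshaped.shape)     # [30] -> [5, 6]

t = torch.zeros(1, 2, 1, 4, 1)
t_squeezed = torch.squeeze(t)              # removes every size-1 dimension
print(t.shape, "->", t_squeezed.shape)     # [1, 2, 1, 4, 1] -> [2, 4]

t_squeezed_dim2 = torch.squeeze(t, 2)      # removes only dimension 2
print(t_squeezed_dim2.shape)               # [1, 2, 4, 1]
```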

Use .to(...) to change dtype or device:

t_int64 = t_a.to(torch.int64)

Later you will also see:

features = features.to(device)
labels = labels.to(device)

That moves tensors to CPU or GPU.

Tensor math

PyTorch supports elementwise math, reductions, matrix multiplication, splitting, stacking, and concatenation.

Examples:

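t1 = torch.rand(5, 2)
t2 = torch.rand(5, 2)

product = torch.multiply(t1, t2)                 # elementwise, shape [5, 2]
column_means = torch.mean(t1, dim=0)             # one mean per column, shape [2]
matrix_product = t1 @ t2.T                       # [5, 2] @ [2, 5] -> [5, 5]
row_norms = torch.linalg.norm(t1, ord=2, dim=1)  # one L2 norm per row, shape [5]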

The pattern to watch:

dim=0 -> operate down rows, one result per column
dim=1 -> operate across columns, one result per row

For classification outputs, this is why torch.argmax(logits, dim=1) means "choose the best class for each example."
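
A small worked example with hand-picked numbers shows the dim rule in action. The tensor values below are made up for illustration.

```python
import torch

t1 = torch.tensor([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

column_means = torch.mean(t1, dim=0)  # collapses the rows: 2.5, 3.5, 4.5
row_means = torch.mean(t1, dim=1)     # collapses the columns: 2.0, 5.0
print(column_means, row_means)

# Two examples, three classes: argmax over dim=1 picks one class per example.
logits = torch.tensor([[0.1, 2.0, -1.0],
                       [3.0, 0.2, 0.5]])
predictions = torch.argmax(logits, dim=1)
print(predictions)                    # class 1 for the first row, class 0 for the second
```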

Computation graphs

PyTorch builds computation graphs from tensor operations.

[Figure: Computation graph]

A tiny logistic-regression-like example:

import torch
import torch.nn.functional as F

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

z = x1 * w1 + b
a = torch.sigmoid(z)
loss = F.binary_cross_entropy(a, y)

Because w1 and b have requires_grad=True, PyTorch tracks how loss depends on them.
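
One way to confirm that tracking is happening is to inspect requires_grad, grad_fn, and is_leaf on the tensors involved. This is a small inspection sketch built on the example above, not part of the training workflow itself.

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

z = x1 * w1 + b
a = torch.sigmoid(z)
loss = F.binary_cross_entropy(a, y)

print(x1.requires_grad)           # False: plain input, not tracked
print(z.requires_grad)            # True: computed from tracked tensors
print(loss.grad_fn)               # the graph node that produced loss
print(w1.is_leaf, loss.is_leaf)   # True False: w1 is a graph input, loss is a result
```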

Autograd

Autograd is PyTorch's automatic differentiation system.

[Figure: Automatic differentiation]

The usual training pattern is:

loss.backward()

print(w1.grad)
print(b.grad)

After backward, each tracked parameter stores its gradient in .grad.

That replaces the hand-written .backward() method from the NumPy MLP lesson.
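
For this tiny example the gradients can also be derived by hand, which makes a useful sanity check. With a sigmoid output and binary cross-entropy, the derivative of the loss with respect to z simplifies to a - y, so the gradient for w1 is (a - y) * x1 and the gradient for b is a - y. The sketch below compares that derivation against autograd; the manual_* names are introduced here for illustration.

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

z = x1 * w1 + b
a = torch.sigmoid(z)
loss = F.binary_cross_entropy(a, y)

loss.backward()  # autograd fills w1.grad and b.grad

# Hand-derived gradients for sigmoid + binary cross-entropy.
manual_grad_b = (a - y).detach()
manual_grad_w1 = manual_grad_b * x1

print(w1.grad, manual_grad_w1)
print(b.grad, manual_grad_b)
print(torch.allclose(w1.grad, manual_grad_w1, atol=1e-6))
print(torch.allclose(b.grad, manual_grad_b, atol=1e-6))
```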

Defining a model with nn.Sequential

For simple feedforward models, torch.nn.Sequential is concise:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(input_features, 30),
    nn.ReLU(),
    nn.Linear(30, 15),
    nn.ReLU(),
    nn.Linear(15, 3),
)

Important: for multiclass classification with nn.CrossEntropyLoss, the model should return raw logits. Do not add Softmax as the final training layer. CrossEntropyLoss applies the needed log-softmax internally in a numerically stable way.

Use softmax later only if you want to display probabilities:

probas = torch.softmax(logits, dim=1)
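
To make the "log-softmax internally" point concrete, the sketch below checks that cross-entropy on raw logits matches an explicit log-softmax followed by negative log-likelihood. The logits and labels are random placeholders, not data from the notebook.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)            # 4 examples, 3 classes (made-up values)
labels = torch.tensor([0, 2, 1, 2])

loss_direct = F.cross_entropy(logits, labels)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(loss_direct, loss_manual))

# Softmax is only needed when you want to read the outputs as probabilities.
probas = torch.softmax(logits, dim=1)
print(probas.sum(dim=1))              # each row sums to 1

# The predicted class is the same either way, since softmax preserves ordering.
same_prediction = torch.equal(torch.argmax(probas, dim=1), torch.argmax(logits, dim=1))
print(same_prediction)
```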

Defining a model with nn.Module

For reusable models, subclass torch.nn.Module.

class NeuralNetwork(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(num_inputs, 30),
            nn.ReLU(),
            nn.Linear(30, 20),
            nn.ReLU(),
            nn.Linear(20, num_outputs),
        )

    def forward(self, x):
        logits = self.layers(x)
        return logits

What this teaches:

  • __init__ defines the trainable layers
  • forward defines how inputs flow through the model
  • you normally do not implement backward
  • PyTorch tracks parameters from layers such as nn.Linear

Count trainable parameters:

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
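
The count can be verified by hand, because each nn.Linear(in, out) layer holds in * out weights plus out biases. The sketch below uses the NeuralNetwork class from above with assumed sizes of 50 inputs and 3 outputs.

```python
import torch.nn as nn

class NeuralNetwork(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(num_inputs, 30),
            nn.ReLU(),
            nn.Linear(30, 20),
            nn.ReLU(),
            nn.Linear(20, num_outputs),
        )

    def forward(self, x):
        return self.layers(x)

model = NeuralNetwork(num_inputs=50, num_outputs=3)  # sizes chosen for illustration
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

# By hand: (50*30 + 30) + (30*20 + 20) + (20*3 + 3) = 1530 + 620 + 63
expected = (50 * 30 + 30) + (30 * 20 + 20) + (20 * 3 + 3)
print(num_params, expected)  # 2213 2213
```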

Device handling

PyTorch tensors and models must be on the same device.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(device)
features = features.to(device)
labels = labels.to(device)

If the model is on GPU but the data is on CPU, the forward pass raises a RuntimeError about tensors being on different devices. Move both consistently.
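
When in doubt, check where things actually live. A model has no single device attribute, so a common idiom is to look at one of its parameters. The sketch below uses a small nn.Linear layer as a stand-in model.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(2, 2).to(device)        # stand-in for a real model
features = torch.rand(4, 2).to(device)

param_device = next(model.parameters()).device
print(param_device, features.device)      # both report the same device

outputs = model(features)                 # works because the devices match
print(outputs.shape)
```

Note that for a tensor, .to(device) returns a new tensor and must be assigned back, while for a module it moves the parameters in place and returns the same module.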

Dataset and DataLoader

The notebook introduces PyTorch's data pipeline:

[Figure: DataLoader principle]

  • Dataset: knows how to return one example
  • DataLoader: turns examples into batches and handles shuffling

A minimal custom dataset:

from torch.utils.data import Dataset

class ToyDataset(Dataset):
    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index):
        return self.features[index], self.labels[index]

    def __len__(self):
        return self.labels.shape[0]

Wrap it in a loader:

from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    drop_last=True,
)

Why this matters:

  • batching keeps memory use controlled
  • shuffling changes example order each epoch
  • drop_last=True discards the final batch when it is smaller than batch_size, so every update sees a full batch
  • num_workers can parallelize data loading for larger datasets
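
Putting the two pieces together on a tiny example shows what the loader yields. The five training examples below are made-up values; with batch_size=2 and drop_last=True, the fifth example is left out of each epoch's batches.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index):
        return self.features[index], self.labels[index]

    def __len__(self):
        return self.labels.shape[0]

# Hypothetical toy data: 5 examples, 2 features, 2 classes.
X_train = torch.tensor([[-1.2, 3.1], [-0.9, 2.9], [-0.5, 2.6],
                        [2.3, -1.1], [2.7, -1.5]])
y_train = torch.tensor([0, 0, 0, 1, 1])

train_ds = ToyDataset(X_train, y_train)
train_loader = DataLoader(dataset=train_ds, batch_size=2, shuffle=True, drop_last=True)

print(len(train_ds))      # 5 examples
print(len(train_loader))  # 2 full batches; the leftover example is dropped

batch_shapes = []
for features, labels in train_loader:
    batch_shapes.append((tuple(features.shape), tuple(labels.shape)))
print(batch_shapes)       # every batch is ((2, 2), (2,))
```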

Training loop

A standard PyTorch training loop has a stable shape:

import torch.nn.functional as F

optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for epoch in range(num_epochs):
    model.train()

    for features, labels in train_loader:
        features = features.to(device)
        labels = labels.to(device)

        logits = model(features)
        loss = F.cross_entropy(logits, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Read it as:

training mode -> forward pass -> loss -> clear old gradients -> backward pass -> update
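
The loop above assumes that model, train_loader, device, and num_epochs already exist. A minimal end-to-end sketch that fills in those pieces might look like the following. The toy data, the small nn.Sequential model, and the use of TensorDataset in place of a custom dataset class are all assumptions made to keep the example short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(123)

# Hypothetical toy data: 5 examples, 2 features, 2 classes.
X_train = torch.tensor([[-1.2, 3.1], [-0.9, 2.9], [-0.5, 2.6],
                        [2.3, -1.1], [2.7, -1.5]])
y_train = torch.tensor([0, 0, 0, 1, 1])
train_loader = DataLoader(TensorDataset(X_train, y_train),
                          batch_size=2, shuffle=True, drop_last=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(2, 30), nn.ReLU(), nn.Linear(30, 2)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

num_epochs = 3
for epoch in range(num_epochs):
    model.train()

    for features, labels in train_loader:
        features = features.to(device)
        labels = labels.to(device)

        logits = model(features)                 # forward pass
        loss = F.cross_entropy(logits, labels)   # loss on raw logits

        optimizer.zero_grad()                    # clear old gradients
        loss.backward()                          # backward pass
        optimizer.step()                         # parameter update

    print(f"epoch {epoch + 1}: last batch loss {loss.item():.4f}")
```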

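The order matters: zero_grad() must run before backward() so stale gradients are not added to the new ones, and step() must run after backward() so the update uses the gradients just computed.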
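
The reason clearing is needed at all is that backward() adds to .grad rather than overwriting it. The sketch below demonstrates this on a single hand-made parameter w, with a toy loss of 3 * w whose gradient is always 3.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(3.0 * w).sum().backward()
first = w.grad.clone()       # gradient of 3*w is 3

(3.0 * w).sum().backward()   # no clearing in between
second = w.grad.clone()      # 3 + 3 = 6: the new gradient was added to the old one

w.grad = None                # clear by hand; optimizer.zero_grad() serves this purpose in a real loop
(3.0 * w).sum().backward()
third = w.grad.clone()       # back to 3

print(first, second, third)
```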

Evaluation and inference

For evaluation:

model.eval()

with torch.no_grad():
    logits = model(features)
    predictions = torch.argmax(logits, dim=1)

Use model.eval() because layers like dropout and batch normalization behave differently during training and evaluation.

Use torch.no_grad() or torch.inference_mode() because you do not need gradients for evaluation. That saves memory and computation.
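
Both effects can be observed directly. The sketch below uses a small stand-in model with a dropout layer, which is an assumption for illustration since the models in this lesson do not use dropout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))  # stand-in model
features = torch.rand(2, 4)

model.eval()
with torch.no_grad():
    out_a = model(features)
    out_b = model(features)

print(torch.equal(out_a, out_b))  # True: dropout is disabled in eval mode
print(out_a.requires_grad)        # False: no graph was recorded under no_grad

model.train()
out_train = model(features)       # dropout is active again and the graph is tracked
print(out_train.requires_grad)    # True
```

Remember to call model.train() again before resuming training, since eval mode persists until it is changed.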

Accuracy function

A reusable accuracy function loops over a dataloader:

def compute_accuracy(model, dataloader, device):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for features, labels in dataloader:
            features = features.to(device)
            labels = labels.to(device)

            logits = model(features)
            predictions = torch.argmax(logits, dim=1)

            correct += torch.sum(predictions == labels).item()
            total += labels.shape[0]

    return correct / total

This scales better than trying to evaluate an entire large dataset at once.
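
A quick way to sanity-check a function like this is to run it on data where the answer is known in advance. In the sketch below, nn.Identity serves as a hypothetical "model" that returns its input unchanged, so the features double as logits and the expected accuracy can be worked out by hand.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def compute_accuracy(model, dataloader, device):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for features, labels in dataloader:
            features = features.to(device)
            labels = labels.to(device)

            logits = model(features)
            predictions = torch.argmax(logits, dim=1)

            correct += torch.sum(predictions == labels).item()
            total += labels.shape[0]

    return correct / total

# Made-up "logits": the argmax per row is 0, 1, 0, 1.
features = torch.tensor([[2.0, 0.1], [0.3, 1.5], [0.9, 0.2], [0.1, 0.4]])
labels = torch.tensor([0, 1, 1, 1])  # the third example is deliberately mismatched
loader = DataLoader(TensorDataset(features, labels), batch_size=2)

accuracy = compute_accuracy(nn.Identity(), loader, torch.device("cpu"))
print(accuracy)  # 3 of 4 correct -> 0.75
```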

Saving and loading

The recommended basic pattern is to save the model's state_dict.

torch.save(model.state_dict(), "model.pth")

Load it into a model with the same architecture:

model = NeuralNetwork(num_inputs=2, num_outputs=2)
model.load_state_dict(torch.load("model.pth", weights_only=True))
model.eval()

state_dict stores learned weights and biases. The class definition still needs to exist when you load those weights.
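
A round trip through a temporary file shows what is and is not stored. The sketch below uses a small nn.Sequential model as a stand-in; the map_location and weights_only arguments are optional additions that assume a reasonably recent PyTorch release.

```python
import os
import tempfile

import torch
import torch.nn as nn

def build_model():
    # The architecture must be rebuilt in code; it is not stored in the file.
    return nn.Sequential(nn.Linear(2, 3), nn.ReLU(), nn.Linear(3, 2))

model = build_model()
print(list(model.state_dict().keys()))  # ['0.weight', '0.bias', '2.weight', '2.bias']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pth")
    torch.save(model.state_dict(), path)

    restored = build_model()            # fresh random weights
    state = torch.load(path, map_location="cpu", weights_only=True)
    restored.load_state_dict(state)
    restored.eval()

# After loading, every tensor matches the original model.
weights_match = all(
    torch.equal(model.state_dict()[key], restored.state_dict()[key])
    for key in model.state_dict()
)
print(weights_match)
```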

Common traps

Misconception: PyTorch tensors can hold arbitrary Python objects like NumPy arrays can.
Correction: PyTorch tensors are numeric. Text, categories, and objects need encodings, dictionaries, or embeddings.

Misconception: Softmax should always be the last model layer.
Correction: For training with CrossEntropyLoss, return logits. Apply softmax only for displaying probabilities.

Misconception: Calling backward() updates the model.
Correction: backward() computes gradients. optimizer.step() updates the parameters.

Misconception: Gradients reset automatically.
Correction: They do not. Gradients accumulate across backward() calls, so call optimizer.zero_grad() before each backward pass.

Misconception: CPU tensors and GPU tensors can mix freely.
Correction: The model and tensors used in one operation must be on the same device.

Misconception: Evaluation is just training without labels.
Correction: Evaluation should use model.eval() and a no-gradient context to get correct behavior and save resources.

Misconception: A DataLoader is the dataset.
Correction: The dataset defines individual examples. The dataloader defines batching, shuffling, and loading behavior.

Check yourself

What does requires_grad=True do?

It tells PyTorch to track operations on that tensor so gradients can be computed during backpropagation.

What is stored in a parameter's .grad attribute?

The gradient of the loss with respect to that parameter after loss.backward() has been called.

Why do we call optimizer.zero_grad() before loss.backward()?

PyTorch accumulates gradients by default. Clearing them prevents old gradients from affecting the current update.

What does optimizer.step() do?

It updates model parameters using the gradients and the optimizer's update rule.

Why should CrossEntropyLoss receive logits?

It combines log-softmax and negative log-likelihood internally for numerical stability.

What is the difference between Dataset and DataLoader?

A Dataset returns individual examples. A DataLoader batches, shuffles, and iterates over those examples.

Why use model.eval() during evaluation?

It switches layers such as dropout and batch normalization into evaluation behavior.

What does state_dict save?

It saves the model's learned parameter tensors, such as weights and biases, along with any persistent buffers. It does not save the architecture itself, which still has to be defined in code.

Source anchors

This lesson rewrites the main ideas from 11-Introduction to PyTorch.ipynb:

  • PyTorch installation and CUDA check
  • tensors, shape, dtype, and tensor operations
  • computation graphs and autograd
  • manual gradient examples with requires_grad
  • simple MLPs with nn.Sequential
  • reusable models with torch.nn.Module
  • device handling for CPU/GPU
  • Dataset and DataLoader
  • PyTorch training loop
  • inference, accuracy computation, and save/load with state_dict