AI & Machine Learning - Unit 2¶

11. Backpropagation (Intro to Deep Learning)¶

Prof. Iacopo Masi - Computer Science Department, Sapienza University of Rome

Recap previous lecture¶

Multi-Class Classification
SoftMax Regression plus Cross-Entropy Loss
Optimization in Deep Learning with SGD over mini-batches with momentum
MLP and Fully-Connected Neural Nets
Intro to Backpropagation

Today's lecture¶

Supervised, Parametric Models¶

Propaedeutic part for Deep Learning¶

0) Go back to Backpropagation¶

1) Backpropagation with matrices and vectors (Jacobians, Gradients)¶

2) End of the course! 🤖 🦿¶

This lecture material is taken from¶

Now you see why it's named Multi-Layer Perceptron (MLP)¶

Representation of a Single Layer¶

Let's consider our linear softmax regressor

$$ \underbrace{\mbf{z}}_{\mathbb{R}^{K\times 1}} = \underbrace{\mbf{W}}_{\mathbb{R}^{K\times d}}\underbrace{\mbf{x}}_{\mathbb{R}^{d\times1}} + \underbrace{\mbf{b}}_{\mathbb{R}^{K\times 1}}$$

We interpret this as a linear layer $\mathbf{W} \mathbf{x}+\bmf{b}$ followed by a non-linear activation function $\sigma$

$$ \sigma(\mathbf{W} \mathbf{x} + \bmf{b})=\sigma \circ\left(\begin{array}{cccc} w_{11} & w_{12} & \cdots & w_{1 d} \\ w_{21} & w_{22} & \cdots & w_{2 d} \\ \vdots & \cdots & \ddots & \vdots \\ w_{k 1} & w_{k 2} & \cdots & w_{k d} \end{array}\right)\left(\begin{array}{c} x_{1} \\ x_{2} \\ \vdots \\ x_{d} \end{array}\right)+ \left(\begin{array}{c} b_{1} \\ b_{2} \\ \vdots \\ b_{k} \end{array}\right) =\sigma \circ\left(\begin{array}{c} z_{1} \\ z_{2} \\ \vdots \\ z_{k} \end{array}\right) $$

Representation of a Single Layer¶

$$ \sigma(\mathbf{W} \mathbf{x} + \bmf{b})=\sigma \circ\left(\begin{array}{cccc} \underline{w_{11}} & \underline{w_{12}} & \cdots & \underline{w_{1 d}} \\ w_{21} & w_{22} & \cdots & w_{2 d} \\ \vdots & \cdots & \ddots & \vdots \\ w_{k 1} & w_{k 2} & \cdots & w_{k d} \end{array}\right)\left(\begin{array}{c} x_{1} \\ x_{2} \\ \vdots \\ x_{d} \end{array}\right)+ \left(\begin{array}{c} b_{1} \\ b_{2} \\ \vdots \\ b_{k} \end{array}\right) =\sigma \circ\left(\begin{array}{c} z_{1} \\ z_{2} \\ \vdots \\ z_{k} \end{array}\right) $$

Representation of a Single Layer¶

$$ \sigma(\mathbf{W} \mathbf{x} + \bmf{b})=\sigma \circ\left(\begin{array}{cccc} {w_{11}} & {w_{12}} & \cdots & {w_{1 d}} \\ w_{21} & w_{22} & \cdots & w_{2 d} \\ \vdots & \cdots & \ddots & \vdots \\ \underline{w_{k 1}} & \underline{w_{k 2}} & \cdots & \underline{w_{k d}} \end{array}\right)\left(\begin{array}{c} x_{1} \\ x_{2} \\ \vdots \\ x_{d} \end{array}\right)+ \left(\begin{array}{c} b_{1} \\ b_{2} \\ \vdots \\ b_{k} \end{array}\right) =\sigma \circ\left(\begin{array}{c} z_{1} \\ z_{2} \\ \vdots \\ z_{k} \end{array}\right) $$

Representation of a Single Layer: Linear plus non-Linear¶

$$ \mathbf{W} \mathbf{x}=\left(\begin{array}{c} -\text { unit - } \\ \vdots \\ -\text { unit }- \end{array}\right)\left(\begin{array}{c} \mid \\ \mathbf{x} \\ \mid \end{array}\right) $$
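A minimal NumPy sketch of the single layer above. It assumes $\sigma$ is the softmax from the recap, and the sizes `K=3`, `d=4` are arbitrary values chosen only for illustration. Each row of `W` plays the role of one unit.

```python
import numpy as np

def softmax(z):
    # Subtracting the max improves numerical stability without changing the result.
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
K, d = 3, 4                      # illustrative sizes, not from the lecture
W = rng.normal(size=(K, d))      # one row per unit
b = rng.normal(size=K)
x = rng.normal(size=d)

z = W @ x + b                    # linear layer: K scores
p = softmax(z)                   # non-linear activation: K probabilities

# A single row of W (one unit) produces a single score.
z0 = W[0] @ x + b[0]
print(z.shape, p.sum(), np.isclose(z0, z[0]))
```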

Representation as a computational graph¶

Adding another non-linear layer before the classifier¶

We improve the expressiveness of our learned function by adding another NON-linear layer before the classification layer.
Think of this new layer as a feature map $\mbf{x} \mapsto \phi(\mbf{x})$; it maps our attributes to a feature space.
Now the classifier no longer classifies $\mbf{x}$ directly but the feature $ \phi(\mbf{x})$.
The notation becomes heavier: a superscript is the layer index; a subscript selects the unit.
$\mathbf{W}^1 \in \mathbb{R}^{p\times d}$, $\bmf{b}^1 \in \mathbb{R}^{p}$, and then $\mathbf{W}^2 \in \mathbb{R}^{k\times p}$, $\bmf{b}^2 \in \mathbb{R}^{k}$ (shapes chosen so that $\mathbf{W}^1\mathbf{x}$ is defined for $\mathbf{x} \in \mathbb{R}^{d}$)

$$\mbf{p}=\sigma(\mathbf{W}^2\underbrace{\left(\sigma(\mathbf{W}^1 \mathbf{x} + \bmf{b}^1) \right)}_{\bmf{\phi}(\mbf{x})} + \bmf{b}^2)$$

$$\text{dim. analysis:} \quad d \mapsto p \mapsto k$$

$\mathbf{W}^1 \in \mathbb{R}^{p\times d}$ is a Hidden Layer¶

Because it maps the original attributes from dimension $d$ to an intermediate dimension $p$, and the $p$-dimensional feature is then used for classification.

A priori you do not know what $\mathbf{W}^1$ may learn.

$$\mbf{p}=\sigma(\mathbf{W}^2\underbrace{\left(\sigma(\mathbf{W}^1 \mathbf{x} + \bmf{b}^1) \right)}_{\bmf{\phi}(\mbf{x})} + \bmf{b}^2)$$

$$\text{dim. analysis:} \quad d \mapsto p \mapsto k$$
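The dimension analysis $d \mapsto p \mapsto k$ can be checked with a small NumPy sketch. Several things here are assumptions made for illustration: the hidden activation is taken to be ReLU, the output activation softmax, the sizes are arbitrary, and `W1` is stored with shape `(p, d)` so that `W1 @ x` is defined.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)    # applied element-wise

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, p_dim, k = 5, 8, 3            # illustrative sizes
W1, b1 = rng.normal(size=(p_dim, d)), np.zeros(p_dim)
W2, b2 = rng.normal(size=(k, p_dim)), np.zeros(k)

x = rng.normal(size=d)
phi = relu(W1 @ x + b1)          # hidden feature phi(x), dimension p
probs = softmax(W2 @ phi + b2)   # class probabilities, dimension k
print(x.shape, phi.shape, probs.shape)   # d -> p -> k
```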

Let's update our visualizations¶

Multi-Layer Perceptron (MLP) with one hidden layer¶

Given the nature of these layers, they're called Fully-Connected NN¶

Multi-Layer Perceptron with one hidden layer¶

Non-linear activation functions: ReLU - Rectified Linear Unit¶

Very important: Activation Functions are computed element-wise.

$$ \sigma(z)= \max(0,z) \quad \text{ReLU}$$

ReLU is a piecewise linear function

Sigmoid¶

Used to model output probability
Nowadays not used in middle layers
Have to compute $\exp()$
Vanishing gradients for large input magnitude

ReLU¶

Computationally efficient (no exp!)
No vanishing gradients for positive inputs, but no gradient passes through for negative inputs
Converges much faster than sigmoid (about 6x)
Not differentiable at zero (use subgradients)
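The gradient behaviour listed above can be seen numerically. This is a small sketch in plain Python; the choice of returning 0 as the ReLU subgradient at zero is an assumption (any value in $[0,1]$ is valid).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 and shrinks for large |z|

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # At z == 0 any value in [0, 1] is a valid subgradient; 0 is chosen here.
    return 1.0 if z > 0 else 0.0

for z in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(z, sigmoid_grad(z), relu_grad(z))
```

For large input magnitudes the sigmoid derivative is close to zero, while the ReLU derivative stays at 1 for every positive input and is exactly 0 for negative ones.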

Backpropagation and Differentiable Programming¶

NN can be huge composition of functions! 😱¶

Three ways of computing the gradients $\nabla_{\mbf{w}}\mathcal{L}(x,y;\mbf{w})$¶

Manually: every time we change the network we have to redo the derivation. For a 100-layer neural net this is not a good idea and does not scale, even with symbolic differentiation tools such as Mathematica ✍🏼
Finite differences: good for checking the gradients once you have an automatic way of computing them; very slow, infeasible in training! 👩🏾‍💻
Backpropagation: application of the chain rule of calculus to tensors, using a computational graph with caching (differentiable programming with automatic differentiation) 💻

1. Infeasible to derive the gradient update manually; let the machines work for us¶

2. Finite difference (very slow!) but used for gradient check¶

Assume $\mathbf{W}$ is your matrix and $w=\mathbf{W}_{ij}$ is a scalar inside your matrix.

[Offline] Evaluate your NN loss at the current weight values, $L(w)$.

For $i,j = 1,\ldots,\text{dims}$:

$w=\mathbf{W}_{ij}$
We want to measure the impact of the parameter $w$ on the loss.
Perturb $w$ by $\epsilon=10^{-5}$ and evaluate the new loss $L(w+\epsilon)$.
The numerical gradient at position $ij$ is $[L(w+\epsilon)-L(w)]/\epsilon$; store it in $\nabla_{\mbf{W}}\mathcal{L}_{ij}$.

$$\frac{\partial\mathcal{L}}{\partial w}(x,y;w) \approx \frac{L(w+\epsilon) - L(w)}{\epsilon}$$

At the end you have your numerical gradients $\nabla_{\mbf{W}}\mathcal{L}_{ij}$.
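The loop above can be sketched in NumPy as follows. The toy loss `toy_loss` and the helper name `numerical_gradient` are hypothetical, introduced only so that the numerical estimate can be compared against a gradient known in closed form.

```python
import numpy as np

def numerical_gradient(loss_fn, W, eps=1e-5):
    """Forward-difference estimate of d loss / d W, one entry at a time."""
    grad = np.zeros_like(W)
    base = loss_fn(W)                      # L(w), evaluated once
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            old = W[i, j]
            W[i, j] = old + eps            # perturb a single weight
            grad[i, j] = (loss_fn(W) - base) / eps
            W[i, j] = old                  # restore it
    return grad

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)

def toy_loss(W):
    # Hypothetical loss with a known gradient: 0.5 * ||W x||^2
    return 0.5 * np.sum((W @ x) ** 2)

analytic = np.outer(W @ x, x)              # exact gradient of toy_loss
numeric = numerical_gradient(toy_loss, W)
print(np.max(np.abs(numeric - analytic)))  # small, but not exactly zero
```

Note the cost: one extra loss evaluation per weight, which is why this is used for gradient checking and not for training.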

3. Backpropagation¶

Let's be clear on what we need to compute¶

$\forall l \in [1,\ldots,L]$:

$\nabla_{\mbf{W}^l}\mathcal{L}(\mbf{x},y;\{\mbf{W},b\})$
$\nabla_{\mbf{b}^l}\mathcal{L}(\mbf{x},y;\{\mbf{W},b\})$

Once you have gradients on ALL weights $\implies$ We can update¶

$\forall l \in [1,\ldots,L]$:

$\mbf{W}^l \leftarrow \mbf{W}^l - \gamma \nabla_{\mbf{W}^l}\mathcal{L}(\mbf{x},y;\{\mbf{W},b\})$
$\mbf{b}^l \leftarrow \mbf{b}^l - \gamma \nabla_{\mbf{b}^l}\mathcal{L}(\mbf{x},y;\{\mbf{W},b\})$
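A minimal sketch of this update, assuming the parameters and their gradients are kept as a list of per-layer dictionaries (a hypothetical layout chosen for illustration). The gradients here are placeholders; in practice they come from backpropagation.

```python
import numpy as np

def sgd_step(params, grads, gamma=0.1):
    """In-place update theta^l <- theta^l - gamma * grad for every layer l."""
    for layer, layer_grads in zip(params, grads):
        for name in layer:                 # "W" and "b"
            layer[name] -= gamma * layer_grads[name]

params = [{"W": np.ones((2, 3)), "b": np.zeros(2)},
          {"W": np.ones((1, 2)), "b": np.zeros(1)}]
# Placeholder gradients with matching shapes (all ones).
grads = [{name: np.ones_like(value) for name, value in layer.items()}
         for layer in params]

sgd_step(params, grads, gamma=0.1)
print(params[0]["W"][0, 0], params[1]["b"][0])
```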

How do we get the gradients on all the weights?¶

Mostly taken from here

Chain Rule¶

Returning to functions of a single variable, suppose that $y = f(g(x))$ and that the underlying functions $y=f(u)$ and $u=g(x)$ are both differentiable. The chain rule states that

$$\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}.$$

What is the derivative of loss wrt x in the equation below? $$y = loss\big(g(h(i(x)))\big)$$

$$\frac{\partial loss}{\partial x} = \frac{\partial loss}{\partial g} \frac{\partial g}{\partial h}\frac{\partial h}{\partial i}\frac{\partial i}{\partial x}$$
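To make the product of local derivatives concrete, here is a sketch with specific functions chosen purely for illustration (they are assumptions, not part of the lecture): $i(x)=x^2$, $h(u)=\sin u$, $g(v)=3v$ and $loss(w)=w^2$. The chain-rule result is compared with a central finite difference.

```python
import math

x = 0.7
# Forward: keep every intermediate value.
i = x ** 2
h = math.sin(i)
g = 3.0 * h
loss = g ** 2

# Local derivatives, each evaluated at the stored intermediate.
dloss_dg = 2.0 * g
dg_dh = 3.0
dh_di = math.cos(i)
di_dx = 2.0 * x
dloss_dx = dloss_dg * dg_dh * dh_di * di_dx   # chain rule

def f(t):
    # The same composition written as a single function of t.
    return (3.0 * math.sin(t ** 2)) ** 2

eps = 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)   # independent check
print(dloss_dx, numeric)
```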

Chain Rule on Directed Acyclic Graph (DAG)¶

Automate the computation of derivatives with computer science¶

Forward Pass
Backward Pass

Forward Pass¶

This is what we wanted: $$\frac{\partial\mathcal{L}}{\partial x},\frac{\partial\mathcal{L}}{\partial y},\frac{\partial\mathcal{L}}{\partial z}$$

Forward pass

Evaluate the function on input (the function is "hard-coded" with your model/code)
Store "local" derivative at each layer/gate

Here $q = x + y$ and $\mathcal{L} = qz$, so the local derivatives are:

$$\frac{\partial\mathcal{L}}{\partial q}=z, \frac{\partial\mathcal{L}}{\partial z}=q, \frac{\partial q}{\partial x}=1, \frac{\partial q}{\partial y}=1$$

Backward Pass¶

What is the value of the gradient of $\mathcal{L}$ with respect to $y$? $$\frac{\partial\mathcal{L}}{\partial y} = \frac{\partial\mathcal{L}}{\partial q}\frac{\partial q}{\partial y} = z\cdot 1= -4$$

This is what we wanted: $$\frac{\partial\mathcal{L}}{\partial x},\frac{\partial\mathcal{L}}{\partial y},\frac{\partial\mathcal{L}}{\partial z}$$

Backward pass

Start from the scalar loss value
Backpropagate the current derivative/gradient to the earlier layers (toward the inputs)
Use the chain rule to combine a) the local derivative and b) what arrives from "the top"

$$\frac{\partial\mathcal{L}}{\partial q}=z, \frac{\partial\mathcal{L}}{\partial z}=q, \frac{\partial q}{\partial x}=1, \frac{\partial q}{\partial y}=1$$

Backward Pass¶

This is what we wanted: $$\frac{\partial\mathcal{L}}{\partial x},\frac{\partial\mathcal{L}}{\partial y},\frac{\partial\mathcal{L}}{\partial z}$$

This is what we have

$$\frac{\partial\mathcal{L}}{\partial q}=z, \frac{\partial\mathcal{L}}{\partial z}=q, \frac{\partial q}{\partial x}=1, \frac{\partial q}{\partial y}=1$$
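The two passes can be written out by hand in plain Python. This sketch uses the function $\mathcal{L} = (x+y)z$ and the input values $x=-2$, $y=5$, $z=-4$ that appear in this lecture's PyTorch check; the variable names are chosen here for illustration.

```python
x, y, z = -2.0, 5.0, -4.0

# Forward pass: evaluate the function and cache the intermediate q.
q = x + y            # q = 3
L = q * z            # L = -12

# Local derivatives stored at each gate.
dL_dq, dL_dz = z, q
dq_dx, dq_dy = 1.0, 1.0

# Backward pass: multiply what arrives from the top by the local derivative.
dL_dx = dL_dq * dq_dx
dL_dy = dL_dq * dq_dy
print(dL_dx, dL_dy, dL_dz)   # -4.0 -4.0 3.0
```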

Check with our manual derivation ✅¶

The high school way (as we did until now):

$$\frac{\partial\mathcal{L}(x,y,z)}{\partial x} = (\mbf{x}z+yz)^{\prime}=(\mbf{x}z)^{\prime}+(yz)^{\prime} = z = -4$$

$$\frac{\partial\mathcal{L}(x,y,z)}{\partial y} = (xz+\mbf{y}z)^{\prime}=(xz)^{\prime}+(\mbf{y}z)^{\prime} = z = -4$$

$$\frac{\partial\mathcal{L}(x,y,z)}{\partial z} = x+y = +3 $$

You know what? I do not trust math, I want to verify with a machine ✅¶

Pytorch check¶

from torch import tensor

def neural_net(x,y,z):
    return (x+y)*z

x, y, z = tensor(-2., requires_grad=True), tensor(5.,requires_grad=True), tensor(-4., requires_grad=True)
loss = neural_net(x,y,z) # forward pass
loss.backward()          # backward (after this I can check the gradients)
for el in [x,y,z]:
    print(el.grad)

tensor(-4.)
tensor(-4.)
tensor(3.)

Pytorch check¶

Pytorch creates a dynamic computational directed acyclic graph (DAG) under the hood.

We can also see the graph with Pytorch if you want (torchviz simplifies the plot)
! pip install torchviz

from torchviz import make_dot
from torch import tensor
def neural_net(x,y,z):
    return (x+y)*z

x = tensor(-2., requires_grad=True)
y = tensor(5., requires_grad=True)
z = tensor(-4., requires_grad=True)
loss = neural_net(x, y, z)  # forward pass

loss.backward()  # backward (ok now I can check the gradients)
for el in [x, y, z]:
    print(el.grad)
print(loss)
make_dot(loss, params=dict([('x', x),('y', y),('z', z)]))

tensor(-4.)
tensor(-4.)
tensor(3.)
tensor(-12., grad_fn=<MulBackward0>)

We can also see the DAG of AlexNet... [paper from 2012]¶

...but it won't fit my screen¶

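import torch
from torchvision.models import AlexNet
from torchviz import make_dot  # needed if this cell is run on its own

model = AlexNet()
x = torch.randn(1, 3, 227, 227).requires_grad_(True)  # one random RGB image
y = model(x)                                          # forward pass builds the graph
make_dot(y, params=dict(list(model.named_parameters()) + [('x', x)]))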

General Recipe for Chain Rule over DAGs [Forward]¶

Just remember what you have to do at a generic gate:

General Recipe for Chain Rule over DAGs [Backward]¶

Multiply the gradient that you receive from the top by your local gradient
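One way to picture the recipe is a gate object that caches its inputs in the forward pass and multiplies the upstream gradient by its local gradients in the backward pass. The classes `AddGate` and `MulGate` below are hypothetical names for this sketch, not an existing library API; they are wired together to reproduce $\mathcal{L}=(x+y)z$.

```python
class MulGate:
    def forward(self, a, b):
        self.a, self.b = a, b        # cache inputs for the backward pass
        return a * b

    def backward(self, upstream):
        # Local gradients are (b, a); multiply each by the upstream gradient.
        return upstream * self.b, upstream * self.a


class AddGate:
    def forward(self, a, b):
        return a + b

    def backward(self, upstream):
        # Local gradients are (1, 1): addition passes the gradient through.
        return upstream * 1.0, upstream * 1.0


add, mul = AddGate(), MulGate()
q = add.forward(-2.0, 5.0)           # forward: q = x + y
loss = mul.forward(q, -4.0)          # forward: loss = q * z

dq, dz = mul.backward(1.0)           # d loss / d loss = 1 starts the chain
dx, dy = add.backward(dq)
print(dx, dy, dz)
```

The design choice is that each gate only knows its own local derivative; the graph traversal order supplies the rest.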

Exam Question Lookalike¶