Computer Science
Algorithm
Data Processing
Digital Life
Distributed System
Distributed System Infrastructure
Machine Learning
Backpropagation Algorithm (2015)
Operating System
Android
Linux
MacOS
Tizen
Windows
iOS
Programming Language
C++
Erlang
Go
Scala
Scheme
Type System
Software Engineering
Storage
UI
Flutter
Javascript
Virtualization
Life
Life in Guangzhou (2013)
Recent Works (2013)
东京之旅 (2014)
My 2017 Year in Review (2018)
My 2020 in Review (2021)
十三年前被隔离的经历 (2022)
A Travel to Montreal (2022)
My 2022 in Review (2023)
Travel Back to China (2024)
A 2-Year Reflection for 2023 and 2024 (2025)
Projects
Bard
Blog
RSS Brain
Scala2grpc
Comment Everywhere (2013)
Fetch Popular Erlang Modules by Coffee Script (2013)
Psychology
耶鲁大学心理学导论 (2012)
Thoughts
Chinese
English

Backpropagation Algorithm

Posted on 17 Jan 2015, tagged algorithmdeep learning

These days I start to learn neural networks again and write some Matlab codes from scratch. I try to understand everything I do while I write the code, so I derive the equations in the back propagation while try to keep it clear and easy to understand.

Neural Network Functions

A multi layer neural network could be defined as this:

\begin{equation} a_1(x, w, b) = x \\ \end{equation} \begin{equation} z_i(x, w, b) = a_{i-1}(x, w, b) \cdot w_{i-1} + b_{i-1} \end{equation} \begin{equation} a_i(x, w, b) = \sigma(z_i(x, w, b)) \end{equation}

Assume \(layer_i\) means the number of neural in layer i, then the variables in the equations could be explained as below:

  • x is the input, which is a row vector, it has \(layer_1\) elements.
  • \(w_i\) meas the weights in layer i, which is a matrix of \(layer_i\) rows and \(layer_{i+1}\) columns.
  • \(b_i\) means biases in layer i, which is a row vector of \(layer_{i+1}\) elements.
  • \(a_i\) means activation function in the ith layer, which the output is a row vector, it has \(layer_i\) elements.
  • l means the last layer. The output of \(a_l\) is the output of the neural network.

And \(\sigma(z)\) may be different in different use cases. This one is an example:

\begin{equation} \sigma(z) = {1 \over 1 + e^{-z}} \end{equation}

Gradient Descent Algorithm

We need a cost function to measure how well do we do for now. And the training of the network becomes a optimization problem. The method we use in the problem is gradient descent. Let me try to explain it.

Assume we have a cost function, and it is always non-negative. For example, this function is a good one:

\begin{equation} C(x, y, w, b) = {1 \over 2} (y - a_l(x, w, b)) ^ 2 \end{equation}

Then the goal is try to make the output of the cost function smaller. Since x and y is fixed, the change of cost function while change w and b little could be shown as this:

\begin{equation} \Delta C = {\partial C \over \partial w} \Delta w + {\partial C \over \partial b} \Delta b \end{equation}

In order to make the cost function smaller, we need to make \(\Delta C\) negative. We can make \(\Delta w = - {\partial C \over \partial w}\) and \(\Delta b = - {\partial C \over \partial b}\). So that \(\Delta C = - ({\partial C \over \partial w}) ^ 2 - ({\partial C \over \partial b}) ^ 2\) which is always negative.

Compute Partial Derivative

So the goal is to compute the partial derivative \(\partial C \over \partial w\) and \(\partial C \over \partial b\). Using equations (1), (2), (3), (5) and chain rule, we can get this:

\begin{equation} {\partial C \over \partial a_l} = a_l - y \end{equation} \begin{equation} {\partial C \over \partial a_i} = {\partial C \over \partial a_{i+1}} {\partial a_{i+1} \over \partial \sigma} {\partial \sigma \over \partial z_{i+1}} {\partial z_{i+1} \over \partial a_i} = {\partial C \over \partial a_{i+1}} \odot {\sigma^{'}(z_{i+1})} w_i^{'} \end{equation} \begin{equation} {\partial C \over \partial w_i} = {\partial C \over \partial a_{i+1}} {\partial a_{i+1} \over \partial \sigma} {\partial \sigma \over \partial z_{i+1}} {\partial z_{i+1} \over \partial w_i} = a_i^{'} {\partial C \over \partial a_{i+1}} \odot {\sigma^{'}(z_{i+1})} \end{equation} \begin{equation} {\partial C \over \partial b_i} = {\partial C \over \partial a_{i+1}} {\partial a_{i+1} \over \partial \sigma} {\partial \sigma \over \partial z_{i+1}} {\partial z_{i+1} \over \partial b_i} = {\partial C \over \partial a_{i+1}} \odot {\sigma^{'}(z_{i+1})} \end{equation}

Note that we use \(\odot\) before \(\sigma^{'}\) is because function \(\sigma(z)\) is element wise.

We can find there are many same parts in these equations: \({\partial C \over \partial a_{i+1}} \odot {\sigma^{'}(z_{i+1})}\). So we can define \(\delta_i = {\partial C \over \partial a_i} \odot {\sigma^{'}(z_i)}\), and rewrite these equations like this to avoid compute these parts many times:

\begin{equation} \delta_l = (a_l - y) \odot \sigma^{'}(z_l) \end{equation} \begin{equation} \delta_i = \delta_{i+1} w_i^{'} \odot \sigma^{'}(z_i) \end{equation} \begin{equation} {\partial C \over \partial w_i} = a_i^{'} \delta_{i+1} \end{equation} \begin{equation} {\partial C \over \partial b_i} = \delta_{i+1} \end{equation}

With these equations, we can write the back propagation algorithm easily.