Backpropagation Algorithm
Posted on 17 Jan 2015, tagged algorithm, deep learning
These days I have started learning neural networks again and writing some Matlab code from scratch. I try to understand everything I do while I write the code, so here I derive the equations behind back propagation, while trying to keep them clear and easy to understand.
Neural Network Functions
A multi-layer neural network can be defined like this:
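In the usual feedforward form (a sketch of how I assume the layers are defined, consistent with the variable list below), each layer feeds into the next as:

\[
\begin{aligned}
a_1 &= x \\
z_{i+1} &= a_i w_i + b_i \\
a_{i+1} &= \sigma(z_{i+1})
\end{aligned}
\]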
Assume \(layer_i\) means the number of neurons in layer \(i\). Then the variables in the equations can be explained as below:

- \(x\) is the input, which is a row vector with \(layer_1\) elements.
- \(w_i\) means the weights in layer \(i\), which is a matrix with \(layer_i\) rows and \(layer_{i+1}\) columns.
- \(b_i\) means the biases in layer \(i\), which is a row vector with \(layer_{i+1}\) elements.
- \(a_i\) means the activation function in the \(i\)th layer, whose output is a row vector with \(layer_i\) elements.
- \(l\) means the last layer. The output of \(a_l\) is the output of the neural network.
And \(\sigma(z)\) may be different in different use cases. Here is one example:
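One common choice (assumed here for concreteness) is the sigmoid function, applied element-wise, together with its derivative which we will need later:

\[
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma'(z) = \sigma(z)\,(1 - \sigma(z))
\]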
Gradient Descent Algorithm
We need a cost function to measure how well we are doing so far. The training of the network then becomes an optimization problem. The method we use for this problem is gradient descent. Let me try to explain it.
Assume we have a cost function, and it is always non-negative. For example, this function is a good one:
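One standard candidate (assumed here) is the quadratic cost over a training example \((x, y)\):

\[
C(w, b) = \frac{1}{2} \, \lVert y - a_l \rVert^2
\]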
Then the goal is to make the output of the cost function smaller. Since \(x\) and \(y\) are fixed, the change of the cost function when \(w\) and \(b\) change a little can be shown as this:
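To first order, that change is:

\[
\Delta C \approx \frac{\partial C}{\partial w} \Delta w + \frac{\partial C}{\partial b} \Delta b
\]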
In order to make the cost function smaller, we need to make \(\Delta C\) negative. We can choose \(\Delta w = - {\partial C \over \partial w}\) and \(\Delta b = - {\partial C \over \partial b}\), so that \(\Delta C = - ({\partial C \over \partial w}) ^ 2 - ({\partial C \over \partial b}) ^ 2\), which is never positive.
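As a minimal sketch of one update step in Matlab (all names here are illustrative, and the step is scaled by a small learning rate `eta`, an addition beyond the text above, so the linear approximation of \(\Delta C\) stays valid):

```matlab
% One gradient descent step, assuming the gradients dC_dw and dC_db
% have already been computed (hypothetical variable names).
eta = 0.5;              % learning rate (an assumption)
w = w - eta * dC_dw;    % Delta w = -eta * dC/dw
b = b - eta * dC_db;    % Delta b = -eta * dC/db
```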
Computing the Partial Derivatives
So the goal is to compute the partial derivatives \(\partial C \over \partial w\) and \(\partial C \over \partial b\). Using equations (1), (2), (3), (5) and the chain rule, we can get this:
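With the layer definitions above and the row-vector convention, the chain rule gives something of this form (a sketch; the exact shapes depend on how the layers were defined):

\[
\begin{aligned}
\frac{\partial C}{\partial b_i} &= \frac{\partial C}{\partial a_{i+1}} \odot \sigma'(z_{i+1}) \\
\frac{\partial C}{\partial w_i} &= a_i^T \left( \frac{\partial C}{\partial a_{i+1}} \odot \sigma'(z_{i+1}) \right) \\
\frac{\partial C}{\partial a_i} &= \left( \frac{\partial C}{\partial a_{i+1}} \odot \sigma'(z_{i+1}) \right) w_i^T
\end{aligned}
\]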
Note that we use \(\odot\) before \(\sigma^{'}\) because the function \(\sigma(z)\) is applied element-wise.
We can find that these equations share many identical parts: \({\partial C \over \partial a_{i+1}} \odot {\sigma^{'}(z_{i+1})}\). So we can define \(\delta_i = {\partial C \over \partial a_i} \odot {\sigma^{'}(z_i)}\) and rewrite the equations like this, to avoid computing these parts many times:
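In that form the equations become (again a sketch under the same conventions):

\[
\begin{aligned}
\delta_l &= \frac{\partial C}{\partial a_l} \odot \sigma'(z_l) \\
\delta_i &= \left( \delta_{i+1} \, w_i^T \right) \odot \sigma'(z_i) \\
\frac{\partial C}{\partial b_i} &= \delta_{i+1} \\
\frac{\partial C}{\partial w_i} &= a_i^T \, \delta_{i+1}
\end{aligned}
\]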
With these equations, we can write the back propagation algorithm easily.
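For reference, here is a minimal Matlab sketch of the whole procedure for a single training example, assuming a sigmoid activation, a quadratic cost, and weights and biases stored in cell arrays `w` and `b` (all names are illustrative, not the actual code from this post):

```matlab
% Backpropagation sketch for a single training example (x, y), both row vectors.
% Assumes sigmoid activations and quadratic cost; w{i} is layer_i x layer_{i+1},
% b{i} is 1 x layer_{i+1}.
sigma  = @(z) 1 ./ (1 + exp(-z));
dsigma = @(z) sigma(z) .* (1 - sigma(z));

% Forward pass: z{i+1} = a{i} * w{i} + b{i}, a{i+1} = sigma(z{i+1}).
a{1} = x;
for i = 1:numel(w)
    z{i+1} = a{i} * w{i} + b{i};
    a{i+1} = sigma(z{i+1});
end
l = numel(a);                               % index of the last layer

% Backward pass: delta{i} = dC/da{i} .* sigma'(z{i}).
delta{l} = (a{l} - y) .* dsigma(z{l});      % dC/da_l for the quadratic cost
for i = l-1:-1:2
    delta{i} = (delta{i+1} * w{i}') .* dsigma(z{i});
end

% Gradients and one gradient descent step.
eta = 0.5;                                  % learning rate (an assumption)
for i = 1:numel(w)
    dw{i} = a{i}' * delta{i+1};             % dC/dw_i
    db{i} = delta{i+1};                     % dC/db_i
    w{i}  = w{i} - eta * dw{i};
    b{i}  = b{i} - eta * db{i};
end
```

In practice the gradients would be averaged over a mini-batch before the update, but the per-example computation is exactly the set of equations derived above.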