Liang Bo Wang (亮亮), 2015-05-18
Orig. Stanford CS231n course by Andrej Karpathy and
Li Fei-Fei under MIT License
Adapted by Liang2 under CC 4.0 BY license
Esc to overview
← → to navigate
Far more details about DNNs (and CNNs) are left out of this talk due to the time constraint. However, you should get the idea of how to implement the core parts and be able to read the rest of the materials yourself.
Training dataset of images \(x_i \in R^D\) each with label \(y_i\) for \(N\) examples, \(i = 1 \ldots N\) and \(y_i \in 1 \ldots K\). $$f(x_i; W, b) = W x_i + b \quad\quad f: R^D \mapsto R^K$$ where parameters are \(W\) [K x D] called weights and \(b\) [K x 1] called bias vector.
In CIFAR-10 dataset, \(N\) = 50,000, \(D\) = 32 x 32 x 3 = 3072, \(K\) = 10.
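As a minimal sketch of this score function (assuming a random W and b, and a random vector standing in for one flattened CIFAR-10 image):
import numpy as np
D, K = 32 * 32 * 3, 10             # CIFAR-10: 3072-dim inputs, 10 classes
W = 0.001 * np.random.randn(K, D)  # weights, shape (10, 3072)
b = np.zeros((K, 1))               # bias vector, shape (10, 1)
x_i = np.random.randn(D, 1)        # stand-in for one flattened image, shape (3072, 1)
scores = W.dot(x_i) + b            # f(x_i; W, b), one score per class, shape (10, 1)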
$$
L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)
$$
\(e^{f_{y_i}}\) acts like an unnormalized probability; dividing by \(\sum_j e^{f_j}\) normalizes it into the probability of the correct class. Since \(-\log\) is decreasing (e.g. \(-\log 0.9 < -\log 0.01\)), a higher score for the correct class leads to a lower loss.
For numerical stability, compute \(
\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}
= \frac{C \cdot e^{f_{y_i}}}{C \cdot \sum_j e^{f_j}}
= \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}
\) and let \(
\log C = -\max_j f_j
\) so that the exponential of the largest (shifted) score is normalized to 1.
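A minimal sketch of this trick in numpy (the scores below are made-up values chosen so that a naive exp() would overflow):
import numpy as np
f = np.array([123.0, 456.0, 789.0])  # example class scores; np.exp(789) overflows
f -= np.max(f)                       # shift so the largest score becomes 0
p = np.exp(f) / np.sum(np.exp(f))    # safe softmax probabilities
L_i = -np.log(p[2])                  # loss, assuming class 2 is the correct label y_i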
# Vanilla Minibatch Gradient Descent
while it < max_iter:
    data_batch = sampling(data, 256)        # sample a minibatch of 256 examples
    dw = gradient(loss_fun, data_batch, w)  # gradient of the loss w.r.t. w
    w += -step_size * dw                    # update parameters along the negative gradient
This becomes Stochastic Gradient Descent (SGD) when the batch contains only one sample.
To see what the CIFAR-10 dataset looks like and how the data are preprocessed, try notebook 00.
For softmax linear classifier, try notebook 01.
We model a neuron with two parts: a linear combination \(\sum_i w_i x_i\) and an activation function \(f\)
Total number of parameters (weights and biases), counted per layer as (#inputs + 1) x #neurons:
Left is (3 + 1) x 4 + (4 + 1) x 2 = 26.
Right is (3 + 1) x 4 + (4 + 1) x 4 + (4 + 1) x 1 = 41.
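A quick Python check of these two counts (the +1 per input accounts for each neuron's bias):
def n_params(layer_sizes):
    # weights + biases of a fully connected net, e.g. layer_sizes = [3, 4, 2]
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(n_params([3, 4, 2]))     # 26, the left network
print(n_params([3, 4, 4, 1]))  # 41, the right network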
# forward pass of a 3-layer neural network
import numpy as np
f = lambda x: np.maximum(0, x)  # ReLU activation
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)  # randomly initialized parameters
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)
x = np.random.randn(3, 1)       # input vector (3x1)
h1 = f(np.dot(W1, x) + b1)      # 1st hidden layer (4x1)
h2 = f(np.dot(W2, h1) + b2)     # 2nd hidden layer (4x1)
out = np.dot(W3, h2) + b3       # output neuron (1x1)
The dims of the Ws and bs: W1 (4x3), b1 (4x1), W2 (4x4), b2 (4x1), W3 (1x4), b3 (1x1).
Note that x can be (3xN) for a batch of N inputs; there is no activation function in the output layer.
$$ \begin{aligned} f(x, y, z) = (x + y) z \\ q = x + y, \enspace f = qz \end{aligned} $$ $$ \begin{aligned} \Rightarrow \frac{\partial q}{\partial x} &= \frac{\partial q}{\partial y} = 1 \end{aligned} $$ $$ \begin{aligned} \Rightarrow \frac{\partial f}{\partial x} &= \frac{\partial f}{\partial q}\frac{\partial q}{\partial x} = z \cdot 1\\ \frac{\partial f}{\partial y} &= \frac{\partial f}{\partial q}\frac{\partial q}{\partial y} = z \cdot 1\\ \frac{\partial f}{\partial z} &= q \end{aligned} $$
# set some inputs
x = -2; y = 5; z = -4
# perform the forward pass
q = x + y # q = 3
f = q * z # f becomes -12
# perform the backward pass (back prop)
# in reverse order:
# first backprop through f = q * z
dfdz = q # = 3
dfdq = z # = -4
# now backprop through q = x + y
dfdx = 1.0 * dfdq # = -4, dq/dx = 1
dfdy = 1.0 * dfdq # = -4, dq/dy = 1
Think of a sigmoid function,
\(
f(w,x) = \frac{1}{1+e^{-(w_0x_0 + w_1x_1 + w_2)}}
\).
First compute all the building blocks,
$$
\begin{aligned}
f(x) = \frac{1}{x}
\quad &\rightarrow \quad
\frac{df}{dx} = -1/x^2
\\
f_c(x) = c + x
\quad &\rightarrow \quad
\frac{df}{dx} = 1
\\
f(x) = e^x
\quad &\rightarrow \quad
\frac{df}{dx} = e^x = f(x)
\\
f_a(x) = ax
\quad &\rightarrow \quad
\frac{df}{dx} = a
\end{aligned}
$$
Given initial w0, x0, w1, x1, w2 values (cont'd on next page)
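As a sketch (using assumed example values for w and x; the slide's actual numbers are on the figure), the forward and backward passes through this sigmoid neuron chain the building blocks above:
import math
w = [2.0, -3.0, -3.0]                        # assumed example weights w0, w1, w2
x = [-1.0, -2.0]                             # assumed example inputs x0, x1
# forward pass
dot = w[0] * x[0] + w[1] * x[1] + w[2]       # linear combination
f = 1.0 / (1 + math.exp(-dot))               # sigmoid output
# backward pass, chaining the building-block derivatives
ddot = (1 - f) * f                           # gradient on dot (the sigmoid derivative simplifies to (1 - f) * f)
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]  # gradients on w0, w1, w2
dx = [w[0] * ddot, w[1] * ddot]              # gradients on x0, x1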
# forward pass
W = np.random.randn(5, 10)
X = np.random.randn(10, 3)
D = W.dot(X) # shape (5, 3)
# ... suppose we had the gradient on D from upper stage
dD = np.random.randn(*D.shape) # same shape as D
dW = dD.dot(X.T) # X.T = X's transpose matrix
dX = W.T.dot(dD)
Now we are able to build a 2-layer fully connected (FC) NN with the structure: input - FC layer - ReLU - FC layer - softmax - output class scores
It's in notebook 02.
Since we are almost out of time, we will probably just look at the final result to get the whole picture.
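For that whole picture, here is a minimal (untrained, randomly initialized) sketch of the 2-layer architecture's forward pass and softmax loss on a small batch; the full training loop is in notebook 02:
import numpy as np
N, D, H, K = 5, 3072, 100, 10                    # batch size, input dim, hidden units, classes
X = np.random.randn(D, N)                        # a batch of N inputs as columns
y = np.random.randint(K, size=N)                 # their (made-up) labels
W1, b1 = 0.001 * np.random.randn(H, D), np.zeros((H, 1))
W2, b2 = 0.001 * np.random.randn(K, H), np.zeros((K, 1))
h = np.maximum(0, W1.dot(X) + b1)                # FC layer + ReLU, shape (H, N)
scores = W2.dot(h) + b2                          # FC layer, shape (K, N)
scores -= scores.max(axis=0)                     # numerical stability shift
p = np.exp(scores) / np.exp(scores).sum(axis=0)  # softmax over classes
loss = -np.log(p[y, np.arange(N)]).mean()        # average cross-entropy loss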
v = mu * v - learning_rate * dx  # integrate velocity; mu is the momentum coefficient (e.g. 0.9)
x += v                           # integrate position
Animations that may help your intuition about the learning process dynamics. Left: contours of a loss surface and the time evolution of different optimization algorithms. Notice the "overshooting" behavior of momentum-based methods, which makes the optimization look like a ball rolling down the hill. Right: a visualization of a saddle point in the optimization landscape, where the curvature along different dimensions has different signs (one dimension curves up and another down). Notice that SGD has a very hard time breaking symmetry and gets stuck on the top. Conversely, algorithms such as RMSprop see very low gradients in the saddle direction. Due to the denominator term in the RMSprop update, this increases the effective learning rate along this direction, helping RMSprop proceed.
Ref: CS231n Note: Neural Network 3 and Image credit: Alec Radford.
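For comparison with the momentum update above, a sketch of the RMSprop update from that note (cache is a running average of squared gradients; its square root in the denominator is what raises the effective learning rate where gradients are small):
cache = decay_rate * cache + (1 - decay_rate) * dx**2  # running average of squared gradients
x += -learning_rate * dx / (np.sqrt(cache) + eps)      # per-parameter scaled update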
Input viewed as 32 width x 32 height x 3 channels (depth)
Output viewed as 1 x 1 x 10. Spatial information is preserved.
Receptive field: the height and width of the spatially bounded region (spanning the full depth) of the previous layer that a neuron in the next layer connects to.
An FC layer with \(K = 4096\) looking at a conv layer output volume of [7x7x512] is equivalent to a conv layer with filter size \(F = 7\), giving an output volume of [1x1x4096].
The final FC layer with \(K = 1000\) looking at that [1x1x4096] volume is equivalent to a conv layer with \(F = 1\), giving the output [1x1x1000].
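One quick way to see this FC-to-conv equivalence: the converted conv layer has exactly the same number of weights as the FC layer it replaces.
fc_weights = (7 * 7 * 512) * 4096    # FC: 4096 neurons, each connected to the whole 7x7x512 volume
conv_weights = 4096 * (7 * 7 * 512)  # conv: 4096 filters of size 7x7x512 (F = 7, full depth)
assert fc_weights == conv_weights    # 102,760,448 weights either way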
That's all you need for a ConvNet :)
INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
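For example, a hypothetical instantiation with \(N = 2\), \(M = 3\), \(K = 2\) expands to:
INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC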
Notebook 03 works through the forward/backward implementations for all layer types.
Notebook 04 creates a CONV-RELU-POOL-FC two-layer ConvNet.