This is part of a megapost about the 3Blue1Brown series on deep learning.
Overview
This is intended as a lightweight introduction to the topic. The motivation is the very hard task of programmatically classifying handwritten digits.
There are many variants of neural networks.
Plain vanilla – Multilayer perceptron
Neuron: A thing that holds a number in \([0,1]\). The number inside the neuron is called ‘activation’.
All 784 neurons (one per pixel of the 28×28 input image) make up the first layer of the network.
In the last layer we have only 10 neurons, one per digit; the one with the highest activation is the network's answer.
There are 2 hidden layers with 16 neurons each.
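To make the architecture concrete, here is a minimal NumPy sketch of a 784–16–16–10 network. Only the layer sizes come from the video; the function names, random initialization, and input are illustrative (a real network would have learned weights):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1), so every activation lands in [0, 1].
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes from the video: 784 input pixels, two hidden layers of 16, 10 output digits.
layer_sizes = [784, 16, 16, 10]

# Randomly initialized weights and biases (purely illustrative; training would set these).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in layer_sizes[1:]]

def forward(pixels):
    # pixels: 784 grayscale values in [0, 1], one per pixel of the 28x28 image.
    a = np.asarray(pixels).reshape(784, 1)
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a  # 10 activations; the index of the largest one is the network's guess

digit = int(np.argmax(forward(np.random.rand(784))))
```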
The Component Analogy
We hope that each middle layer represents the components of the numbers. (Note: this always seems like a claim to me; I have never seen a proof of it.)
The analogy goes further. We dissect the circles into single edges, and those into pixels (or the other way around, seen from input to output).
How to design the activation flow
Pixels -> Edges -> Patterns -> Digits.
The task at hand is: which dials (weights and biases) have to be turned so that the network reliably recognizes a pattern?
Take all the activations of the first layer and compute their weighted sum.
Maybe you only want the neuron to become active when the weighted sum exceeds some threshold. For that we add a negative bias (the threshold), so the neuron only gets meaningfully active above it, as in the sketch below.
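A minimal sketch of that computation for a single neuron in the second layer. The pixel activations, weights, and bias value are made up; the squashing into \([0,1]\) uses the sigmoid, as in the video:

```python
import numpy as np

def sigmoid(z):
    # Squashes the result into (0, 1), the range an activation is allowed to take.
    return 1.0 / (1.0 + np.exp(-z))

pixels = np.random.rand(784)   # activations of the first layer (made up)
w = np.random.randn(784)       # one weight per first-layer neuron (made up)
bias = -10.0                   # negative bias: only fire once the weighted sum exceeds 10

weighted_sum = np.dot(w, pixels)           # w_1*a_1 + w_2*a_2 + ... + w_784*a_784
activation = sigmoid(weighted_sum + bias)  # meaningfully active only above the threshold
```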
Notation
Because we want to standardize everything.
All activations are represented as a column vector, the weights feeding each neuron of the next layer as a row of the weight matrix, and the biases as a column vector. The \( \sigma \) function is the sigmoid mentioned above, applied to each of the \( k \) neurons in the next layer.
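Put together, this gives the usual compact form (my reconstruction from the definitions above, with \( \mathbf{a}^{(0)} \) the input activations, \( W \) the weight matrix, and \( \mathbf{b} \) the bias vector):

\[
\mathbf{a}^{(1)} = \sigma\!\left( W \mathbf{a}^{(0)} + \mathbf{b} \right),
\qquad
a^{(1)}_k = \sigma\!\left( \sum_{j} W_{kj}\, a^{(0)}_j + b_k \right)
\]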