Cover#
A scene that left a deep impression~~
Introduction to MLP#
Linear models have many limitations (obviously; not discussed here). To overcome them, we can add hidden layers to the network so that it can handle more general functional relationships. The simplest way is to stack many fully connected layers, each layer feeding its output into the layer above it, until the final output is produced. We can think of the first layers as learning a representation and the last layer as a linear predictor. This architecture is commonly known as a multilayer perceptron (MLP).
Since the composition of affine functions is still an affine function (obviously), simply stacking more linear layers without any other transformation gains nothing for learning nonlinear relationships; it only increases the number of parameters (so not completely meaningless, haha).
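A quick check with formulas (a minimal sketch; the notation is assumed here: input $\mathbf{X}$, weights $\mathbf{W}^{(1)}, \mathbf{W}^{(2)}$ and biases $\mathbf{b}^{(1)}, \mathbf{b}^{(2)}$ of two stacked linear layers):

$$
\mathbf{H} = \mathbf{X}\mathbf{W}^{(1)} + \mathbf{b}^{(1)}, \qquad
\mathbf{O} = \mathbf{H}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}
= \mathbf{X}\underbrace{\mathbf{W}^{(1)}\mathbf{W}^{(2)}}_{\mathbf{W}}
+ \underbrace{\mathbf{b}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}}_{\mathbf{b}}
= \mathbf{X}\mathbf{W} + \mathbf{b}
$$

So two linear layers collapse into a single equivalent linear layer with weights $\mathbf{W}$ and bias $\mathbf{b}$.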
The advantage of the MLP lies in applying a nonlinear activation function to each hidden unit after the affine transformation. The outputs of the activation function are called activations. With the activation function in place, it is generally no longer possible to collapse our multilayer perceptron into a linear model.
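With an activation function $\sigma$ applied elementwise after the first affine map, the collapse above no longer goes through (same assumed notation as before):

$$
\mathbf{H} = \sigma\!\left(\mathbf{X}\mathbf{W}^{(1)} + \mathbf{b}^{(1)}\right), \qquad
\mathbf{O} = \mathbf{H}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}
$$

Because $\sigma$ is nonlinear, $\mathbf{O}$ is in general no longer an affine function of $\mathbf{X}$.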
Structure of MLP#
No need to say more!
After applying a nonlinear activation function in each hidden layer, stacking hidden layers becomes meaningful: the network can learn more complex nonlinear relationships and thus gains more expressive power.
Hidden layers can be made wider or deeper, but there is always a trade-off (omitted here). A rough sketch of such a stacked network is shown below.
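This is only a rough illustration, not the promised code post. A minimal PyTorch sketch of a two-hidden-layer MLP; the layer widths 784/256/128/10 are arbitrary placeholder values, not anything prescribed:

```python
import torch
from torch import nn

# A small MLP: flatten -> affine -> ReLU -> affine -> ReLU -> affine.
# The widths here are arbitrary placeholders.
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256), nn.ReLU(),   # hidden layer 1 + activation
    nn.Linear(256, 128), nn.ReLU(),   # hidden layer 2 + activation
    nn.Linear(128, 10),               # output layer (the "linear predictor")
)

x = torch.randn(32, 1, 28, 28)        # a fake batch of 32 images
print(mlp(x).shape)                   # torch.Size([32, 10])
```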
Activation Functions#
The first one to appear is the famous ReLU (rectified linear unit) function, $\mathrm{ReLU}(x) = \max(x, 0)$. Although it looks very simple, it is indeed one of the most widely used activation functions.
Very straightforward: it turns negative values into 0 and passes positive values through unchanged.
Function graph is omitted
But there is a problem: it is not differentiable at $x = 0$, where the left and right derivatives disagree (which is very annoying; in fact, all boundary cases are annoying), so we simply define the derivative at 0 to be 0, haha, and take an "I don't care" attitude.
It's a bit like a rectifier diode in a circuit: the signal either passes through or is blocked. More importantly, it alleviates the vanishing-gradient problem in neural networks (its derivative is exactly 1 whenever the input is positive).
"Can we keep a small part of the signal passing through by opening the network?" So we have its improved version :
The second one to appear is the sigmoid function, $\mathrm{sigmoid}(x) = \frac{1}{1 + \exp(-x)}$, which squashes any real input into the range $(0, 1)$.
However, the sigmoid is used less often in hidden layers these days and has largely been replaced by the simpler and easier-to-train ReLU. Sigmoid units are still used in recurrent neural networks to gate the flow of information across time steps. When the input is close to 0, the sigmoid function approximates a linear transformation.
(This is actually a nice property)
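A quick sketch of that near-0 behavior (around 0 the sigmoid is approximately the straight line $\tfrac{1}{2} + \tfrac{x}{4}$; the sample points here are arbitrary):

```python
import torch

def sigmoid(x):
    # squashes any real input into the range (0, 1)
    return 1 / (1 + torch.exp(-x))

x = torch.tensor([-5.0, -0.1, 0.0, 0.1, 5.0])
print(sigmoid(x))      # ~0.0067, 0.475, 0.5, 0.525, 0.9933
# Near 0 the curve is almost the straight line 0.5 + x/4:
print(0.5 + x / 4)     # -0.75, 0.475, 0.5, 0.525, 1.75
```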
The last one to be introduced is the tanh (hyperbolic tangent) function, $\tanh(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}$, which squashes any real input into the range $(-1, 1)$. Like the sigmoid, it is close to a linear transformation when the input is near 0, and it is symmetric about the origin.
(This is also a nice property)
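The same kind of sketch for tanh, showing the $(-1, 1)$ range and the symmetry $\tanh(-x) = -\tanh(x)$ (again with arbitrary sample points):

```python
import torch

x = torch.tensor([-5.0, -0.1, 0.0, 0.1, 5.0])
print(torch.tanh(x))                    # ~-0.9999, -0.0997, 0, 0.0997, 0.9999
# Symmetric about the origin, and near 0 it is almost the line y = x:
print(torch.tanh(-x) + torch.tanh(x))   # all (close to) zero
```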
(False) Conclusion#
Of course, there will be more to add later, but let's stop here for now. The code part will be written separately. Bye~