
Chapter 3 - Linear Models

Introduction#

I have previously read "Dive into Deep Learning" (d2l) by Li Mu and others, and found it to be excellent. However, I did not put enough effort into summarizing and practicing, and I feel I did not make the most of such high-quality learning material. Therefore, I am writing this article as a summary of my learning journey, and I hope it can help those in need.
Due to the length, only the theoretical part is summarized.

Exercise solutions can be found at https://datawhalechina.github.io/d2l-ai-solutions-manual/#/

Linear Regression#

The model is $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$, where $\mathbf{w} \in \mathbb{R}^d$ and $\mathbf{x} \in \mathbb{R}^d$ is the feature vector of a single data sample.
For the feature matrix $\mathbf{X}$, $\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} + b$ gives the prediction vector.

The loss function uses the mean squared error:

$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.$$

After summing and averaging over the number of samples:

$$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$

Therefore, we hope to find $\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\ L(\mathbf{w}, b)$. For the analytical solution we minimize $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$ (with the bias absorbed into $\mathbf{X}$ as a column of ones); setting the derivative to zero gives $\mathbf{w}^* = (\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf{y}$.
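
As an illustration, here is a minimal NumPy sketch (my own, not from the book) of the closed-form solution; the bias is absorbed into $\mathbf{X}$ by appending a column of ones, and the synthetic data parameters are made up:

```python
import numpy as np

# Minimal sketch (not from the book): closed-form least squares with NumPy.
rng = np.random.default_rng(0)
n, d = 1000, 2
X = rng.normal(size=(n, d))
true_w, true_b = np.array([2.0, -3.4]), 4.2          # made-up "ground truth"
y = X @ true_w + true_b + rng.normal(scale=0.01, size=n)

# Absorb the bias by appending a column of ones to X.
X_aug = np.hstack([X, np.ones((n, 1))])
# w* = (X^T X)^{-1} X^T y; lstsq computes the same thing in a numerically stable way.
w_star, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(w_star)   # approximately [ 2.0, -3.4,  4.2]
```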


SGD#

For more general problems, we use gradient descent, which updates the parameters in the direction of the negative gradient. In practice, for computational efficiency, we randomly sample a small batch of examples at each step; this is called minibatch stochastic gradient descent. The update rule is as follows:

$$\begin{aligned} \mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b),\\ b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b). \end{aligned}$$

For the linear model mentioned above, it can be expressed as:

$$\begin{aligned} \mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right),\\ b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned}$$
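
Here is a minimal NumPy sketch of these updates (my own illustration, not from the book; the learning rate, batch size, and synthetic data are made-up choices):

```python
import numpy as np

# Minimal sketch (not from the book): minibatch SGD for linear regression.
rng = np.random.default_rng(0)
n, d = 1000, 2
X = rng.normal(size=(n, d))
true_w, true_b = np.array([2.0, -3.4]), 4.2          # made-up "ground truth"
y = X @ true_w + true_b + rng.normal(scale=0.01, size=n)

w, b = np.zeros(d), 0.0
lr, batch_size, num_epochs = 0.03, 10, 3

for epoch in range(num_epochs):
    indices = rng.permutation(n)
    for start in range(0, n, batch_size):
        batch = indices[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        err = Xb @ w + b - yb                  # w^T x^(i) + b - y^(i) for the batch
        w -= lr / len(batch) * (Xb.T @ err)    # update w
        b -= lr / len(batch) * err.sum()       # update b
    print(f"epoch {epoch + 1}, loss {np.mean((X @ w + b - y) ** 2) / 2:.6f}")

print(w, b)   # close to [2.0, -3.4] and 4.2
```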

Minimizing Mean Squared Error#

When the noise follows a Gaussian distribution, minimizing mean squared error is equivalent to maximum likelihood estimation for the linear model. Therefore, minimizing mean squared error is reasonable. The proof is as follows:

When $y = \mathbf{w}^\top \mathbf{x} + b + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$,

$$\begin{aligned} P(y \mid \mathbf{x}) &= \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (y - \mathbf{w}^\top \mathbf{x} - b)^2\right),\\ P(\mathbf y \mid \mathbf X) &= \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}). \end{aligned}$$

Substituting in:

$$\begin{aligned} &\operatorname*{argmin}_{\mathbf{w},b}\,\{-\log P(\mathbf y \mid \mathbf X)\} \\ =\ &\operatorname*{argmin}_{\mathbf{w},b}\,\left\{\sum_{i=1}^n \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2\right\}. \end{aligned}$$

The first term is a constant, and the second term's constant coefficient can be ignored. Therefore, it can be seen that minimizing mean squared error is equivalent to maximum likelihood estimation for the linear model when the noise follows a Gaussian distribution.


Classification - Introduction#

This part mainly deals with classification problems.

First, to encode the output of a classification problem we use one-hot vectors. For example, for the characters "Florian", "Xutaerke", and "Feren", we encode "Florian" as $(1,0,0)$, "Xutaerke" as $(0,1,0)$, and "Feren" as $(0,0,1)$.
We do not use natural numbers such as 1, 2, 3 for the encoding, because natural numbers carry an ordering that does not naturally exist among categories (for example, we cannot say that "Florian" looks more like "Xutaerke" than like "Feren", yet encoding with 1, 2, 3 implies this to some extent).
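
A tiny sketch of this encoding (my own illustration, not from the book; the label list is made up):

```python
import numpy as np

# Minimal sketch (not from the book): one-hot encoding of category labels.
categories = ["Florian", "Xutaerke", "Feren"]
labels = ["Feren", "Florian", "Florian"]            # some made-up example labels

index = np.array([categories.index(name) for name in labels])
one_hot = np.eye(len(categories))[index]            # pick rows of the identity matrix
print(one_hot)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [1. 0. 0.]]
```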

Second, the network structure for a classification problem is a single-layer network similar to the linear model; the difference is that the output is a vector rather than a single real number. This output is called the (unnormalized) prediction, or logit.

For a single sample, in matrix form, $\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$, where $\mathbf{W}$ and $\mathbf{b}$ are the parameters to be learned.
For example, for a 3-class problem with 4 features, $\mathbf{x}$ is a $4 \times 1$ vector, $\mathbf{W}$ is a $3 \times 4$ matrix, and $\mathbf{b}$ and $\mathbf{o}$ are both $3 \times 1$ vectors.
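
To make the shapes concrete, here is a small sketch (my own, not from the book) of the 4-feature, 3-class case with made-up random parameters:

```python
import numpy as np

# Minimal sketch (not from the book): o = W x + b for 4 features and 3 classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # 3x4 weight matrix
b = rng.normal(size=(3, 1))   # 3x1 bias
x = rng.normal(size=(4, 1))   # one sample with 4 features

o = W @ x + b                 # unnormalized predictions (logits)
print(o.shape)                # (3, 1)
```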


It can be seen that fully connected layers are computationally expensive. A paper proposes a way to reduce this cost (see "Others" below).


Classification - Softmax Regression#

Now we have the output $\mathbf{o}$; however, since $\mathbf{o}$ does not satisfy the axioms of probability, we use the so-called softmax operation to normalize and calibrate it.

$$\begin{aligned} \hat{\mathbf{y}} &= \mathrm{softmax}(\mathbf{o}),\\ \hat{y}_j &= \frac{\exp(o_j)}{\sum_k \exp(o_k)}. \end{aligned}$$

Therefore, $\hat{\mathbf{y}}$ can be regarded as a vector of predicted probabilities. From the definition, the softmax operation preserves the ordering of its inputs, so we can find the most likely category by simply taking the largest $o_j$.
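
A minimal sketch of the softmax operation (my own, not from the book). Subtracting the maximum before exponentiating is a common numerical-stability trick that the text does not mention; it does not change the result because softmax is invariant to shifting all inputs by a constant:

```python
import numpy as np

def softmax(o):
    # Subtracting the max avoids overflow in exp(); softmax(o) == softmax(o - c).
    o = o - o.max()
    exp_o = np.exp(o)
    return exp_o / exp_o.sum()

o = np.array([1.0, 2.0, 0.5])
y_hat = softmax(o)
print(y_hat, y_hat.sum())              # entries in (0, 1) that sum to 1
print(y_hat.argmax() == o.argmax())    # softmax preserves the ordering -> True
```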

We still call softmax regression a linear model because its parameters are concentrated in the linear part (the softmax itself has no parameters); the outputs are determined by an affine transformation of the input features.

In actual GPU computation, we use vectorized calculations over minibatches:

$$\begin{aligned} &\mathbf{X}\in \mathbb{R}^{n\times d},\quad \mathbf{W}\in \mathbb{R}^{d\times q},\quad \mathbf{b}\in \mathbb{R}^{1\times q},\quad \mathbf{O}\in \mathbb{R}^{n\times q},\quad \hat{\mathbf{Y}}\in \mathbb{R}^{n\times q},\\ &\mathbf{O}=\mathbf{X}\mathbf{W}+\mathbf{b},\\ &\hat{\mathbf{Y}}=\mathrm{softmax}(\mathbf{O}). \end{aligned}$$

Batch vectorization turns many matrix-vector products into a single matrix-matrix multiplication $\mathbf{X}\mathbf{W}$, which is much faster.
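
A minibatch version of the same computation (my own sketch, not from the book), matching the shapes above:

```python
import numpy as np

# Minimal sketch (not from the book): vectorized softmax regression forward pass.
n, d, q = 5, 4, 3                        # batch size, features, classes
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
W = rng.normal(size=(d, q))
b = np.zeros((1, q))

O = X @ W + b                            # (n, q) logits; b is broadcast over the rows
O = O - O.max(axis=1, keepdims=True)     # row-wise stability shift
Y_hat = np.exp(O) / np.exp(O).sum(axis=1, keepdims=True)
print(Y_hat.shape, Y_hat.sum(axis=1))    # (5, 3), each row sums to 1
```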

Similarly, using maximum likelihood, we minimize the negative log-likelihood. The difference is that the loss function is now

$$l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j.$$

This is called cross-entropy loss, which is one of the commonly used loss functions for classification problems.
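
A minimal sketch of this loss (my own, not from the book); with a one-hot label the sum reduces to the log-probability assigned to the true class:

```python
import numpy as np

# Minimal sketch (not from the book): cross-entropy loss with a one-hot label.
y = np.array([0.0, 0.0, 1.0])        # one-hot label: the true class is 2
y_hat = np.array([0.1, 0.2, 0.7])    # predicted probabilities

loss = -(y * np.log(y_hat)).sum()    # full sum over all classes
print(loss)                          # == -log(0.7)

# Equivalent shortcut: only the true class contributes to the sum.
print(-np.log(y_hat[y.argmax()]))
```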

The entropy of a distribution $P$ is defined as $H[P] = -\sum_j P(j) \log P(j)$. One of the fundamental theorems of information theory states that at least $H[P]$ nats are needed to encode data drawn at random from the distribution $P$.

If we can easily predict the next piece of data, then the data is easy to compress. If we cannot predict every event perfectly, we are sometimes "surprised". We can think of cross-entropy as the expected surprise of an observer with subjective probabilities $Q$ when seeing data actually generated according to probabilities $P$. When $P = Q$, the cross-entropy attains its minimum, $H(P, P) = H(P)$.

Derivative of Softmax#

$$\begin{aligned} l(\mathbf{y}, \hat{\mathbf{y}}) &= - \sum_{j=1}^q y_j \log \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} \\ &= \sum_{j=1}^q y_j \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j\\ &= \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j, \end{aligned}$$

where the last step uses $\sum_{j=1}^q y_j = 1$. Taking the derivative with respect to $o_j$:

$$\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j.$$
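
A quick numerical check of this gradient (my own sketch, not from the book), comparing $\mathrm{softmax}(\mathbf{o}) - \mathbf{y}$ with a finite-difference approximation:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def loss(o, y):
    # Cross-entropy of softmax(o) against the one-hot label y.
    return -(y * np.log(softmax(o))).sum()

o = np.array([1.0, 2.0, 0.5])
y = np.array([0.0, 1.0, 0.0])

analytic = softmax(o) - y                 # the gradient derived above
numeric = np.zeros_like(o)
eps = 1e-6
for j in range(len(o)):
    step = np.zeros_like(o)
    step[j] = eps
    numeric[j] = (loss(o + step, y) - loss(o - step, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))   # True
```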

Others#

  • Introduction to the concept of layers
    • The input dimensionality (number of inputs) is d, and the output dimensionality is 1. Since the focus is on where the computation happens, layers are conventionally counted from the output layer rather than the input layer, so the linear regression model can be viewed as a neural network consisting of a single artificial neuron, i.e. a single-layer neural network.
    • For linear regression, each input is connected to each output, which is called a fully connected layer or dense layer.
  • Some interesting biological foundations, but they are omitted here. Here are some interesting English words I found.
    Dendrites (input terminals)
    Nucleus (CPU)
    Axon (output wire)
    Axon terminal (output terminal)
    Synapse
  • The chapter mentions a paper about reducing the cost of fully connected layers; I will take a look when I have time and write a follow-up blog post.
    https://openreview.net/pdf?id=rcQdycl0zyk
    BEYOND FULLY-CONNECTED LAYERS WITH QUATERNIONS: PARAMETERIZATION OF HYPERCOMPLEX MULTIPLICATIONS WITH 1/n PARAMETERS
  • The social scientist Duncan Luce invented the softmax function in 1959 in the context of choice models.

Exercise Summary#

When might it be better to use the analytical solution than stochastic gradient descent (SGD)? When does this method fail?

  • Answer: When the dataset is small, the analytical solution may be better than stochastic gradient descent. However, for large datasets computing the analytical solution can be very time-consuming and numerically ill-conditioned. Additionally, when the matrix $\mathbf X^\top \mathbf X$ is not invertible, the analytical solution does not exist; in that case regularization or numerical optimization methods are needed (a ridge-regression sketch is shown below). (Note: $\mathbf{w}^* = (\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf{y}$.)
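
As an illustration of the regularization point (my own sketch, not from the book), the ridge-regularized closed form $(\mathbf X^\top \mathbf X + \lambda \mathbf I)^{-1}\mathbf X^\top \mathbf y$ stays well-defined even when $\mathbf X^\top \mathbf X$ is singular:

```python
import numpy as np

# Minimal sketch (not from the book): ridge regression closed form,
# usable even when X^T X is singular (here: two identical feature columns).
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=(n, 1))
X = np.hstack([x, x])                  # duplicated column -> X^T X is singular
y = 3.0 * x[:, 0] + rng.normal(scale=0.1, size=n)

lam = 1e-3                             # made-up regularization strength
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w_ridge)                         # the weight is split across the two copies (~1.5 each)
```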

Assume that the additive noise $\epsilon$ follows the distribution $p(\epsilon) = \frac{1}{2} \exp(-|\epsilon|)$ (a Laplace, or double-exponential, distribution). Then the negative log-likelihood $-\log P(\mathbf y \mid \mathbf X)$ has no closed-form minimizer, and with SGD the gradient does not tend smoothly to zero near the stationary point but jumps abruptly.
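
To see why (my own working, following the same steps as the Gaussian case above):

$$-\log P(\mathbf y \mid \mathbf X) = \sum_{i=1}^n \left(\log 2 + \left|y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right|\right) = n\log 2 + \sum_{i=1}^n \left|y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right|,$$

so maximizing the likelihood amounts to minimizing the sum of absolute errors (an L1 loss), whose per-sample gradient with respect to $\mathbf{w}$ is $-\operatorname{sign}\left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)\mathbf{x}^{(i)}$: it flips sign across the optimum instead of shrinking smoothly to zero.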

  • The solution is to
    • Use a smooth loss function, such as MSE, Smooth L1 loss, etc.
    • Adjust the learning rate, gradually reducing the learning rate to make the parameter updates near the stationary point more stable.
    • Use momentum or adaptive learning rate optimization algorithms.

What happens if we initialize the weights to zero? Does the algorithm still work?

  • If all the weights of a multi-layer network are initialized to zero, then every neuron in a layer computes the same output, and therefore every neuron receives the same gradient and performs the same update. As a result, all neurons keep the same parameters and learn the same features: the symmetry is never broken. This throws away the main advantage of a neural network, namely that different units can learn different features, so zero initialization effectively renders the algorithm ineffective for such networks.

  • Logistic regression and neural networks call for different weight initializations. For logistic regression, the weights can be initialized to zero because it is a linear model and gradient descent can still update them. For neural networks, however, initializing the weights to zero causes the symmetry problem described above and prevents hidden units from learning different features, so it is best to initialize the weights randomly or with another scheme. A small numerical illustration is shown below.
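
A toy numerical illustration (my own, not from the book): initialize every weight in a layer to the same constant and observe that all hidden units receive identical gradients.

```python
import numpy as np

# Toy illustration (not from the book): with constant (e.g. zero) initialization,
# all hidden units compute the same output and receive identical gradients.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))      # 8 made-up samples, 3 features
y = rng.normal(size=(8, 1))

W1 = np.zeros((3, 4))            # hidden layer: 4 units, identical (zero) weights
b1 = np.zeros((1, 4))
W2 = np.ones((4, 1))             # output layer: identical weights
b2 = np.zeros((1, 1))

# Forward pass with a sigmoid hidden layer and squared loss.
H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))
y_hat = H @ W2 + b2
d_yhat = (y_hat - y) / len(X)                 # gradient of the mean squared error / 2
dW2 = H.T @ d_yhat
dZ1 = (d_yhat @ W2.T) * H * (1.0 - H)
dW1 = X.T @ dZ1

print(dW1)   # all 4 columns are identical -> the hidden units can never diverge
print(dW2)   # all 4 entries are identical, too
```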
