Artificial Neural Networks (ANNs) are a class of machine learning (ML) models inspired by real, biological neural networks. An ANN consists of multiple layers: at a minimum, an input layer, a hidden layer, and an output layer; in general, an ANN can have an arbitrary number of hidden layers. Each layer comprises a certain number of units, or neurons. Conceptually, each unit captures a part of the computation required for the network to operate. The output of each unit is modulated by an activation function, a nonlinear function intended to capture the notion of a neuron firing off an electrical impulse. At a high level, an ANN accepts an input vector and produces an output vector, and can thus be summarised as a real-valued vector function. ANNs form the foundational model of so-called deep learning.
Mathematically, there are several components of an ANN:
Let $\mathbf{x}$ denote the input vector and $\mathbf{\hat{y}}$ the output vector produced by the network. Let $\mathbf{y}$ denote the vector representing the true values (this becomes relevant when discussing training of a network). For convenience, let $p=|\mathbf{x}|$ denote the size of the input vector.
By convention, the input layer is typically excluded when counting the number of layers in an ANN (this makes sense, as it is simply a vector in this treatment of ANNs). So, a two-layer ANN consists of the input "layer", a single hidden layer, and the output layer. The size of a layer refers to the number of units (or, extending the biological analogy, neurons) comprising the layer. For such a network, suppose that the sizes of the hidden and output layers are $k,c\in\mathbb{N}$, respectively. Then this network has two weight matrices, $\mathbf{W_1}\in\mathbb{R}^{k\times p}$ and $\mathbf{W_2}\in\mathbb{R}^{c\times k}$. Likewise, there are also two offset vectors, $\mathbf{b_1}\in\mathbb{R}^k$ and $\mathbf{b_2}\in\mathbb{R}^c$. Finally, there are also (in principle) two activation functions, $h_1:\mathbb{R}^k\rightarrow\mathbb{R}^k$ and $h_2:\mathbb{R}^c\rightarrow\mathbb{R}^c$, applied to the result of each layer's affine transformation. Then the output for this network is:
$$ \mathbf{\hat{y}}=h_2\left(\mathbf{W_2}h_1\left(\mathbf{W_1}\mathbf{x}+\mathbf{b_1}\right)+\mathbf{b_2}\right) $$
With $\mathbf{\hat{y}}\in\mathbb{R}^{c}$.
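As an illustration, here is a minimal NumPy sketch of this two-layer computation. The layer sizes, the random initialisation, and the choice of ReLU for $h_1$ and softmax for $h_2$ are assumptions made purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, c = 4, 8, 3                  # input, hidden, and output sizes (arbitrary)
x = rng.normal(size=p)             # input vector

W1 = rng.normal(size=(k, p))       # hidden-layer weight matrix
b1 = np.zeros(k)                   # hidden-layer offset vector
W2 = rng.normal(size=(c, k))       # output-layer weight matrix
b2 = np.zeros(c)                   # output-layer offset vector

def relu(z):                       # assumed choice for h1, applied elementwise
    return np.maximum(0.0, z)

def softmax(z):                    # assumed choice for h2
    e = np.exp(z - z.max())        # shift by the max for numerical stability
    return e / e.sum()

# y_hat = h2(W2 h1(W1 x + b1) + b2)
y_hat = softmax(W2 @ relu(W1 @ x + b1) + b2)
print(y_hat.shape)                 # (3,) -- i.e. a vector in R^c
```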
In general, for an $L$-layer ANN with layer sizes $U_1,\dots,U_{L}\in\mathbb{N}$ (taking $U_0=p$ to denote the size of the input), the output of the network obeys the following equation:
$$ \boxed{\mathbf{\hat{y}}=\mathbf{W}^{\left(L\right)}\mathbf{q}^{\left(L-1\right)}+\mathbf{b}^{\left(L\right)}} $$
With $\mathbf{\hat{y}}\in\mathbb{R}^{U_{L}}$ and where each layer's output is defined recursively, starting from $\mathbf{q}^{\left(0\right)}=\mathbf{x}$, as:
$$ \mathbf{q}^{\left(l\right)}=h^{\left(l\right)}\left(\mathbf{W}^{\left(l\right)}\mathbf{q}^{\left(l-1\right)}+\mathbf{b}^{\left(l\right)}\right) $$
With the dimension of a given layer's weight matrix, $\mathbf{W}^{\left(l\right)}$, being given by $U_l\times U_{l-1}\ \forall l\in\left[1,L\right]\cap\mathbb{N}$ and the offset vector, $\mathbf{b}^{\left(l\right)}$, having dimension $U_l\times 1$.
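The recursion above translates directly into a loop. Below is a hedged NumPy sketch of a general forward pass; the ReLU hidden activations, the layer sizes, and the random parameters are illustrative assumptions, and the final layer is left purely affine to match the boxed equation.

```python
import numpy as np

def forward(x, Ws, bs, hs):
    """Forward pass for an L-layer network.

    Ws[l-1] is W^(l) with shape (U_l, U_{l-1}), bs[l-1] is b^(l) with shape (U_l,),
    and hs[l-1] is the activation h^(l) for each hidden layer l = 1, ..., L-1.
    The output layer is affine only, as in the boxed equation above.
    """
    q = x                                   # q^(0) = x
    for W, b, h in zip(Ws[:-1], bs[:-1], hs):
        q = h(W @ q + b)                    # q^(l) = h^(l)(W^(l) q^(l-1) + b^(l))
    return Ws[-1] @ q + bs[-1]              # y_hat = W^(L) q^(L-1) + b^(L)

# Example: L = 3 (two hidden layers plus the output layer); sizes are arbitrary.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]                        # [U_0 (= p), U_1, U_2, U_3]
Ws = [rng.normal(size=(sizes[l], sizes[l - 1])) for l in range(1, len(sizes))]
bs = [np.zeros(sizes[l]) for l in range(1, len(sizes))]
relu = lambda z: np.maximum(0.0, z)         # assumed hidden-layer activation
y_hat = forward(rng.normal(size=sizes[0]), Ws, bs, [relu, relu])
print(y_hat.shape)                          # (3,) -- i.e. R^{U_L}
```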
The activation function of a layer within an ANN modulates the response of the neurons at that layer, introducing non-linearity into the model; without it, the composition of layers would collapse into a single affine transformation. Layers can have differing activation functions. Some common choices for activation functions are enumerated below.
First is the well-known logistic function (ML texts often refer to this as the sigmoid function, due to the shape of its curve):
$$ f:\mathbb{R}^n\rightarrow\mathbb{R}^n $$
$$ f\left(\mathbf{x}\right)=\frac{1}{1+e^{-\mathbf{x}}} $$
The softmax function normalises a vector into a probability distribution and is a common choice for the output layer in classification tasks.
$$ \text{Softmax}:\mathbb{R}^n\rightarrow\mathbb{R}^n $$
$$ \text{Softmax}\left(\mathbf{x}\right)=\frac{e^{\mathbf{x}}}{\mathbf{1}^Te^{\mathbf{x}}}=\frac{e^{\mathbf{x}}}{\sum_j{e^{x_j}}} $$
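In practice, softmax is usually evaluated with the largest entry subtracted from $\mathbf{x}$ first; the shift cancels in the ratio, so the result is unchanged, but it prevents the exponentials from overflowing. A small NumPy sketch (the test inputs are arbitrary):

```python
import numpy as np

def softmax(x):
    shifted = np.exp(x - np.max(x))     # e^{x - max(x)}; the shift cancels in the ratio
    return shifted / shifted.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))           # sums to 1
print(softmax(np.array([1000.0, 1001.0, 1002.0])))  # identical result; a naive exp(x) would overflow
```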
Rectified Linear Unit (ReLU).
$$ \text{ReLU}:\mathbb{R}^n\rightarrow\left[0, \infty\right)^n $$
$$ \text{ReLU}\left(\mathbf{x}\right)=\text{max}\left(0, \mathbf{x}\right) $$
Gaussian Error Linear Unit (GELU), where $\Phi$ denotes the cumulative distribution function of the standard normal distribution, applied elementwise.
$$ \text{GELU}:\mathbb{R}^n\rightarrow\mathbb{R}^n $$
$$ \text{GELU}\left(\mathbf{x}\right)=\mathbf{x}\cdot\Phi\left(\mathbf{x}\right) $$
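The sigmoid, ReLU, and GELU all act elementwise, so they are one-liners in NumPy. The sketch below uses `scipy.special.erf` to evaluate $\Phi$ exactly; note that some implementations instead use a tanh-based approximation of GELU. The test vector is arbitrary.

```python
import numpy as np
from scipy.special import erf

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))             # logistic function, elementwise

def relu(x):
    return np.maximum(0.0, x)                   # max(0, x), elementwise

def gelu(x):
    Phi = 0.5 * (1.0 + erf(x / np.sqrt(2.0)))   # standard normal CDF
    return x * Phi

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))
print(relu(z))
print(gelu(z))
```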
ANNs require training. The most common training algorithm for ANNs is called backpropagation. Backpropagation comprises two phases, called passes: the forward pass and the backward pass. In the forward pass, the output of the network is computed and the loss against the true values is calculated. In the backward pass, the gradient of the loss with respect to each weight and bias is computed by applying the chain rule backwards through the layers, and the weights and biases are then adjusted via gradient descent in order to reduce the loss.
There are two steps to the forward pass: first, compute the network's prediction, $\hat{Y}$, from the input; second, compute the loss between the prediction and the true values, $Y$.
For example, for a 2-layer network with a ReLU hidden layer and a softmax output layer, operating on a batch of inputs $X$ (one input per row, so the weight matrices multiply on the right and the softmax is applied row-wise), we have:
$$ \hat{Y}=\text{Softmax}\left(\text{ReLU}\left(XW_1 + b_1\right)W_2 + b_2\right) $$
$$ \text{Loss}\left(\hat{Y}, Y\right)=-\sum_{i,j}{Y_{ij}\log{\left(\hat{Y}_{ij}\right)}} $$
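To make the two passes concrete, here is a hedged NumPy sketch of a single training step for the two-layer ReLU/softmax network above: the forward pass computes $\hat{Y}$ and the cross-entropy loss, the backward pass computes the gradients via the chain rule, and the parameters are then updated by one step of gradient descent. The batch of data, the layer sizes, the initialisation scale, and the learning rate are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, k, c = 32, 4, 8, 3                      # batch size and layer sizes (arbitrary)
X = rng.normal(size=(N, p))                   # one input per row
Y = np.eye(c)[rng.integers(0, c, size=N)]     # one-hot true labels (synthetic, for illustration)

W1, b1 = 0.1 * rng.normal(size=(p, k)), np.zeros(k)
W2, b2 = 0.1 * rng.normal(size=(k, c)), np.zeros(c)
lr = 0.01                                     # learning rate (assumed)

# --- forward pass ---
Z1 = X @ W1 + b1                              # hidden-layer pre-activation
H = np.maximum(0.0, Z1)                       # ReLU
Z2 = H @ W2 + b2                              # output-layer pre-activation
Z2 -= Z2.max(axis=1, keepdims=True)           # shift each row for numerical stability
Y_hat = np.exp(Z2) / np.exp(Z2).sum(axis=1, keepdims=True)   # row-wise softmax
loss = -np.sum(Y * np.log(Y_hat))             # cross-entropy loss, as above

# --- backward pass (chain rule) ---
dZ2 = Y_hat - Y                               # gradient of softmax + cross-entropy w.r.t. Z2
dW2, db2 = H.T @ dZ2, dZ2.sum(axis=0)
dH = dZ2 @ W2.T
dZ1 = dH * (Z1 > 0)                           # ReLU derivative: 1 where Z1 > 0, else 0
dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)

# --- gradient-descent update ---
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print(f"loss after the forward pass: {loss:.4f}")
```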
1: In ML literature, these are usually called bias vectors; however, these aren't true biases in the statistical sense.